<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/block, branch linux-3.10.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>sg_write()/bsg_write() is not fit to be called under KERNEL_DS</title>
<updated>2017-06-20T12:03:25+00:00</updated>
<author>
<name>Al Viro</name>
<email>viro@zeniv.linux.org.uk</email>
</author>
<published>2016-12-16T18:42:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=823a2a0330f25ccea8ef64ee4a378e37ca51361c'/>
<id>823a2a0330f25ccea8ef64ee4a378e37ca51361c</id>
<content type='text'>
commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those.  Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those.  Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.

Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: fix del_gendisk() vs blkdev_ioctl crash</title>
<updated>2017-06-20T06:02:37+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2015-12-29T22:02:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1dd3d3e635c3b9538d693a616b150ddeb3662963'/>
<id>1dd3d3e635c3b9538d693a616b150ddeb3662963</id>
<content type='text'>
commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc upstream.

When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.

The nvdimm_revalidate_disk() implementation depends on
disk-&gt;driverfs_dev to be valid at entry.  However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.

There's no reason for del_gendisk() to clear -&gt;driverfs_dev.  That
device is the parent of the disk.  It is guaranteed to not be freed
until the disk, as a child, drops its -&gt;parent reference.

We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)-&gt;parent, but lets fix it globally since
-&gt;driverfs_dev follows the lifetime of the parent.  Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.

 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [&lt;ffffffffa00340a8&gt;] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
 CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
 [..]
 Call Trace:
  [&lt;ffffffff8143e5c7&gt;] rescan_partitions+0x87/0x2c0
  [&lt;ffffffff810f37f9&gt;] ? __lock_is_held+0x49/0x70
  [&lt;ffffffff81438c62&gt;] __blkdev_reread_part+0x72/0xb0
  [&lt;ffffffff81438cc5&gt;] blkdev_reread_part+0x25/0x40
  [&lt;ffffffff8143982d&gt;] blkdev_ioctl+0x4fd/0x9c0
  [&lt;ffffffff811246c9&gt;] ? current_kernel_time64+0x69/0xd0
  [&lt;ffffffff812916dd&gt;] block_ioctl+0x3d/0x50
  [&lt;ffffffff81264c38&gt;] do_vfs_ioctl+0x308/0x560
  [&lt;ffffffff8115dbd1&gt;] ? __audit_syscall_entry+0xb1/0x100
  [&lt;ffffffff810031d6&gt;] ? do_audit_syscall_entry+0x66/0x70
  [&lt;ffffffff81264f09&gt;] SyS_ioctl+0x79/0x90
  [&lt;ffffffff81902672&gt;] entry_SYSCALL_64_fastpath+0x12/0x76

Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jens Axboe &lt;axboe@fb.com&gt;
Reported-by: Robert Hu &lt;robert.hu@intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc upstream.

When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.

The nvdimm_revalidate_disk() implementation depends on
disk-&gt;driverfs_dev to be valid at entry.  However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.

There's no reason for del_gendisk() to clear -&gt;driverfs_dev.  That
device is the parent of the disk.  It is guaranteed to not be freed
until the disk, as a child, drops its -&gt;parent reference.

We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)-&gt;parent, but lets fix it globally since
-&gt;driverfs_dev follows the lifetime of the parent.  Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.

 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [&lt;ffffffffa00340a8&gt;] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
 CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
 [..]
 Call Trace:
  [&lt;ffffffff8143e5c7&gt;] rescan_partitions+0x87/0x2c0
  [&lt;ffffffff810f37f9&gt;] ? __lock_is_held+0x49/0x70
  [&lt;ffffffff81438c62&gt;] __blkdev_reread_part+0x72/0xb0
  [&lt;ffffffff81438cc5&gt;] blkdev_reread_part+0x25/0x40
  [&lt;ffffffff8143982d&gt;] blkdev_ioctl+0x4fd/0x9c0
  [&lt;ffffffff811246c9&gt;] ? current_kernel_time64+0x69/0xd0
  [&lt;ffffffff812916dd&gt;] block_ioctl+0x3d/0x50
  [&lt;ffffffff81264c38&gt;] do_vfs_ioctl+0x308/0x560
  [&lt;ffffffff8115dbd1&gt;] ? __audit_syscall_entry+0xb1/0x100
  [&lt;ffffffff810031d6&gt;] ? do_audit_syscall_entry+0x66/0x70
  [&lt;ffffffff81264f09&gt;] SyS_ioctl+0x79/0x90
  [&lt;ffffffff81902672&gt;] entry_SYSCALL_64_fastpath+0x12/0x76

Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Jens Axboe &lt;axboe@fb.com&gt;
Reported-by: Robert Hu &lt;robert.hu@intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: allow WRITE_SAME commands with the SG_IO ioctl</title>
<updated>2017-06-20T06:02:37+00:00</updated>
<author>
<name>Mauricio Faria de Oliveira</name>
<email>mauricfo@linux.vnet.ibm.com</email>
</author>
<published>2017-03-25T16:18:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=5cb0174119e4743c9f4b8f17115a8a2386c0f4a2'/>
<id>5cb0174119e4743c9f4b8f17115a8a2386c0f4a2</id>
<content type='text'>
commit 25cdb64510644f3e854d502d69c73f21c6df88a9 upstream.

The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -&gt; blk_fill_sghdr_rq() &gt; blk_verify_command() -&gt; -EPERM ]

The problem can be reproduced with the sg_write_same command

  # sg_write_same --num 1 --xferlen 512 /dev/sda
  #

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
  #

For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
  #

So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
  #

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "&lt;disk type='block' device='lun'&gt;" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

  [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
  [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
  [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
  [...] blk_update_request: I/O error, dev sda, sector 17096824

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -&gt; 'device')

Signed-off-by: Mauricio Faria de Oliveira &lt;mauricfo@linux.vnet.ibm.com&gt;
Signed-off-by: Brahadambal Srinivasan &lt;latha@linux.vnet.ibm.com&gt;
Reported-by: Manjunatha H R &lt;manjuhr1@in.ibm.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sumit Semwal &lt;sumit.semwal@linaro.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 25cdb64510644f3e854d502d69c73f21c6df88a9 upstream.

The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -&gt; blk_fill_sghdr_rq() &gt; blk_verify_command() -&gt; -EPERM ]

The problem can be reproduced with the sg_write_same command

  # sg_write_same --num 1 --xferlen 512 /dev/sda
  #

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
  #

For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
  #

So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
  #

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "&lt;disk type='block' device='lun'&gt;" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

  [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
  [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
  [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
  [...] blk_update_request: I/O error, dev sda, sector 17096824

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -&gt; 'device')

Signed-off-by: Mauricio Faria de Oliveira &lt;mauricfo@linux.vnet.ibm.com&gt;
Signed-off-by: Brahadambal Srinivasan &lt;latha@linux.vnet.ibm.com&gt;
Reported-by: Manjunatha H R &lt;manjuhr1@in.ibm.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Signed-off-by: Sumit Semwal &lt;sumit.semwal@linaro.org&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cfq: fix starvation of asynchronous writes</title>
<updated>2017-02-10T10:03:58+00:00</updated>
<author>
<name>Glauber Costa</name>
<email>glauber@scylladb.com</email>
</author>
<published>2016-09-23T00:59:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=5dd23e675ad85931045252de3d2688e88435b287'/>
<id>5dd23e675ad85931045252de3d2688e88435b287</id>
<content type='text'>
commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.

While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.

Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:

  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 &lt;- (8,1) 141696240
  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
  &lt;==== 'I' means Inserted (into the IO scheduler) ===================================&gt;
  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
  &lt;==== Only 15s later the CFQ scheduler dispatches the request ======================&gt;

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)

Care is taken during preempt to still allow rt requests to preempt us
regardless.

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 |                                                     11
     1 |@@@@                                                161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
     4 |@                                                    54
     8 |                                                     36
    16 |                                                      7
    32 |                                                      0
    64 |                                                      0
       ~
  1024 |                                                      0
  2048 |                                                      0
  4096 |                                                      1
  8192 |                                                      1
 16384 |                                                      2
 32768 |                                                      0
 65536 |                                                      0
131072 |                                                      1
262144 |                                                      0
524288 |                                                      0

Signed-off-by: Glauber Costa &lt;glauber@scylladb.com&gt;
CC: Jens Axboe &lt;axboe@kernel.dk&gt;
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa &lt;glauber@scylladb.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 3932a86b4b9d1f0b049d64d4591ce58ad18b44ec upstream.

While debugging timeouts happening in my application workload (ScyllaDB), I have
observed calls to open() taking a long time, ranging everywhere from 2 seconds -
the first ones that are enough to time out my application - to more than 30
seconds.

The problem seems to happen because XFS may block on pending metadata updates
under certain circumnstances, and that's confirmed with the following backtrace
taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

Inspecting my run with blktrace, I can see that the xfsaild kthread exhibit very
high "Dispatch wait" times, on the dozens of seconds range and consistent with
the open() times I have saw in that run.

Still from the blktrace output, we can after searching a bit, identify the
request that wasn't dispatched:

  8,0   11      152    81.092472813   804  A  WM 141698288 + 8 &lt;- (8,1) 141696240
  8,0   11      153    81.092472889   804  Q  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      154    81.092473207   804  G  WM 141698288 + 8 [xfsaild/sda1]
  8,0   11      206    81.092496118   804  I  WM 141698288 + 8 (   22911) [xfsaild/sda1]
  &lt;==== 'I' means Inserted (into the IO scheduler) ===================================&gt;
  8,0    0   289372    96.718761435     0  D  WM 141698288 + 8 (15626265317) [swapper/0]
  &lt;==== Only 15s later the CFQ scheduler dispatches the request ======================&gt;

As we can see above, in this particular example CFQ took 15 seconds to dispatch
this request. Going back to the full trace, we can see that the xfsaild queue
had plenty of opportunity to run, and it was selected as the active queue many
times. It would just always be preempted by something else (example):

  8,0    1        0    81.117912979     0  m   N cfq1618SN / insert_request
  8,0    1        0    81.117913419     0  m   N cfq1618SN / add_to_rr
  8,0    1        0    81.117914044     0  m   N cfq1618SN / preempt
  8,0    1        0    81.117914398     0  m   N cfq767A  / slice expired t=1
  8,0    1        0    81.117914755     0  m   N cfq767A  / resid=40
  8,0    1        0    81.117915340     0  m   N / served: vt=1948520448 min_vt=1948520448
  8,0    1        0    81.117915858     0  m   N cfq767A  / sl_used=1 disp=0 charge=0 iops=1 sect=0

where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
IO dispatchers.

The requests preempting the xfsaild queue are synchronous requests. That's a
characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
While it can be argued that preempting ASYNC requests in favor of SYNC is part
of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
goal.

Moreover, unless I am misunderstanding something, that breaks the expectation
set by the "fifo_expire_async" tunable, which in my system is set to the
default.

Looking at the code, it seems to me that the issue is that after we make
an async queue active, there is no guarantee that it will execute any request.

When the queue itself tests if it cfq_may_dispatch() it can bail if it sees SYNC
requests in flight. An incoming request from another queue can also preempt it
in such situation before we have the chance to execute anything (as seen in the
trace above).

This patch sets the must_dispatch flag if we notice that we have requests
that are already fifo_expired. This flag is always cleared after
cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
the queue for subsequent requests (unless they are themselves expired)

Care is taken during preempt to still allow rt requests to preempt us
regardless.

Testing my workload with this patch applied produces much better results.
From the application side I see no timeouts, and the open() latency histogram
generated by systemtap looks much better, with the worst outlier at 131ms:

Latency histogram of xfs_buf_lock acquisition (microseconds):
 value |-------------------------------------------------- count
     0 |                                                     11
     1 |@@@@                                                161
     2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  1966
     4 |@                                                    54
     8 |                                                     36
    16 |                                                      7
    32 |                                                      0
    64 |                                                      0
       ~
  1024 |                                                      0
  2048 |                                                      0
  4096 |                                                      1
  8192 |                                                      1
 16384 |                                                      2
 32768 |                                                      0
 65536 |                                                      0
131072 |                                                      1
262144 |                                                      0
524288 |                                                      0

Signed-off-by: Glauber Costa &lt;glauber@scylladb.com&gt;
CC: Jens Axboe &lt;axboe@kernel.dk&gt;
CC: linux-block@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Glauber Costa &lt;glauber@scylladb.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: fix use-after-free in seq file</title>
<updated>2016-08-27T09:40:34+00:00</updated>
<author>
<name>Vegard Nossum</name>
<email>vegard.nossum@oracle.com</email>
</author>
<published>2016-07-29T08:40:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=23cf0b7eeda4777d0bac40f05b3ce3c62e34c957'/>
<id>23cf0b7eeda4777d0bac40f05b3ce3c62e34c957</id>
<content type='text'>
commit 77da160530dd1dc94f6ae15a981f24e5f0021e84 upstream.

I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [&lt;ffffffff81d6ce81&gt;] dump_stack+0x65/0x84
     [&lt;ffffffff8146c7bd&gt;] print_trailer+0x10d/0x1a0
     [&lt;ffffffff814704ff&gt;] object_err+0x2f/0x40
     [&lt;ffffffff814754d1&gt;] kasan_report_error+0x221/0x520
     [&lt;ffffffff8147590e&gt;] __asan_report_load8_noabort+0x3e/0x40
     [&lt;ffffffff83888161&gt;] klist_iter_exit+0x61/0x70
     [&lt;ffffffff82404389&gt;] class_dev_iter_exit+0x9/0x10
     [&lt;ffffffff81d2e8ea&gt;] disk_seqf_stop+0x3a/0x50
     [&lt;ffffffff8151f812&gt;] seq_read+0x4b2/0x11a0
     [&lt;ffffffff815f8fdc&gt;] proc_reg_read+0xbc/0x180
     [&lt;ffffffff814b24e4&gt;] do_loop_readv_writev+0x134/0x210
     [&lt;ffffffff814b4c45&gt;] do_readv_writev+0x565/0x660
     [&lt;ffffffff814b8a17&gt;] vfs_readv+0x67/0xa0
     [&lt;ffffffff814b8de6&gt;] do_preadv+0x126/0x170
     [&lt;ffffffff814b92ec&gt;] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
 - pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf-&gt;private = iter
    - .seq_stop()
       - kfree(seqf-&gt;private)
 - pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf-&gt;private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 77da160530dd1dc94f6ae15a981f24e5f0021e84 upstream.

I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [&lt;ffffffff81d6ce81&gt;] dump_stack+0x65/0x84
     [&lt;ffffffff8146c7bd&gt;] print_trailer+0x10d/0x1a0
     [&lt;ffffffff814704ff&gt;] object_err+0x2f/0x40
     [&lt;ffffffff814754d1&gt;] kasan_report_error+0x221/0x520
     [&lt;ffffffff8147590e&gt;] __asan_report_load8_noabort+0x3e/0x40
     [&lt;ffffffff83888161&gt;] klist_iter_exit+0x61/0x70
     [&lt;ffffffff82404389&gt;] class_dev_iter_exit+0x9/0x10
     [&lt;ffffffff81d2e8ea&gt;] disk_seqf_stop+0x3a/0x50
     [&lt;ffffffff8151f812&gt;] seq_read+0x4b2/0x11a0
     [&lt;ffffffff815f8fdc&gt;] proc_reg_read+0xbc/0x180
     [&lt;ffffffff814b24e4&gt;] do_loop_readv_writev+0x134/0x210
     [&lt;ffffffff814b4c45&gt;] do_readv_writev+0x565/0x660
     [&lt;ffffffff814b8a17&gt;] vfs_readv+0x67/0xa0
     [&lt;ffffffff814b8de6&gt;] do_preadv+0x126/0x170
     [&lt;ffffffff814b92ec&gt;] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
 - pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf-&gt;private = iter
    - .seq_stop()
       - kfree(seqf-&gt;private)
 - pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf-&gt;private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum &lt;vegard.nossum@oracle.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mac: validate mac_partition is within sector</title>
<updated>2016-03-03T23:06:21+00:00</updated>
<author>
<name>Kees Cook</name>
<email>keescook@chromium.org</email>
</author>
<published>2015-11-20T01:18:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2a27f61bd411e564eb4651c18d225f6e9e1de534'/>
<id>2a27f61bd411e564eb4651c18d225f6e9e1de534</id>
<content type='text'>
commit 02e2a5bfebe99edcf9d694575a75032d53fe1b73 upstream.

If md-&gt;signature == MAC_DRIVER_MAGIC and md-&gt;block_size == 1023, a single
512 byte sector would be read (secsize / 512). However the partition
structure would be located past the end of the buffer (secsize % 512).

Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 02e2a5bfebe99edcf9d694575a75032d53fe1b73 upstream.

If md-&gt;signature == MAC_DRIVER_MAGIC and md-&gt;block_size == 1023, a single
512 byte sector would be read (secsize / 512). However the partition
structure would be located past the end of the buffer (secsize % 512).

Signed-off-by: Kees Cook &lt;keescook@chromium.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>SCSI: Fix NULL pointer dereference in runtime PM</title>
<updated>2016-02-25T19:57:47+00:00</updated>
<author>
<name>Ken Xue</name>
<email>ken.xue@amd.com</email>
</author>
<published>2015-12-01T06:45:46+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2bfa7bba55afcfbd887de08f936c3186ca2fae82'/>
<id>2bfa7bba55afcfbd887de08f936c3186ca2fae82</id>
<content type='text'>
commit 4fd41a8552afc01054d9d9fc7f1a63c324867d27 upstream.

The routines in scsi_pm.c assume that if a runtime-PM callback is
invoked for a SCSI device, it can only mean that the device's driver
has asked the block layer to handle the runtime power management (by
calling blk_pm_runtime_init(), which among other things sets q-&gt;dev).

However, this assumption turns out to be wrong for things like the ses
driver.  Normally ses devices are not allowed to do runtime PM, but
userspace can override this setting.  If this happens, the kernel gets
a NULL pointer dereference when blk_post_runtime_resume() tries to use
the uninitialized q-&gt;dev pointer.

This patch fixes the problem by checking q-&gt;dev in block layer before
handle runtime PM. Since ses doesn't define any PM callbacks and call
blk_pm_runtime_init(), the crash won't occur.

This fixes Bugzilla #101371.
https://bugzilla.kernel.org/show_bug.cgi?id=101371

More discussion can be found from below link.
http://marc.info/?l=linux-scsi&amp;m=144163730531875&amp;w=2

Signed-off-by: Ken Xue &lt;Ken.Xue@amd.com&gt;
Acked-by: Alan Stern &lt;stern@rowland.harvard.edu&gt;
Cc: Xiangliang Yu &lt;Xiangliang.Yu@amd.com&gt;
Cc: James E.J. Bottomley &lt;JBottomley@odin.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Michael Terry &lt;Michael.terry@canonical.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 4fd41a8552afc01054d9d9fc7f1a63c324867d27 upstream.

The routines in scsi_pm.c assume that if a runtime-PM callback is
invoked for a SCSI device, it can only mean that the device's driver
has asked the block layer to handle the runtime power management (by
calling blk_pm_runtime_init(), which among other things sets q-&gt;dev).

However, this assumption turns out to be wrong for things like the ses
driver.  Normally ses devices are not allowed to do runtime PM, but
userspace can override this setting.  If this happens, the kernel gets
a NULL pointer dereference when blk_post_runtime_resume() tries to use
the uninitialized q-&gt;dev pointer.

This patch fixes the problem by checking q-&gt;dev in block layer before
handle runtime PM. Since ses doesn't define any PM callbacks and call
blk_pm_runtime_init(), the crash won't occur.

This fixes Bugzilla #101371.
https://bugzilla.kernel.org/show_bug.cgi?id=101371

More discussion can be found from below link.
http://marc.info/?l=linux-scsi&amp;m=144163730531875&amp;w=2

Signed-off-by: Ken Xue &lt;Ken.Xue@amd.com&gt;
Acked-by: Alan Stern &lt;stern@rowland.harvard.edu&gt;
Cc: Xiangliang Yu &lt;Xiangliang.Yu@amd.com&gt;
Cc: James E.J. Bottomley &lt;JBottomley@odin.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Michael Terry &lt;Michael.terry@canonical.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>blkcg: fix gendisk reference leak in blkg_conf_prep()</title>
<updated>2015-08-10T19:20:30+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-07-22T22:05:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=74f412afcff28566ef240dc2a0f0af3994201fad'/>
<id>74f412afcff28566ef240dc2a0f0af3994201fad</id>
<content type='text'>
commit 5f6c2d2b7dbb541c1e922538c49fa04c494ae3d7 upstream.

When a blkcg configuration is targeted to a partition rather than a
whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case.  Fix it.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 5f6c2d2b7dbb541c1e922538c49fa04c494ae3d7 upstream.

When a blkcg configuration is targeted to a partition rather than a
whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case.  Fix it.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>block: fix ext_dev_lock lockdep report</title>
<updated>2015-06-22T23:55:52+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2015-06-11T03:47:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=96ebd8584aaa9b0f4eac8a922766c900a7495426'/>
<id>96ebd8584aaa9b0f4eac8a922766c900a7495426</id>
<content type='text'>
commit 4d66e5e9b6d720d8463e11d027bd4ad91c8b1318 upstream.

 =================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -&gt; {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [&lt;ffffffff8143a60c&gt;] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [&lt;ffffffff810bf6b1&gt;] __lock_acquire+0x461/0x1e70
   [&lt;ffffffff810c1947&gt;] lock_acquire+0xb7/0x290
   [&lt;ffffffff818ac3a8&gt;] _raw_spin_lock+0x38/0x50
   [&lt;ffffffff8143a07d&gt;] blk_alloc_devt+0x6d/0xd0  &lt;-- take the lock in process context
[..]
  [&lt;ffffffff810bf64e&gt;] __lock_acquire+0x3fe/0x1e70
  [&lt;ffffffff810c00ad&gt;] ? __lock_acquire+0xe5d/0x1e70
  [&lt;ffffffff810c1947&gt;] lock_acquire+0xb7/0x290
  [&lt;ffffffff8143a60c&gt;] ? blk_free_devt+0x3c/0x70
  [&lt;ffffffff818ac3a8&gt;] _raw_spin_lock+0x38/0x50
  [&lt;ffffffff8143a60c&gt;] ? blk_free_devt+0x3c/0x70
  [&lt;ffffffff8143a60c&gt;] blk_free_devt+0x3c/0x70    &lt;-- take the lock in softirq
  [&lt;ffffffff8143bfec&gt;] part_release+0x1c/0x50
  [&lt;ffffffff8158edf6&gt;] device_release+0x36/0xb0
  [&lt;ffffffff8145ac2b&gt;] kobject_cleanup+0x7b/0x1a0
  [&lt;ffffffff8145aad0&gt;] kobject_put+0x30/0x70
  [&lt;ffffffff8158f147&gt;] put_device+0x17/0x20
  [&lt;ffffffff8143c29c&gt;] delete_partition_rcu_cb+0x16c/0x180
  [&lt;ffffffff8143c130&gt;] ? read_dev_sector+0xa0/0xa0
  [&lt;ffffffff810e0e0f&gt;] rcu_process_callbacks+0x2ff/0xa90
  [&lt;ffffffff810e0dcf&gt;] ? rcu_process_callbacks+0x2bf/0xa90
  [&lt;ffffffff81067e2e&gt;] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092dda1 are
candidates for -stable.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Reported-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 4d66e5e9b6d720d8463e11d027bd4ad91c8b1318 upstream.

 =================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -&gt; {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [&lt;ffffffff8143a60c&gt;] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [&lt;ffffffff810bf6b1&gt;] __lock_acquire+0x461/0x1e70
   [&lt;ffffffff810c1947&gt;] lock_acquire+0xb7/0x290
   [&lt;ffffffff818ac3a8&gt;] _raw_spin_lock+0x38/0x50
   [&lt;ffffffff8143a07d&gt;] blk_alloc_devt+0x6d/0xd0  &lt;-- take the lock in process context
[..]
  [&lt;ffffffff810bf64e&gt;] __lock_acquire+0x3fe/0x1e70
  [&lt;ffffffff810c00ad&gt;] ? __lock_acquire+0xe5d/0x1e70
  [&lt;ffffffff810c1947&gt;] lock_acquire+0xb7/0x290
  [&lt;ffffffff8143a60c&gt;] ? blk_free_devt+0x3c/0x70
  [&lt;ffffffff818ac3a8&gt;] _raw_spin_lock+0x38/0x50
  [&lt;ffffffff8143a60c&gt;] ? blk_free_devt+0x3c/0x70
  [&lt;ffffffff8143a60c&gt;] blk_free_devt+0x3c/0x70    &lt;-- take the lock in softirq
  [&lt;ffffffff8143bfec&gt;] part_release+0x1c/0x50
  [&lt;ffffffff8158edf6&gt;] device_release+0x36/0xb0
  [&lt;ffffffff8145ac2b&gt;] kobject_cleanup+0x7b/0x1a0
  [&lt;ffffffff8145aad0&gt;] kobject_put+0x30/0x70
  [&lt;ffffffff8158f147&gt;] put_device+0x17/0x20
  [&lt;ffffffff8143c29c&gt;] delete_partition_rcu_cb+0x16c/0x180
  [&lt;ffffffff8143c130&gt;] ? read_dev_sector+0xa0/0xa0
  [&lt;ffffffff810e0e0f&gt;] rcu_process_callbacks+0x2ff/0xa90
  [&lt;ffffffff810e0dcf&gt;] ? rcu_process_callbacks+0x2bf/0xa90
  [&lt;ffffffff81067e2e&gt;] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092dda1 are
candidates for -stable.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Reported-by: NeilBrown &lt;neilb@suse.de&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>blk-throttle: check stats_cpu before reading it from sysfs</title>
<updated>2015-03-06T22:40:54+00:00</updated>
<author>
<name>Thadeu Lima de Souza Cascardo</name>
<email>cascardo@linux.vnet.ibm.com</email>
</author>
<published>2015-02-16T19:16:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=65e63ea91b18e634c53e68a846608fcf8d571418'/>
<id>65e63ea91b18e634c53e68a846608fcf8d571418</id>
<content type='text'>
commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 &lt;SF,HV,EE,ME,IR,DR,RI&gt;  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 &lt;7cead02a&gt; e9090008 e9490010 e9290018

And here is one code that allows to easily reproduce this, although this
has first been found by running docker.

void run(pid_t pid)
{
	int n;
	int status;
	int fd;
	char *buffer;
	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
	fd = open(CGPATH "/test/tasks", O_WRONLY);
	write(fd, buffer, n);
	close(fd);
	if (fork() &gt; 0) {
		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
		read(fd, buffer, 512);
		close(fd);
		wait(&amp;status);
	} else {
		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
		n = read(fd, buffer, BUFFER_SIZE);
		close(fd);
	}
	free(buffer);
	exit(0);
}

void test(void)
{
	int status;
	mkdir(CGPATH "/test", 0666);
	if (fork() &gt; 0)
		wait(&amp;status);
	else
		run(getpid());
	rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i &lt; NR_TESTS; i++)
		test();
	return 0;
}

Reported-by: Ricardo Marin Matinata &lt;rmm@br.ibm.com&gt;
Signed-off-by: Thadeu Lima de Souza Cascardo &lt;cascardo@linux.vnet.ibm.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 045c47ca306acf30c740c285a77a4b4bda6be7c5 upstream.

When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.

Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.

[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300   Not tainted  (3.19.0)
[ 1137.734230] MSR: 9000000000009032 &lt;SF,HV,EE,ME,IR,DR,RI&gt;  CR: 42008884  XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 &lt;7cead02a&gt; e9090008 e9490010 e9290018

And here is one code that allows to easily reproduce this, although this
has first been found by running docker.

void run(pid_t pid)
{
	int n;
	int status;
	int fd;
	char *buffer;
	buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
	n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
	fd = open(CGPATH "/test/tasks", O_WRONLY);
	write(fd, buffer, n);
	close(fd);
	if (fork() &gt; 0) {
		fd = open("/dev/sda", O_RDONLY | O_DIRECT);
		read(fd, buffer, 512);
		close(fd);
		wait(&amp;status);
	} else {
		fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
		n = read(fd, buffer, BUFFER_SIZE);
		close(fd);
	}
	free(buffer);
	exit(0);
}

void test(void)
{
	int status;
	mkdir(CGPATH "/test", 0666);
	if (fork() &gt; 0)
		wait(&amp;status);
	else
		run(getpid());
	rmdir(CGPATH "/test");
}

int main(int argc, char **argv)
{
	int i;
	for (i = 0; i &lt; NR_TESTS; i++)
		test();
	return 0;
}

Reported-by: Ricardo Marin Matinata &lt;rmm@br.ibm.com&gt;
Signed-off-by: Thadeu Lima de Souza Cascardo &lt;cascardo@linux.vnet.ibm.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
