<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/block/blk-sysfs.c, branch linux-4.6.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros</title>
<updated>2016-04-04T17:41:08+00:00</updated>
<author>
<name>Kirill A. Shutemov</name>
<email>kirill.shutemov@linux.intel.com</email>
</author>
<published>2016-04-01T12:29:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=09cbfeaf1a5a67bfb3201e0c83c810cecb2efa5a'/>
<id>09cbfeaf1a5a67bfb3201e0c83c810cecb2efa5a</id>
<content type='text'>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - &lt;foo&gt; &lt;&lt; (PAGE_CACHE_SHIFT - PAGE_SHIFT) -&gt; &lt;foo&gt;;

 - &lt;foo&gt; &gt;&gt; (PAGE_CACHE_SHIFT - PAGE_SHIFT) -&gt; &lt;foo&gt;;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -&gt; PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -&gt; get_page();

 - page_cache_release() -&gt; put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E &lt;&lt; (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E &gt;&gt; (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - &lt;foo&gt; &lt;&lt; (PAGE_CACHE_SHIFT - PAGE_SHIFT) -&gt; &lt;foo&gt;;

 - &lt;foo&gt; &gt;&gt; (PAGE_CACHE_SHIFT - PAGE_SHIFT) -&gt; &lt;foo&gt;;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -&gt; PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -&gt; get_page();

 - page_cache_release() -&gt; put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E &lt;&lt; (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E &gt;&gt; (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>blk: fix overflow in queue_discard_max_hw_show</title>
<updated>2016-02-17T17:20:42+00:00</updated>
<author>
<name>Alan</name>
<email>gnomes@lxorguk.ukuu.org.uk</email>
</author>
<published>2016-02-17T14:15:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=18f922d037211a15543af935861bf92161e697e9'/>
<id>18f922d037211a15543af935861bf92161e697e9</id>
<content type='text'>
We get this right for queue_discard_max_show but not max_hw_show. Follow the
same pattern as queue_discard_max_show instead so that we don't truncate.

Signed-off-by: Alan Cox &lt;alan@linux.intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We get this right for queue_discard_max_show but not max_hw_show. Follow the
same pattern as queue_discard_max_show instead so that we don't truncate.

Signed-off-by: Alan Cox &lt;alan@linux.intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'mkp-fixes' into fixes</title>
<updated>2015-12-03T17:32:33+00:00</updated>
<author>
<name>James Bottomley</name>
<email>James.Bottomley@HansenPartnership.com</email>
</author>
<published>2015-12-03T17:32:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=be9e2f775f5a3147205c552534c3abf0f9374a13'/>
<id>be9e2f775f5a3147205c552534c3abf0f9374a13</id>
<content type='text'>
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
</pre>
</div>
</content>
</entry>
<entry>
<title>block/sd: Fix device-imposed transfer length limits</title>
<updated>2015-11-26T02:38:58+00:00</updated>
<author>
<name>Martin K. Petersen</name>
<email>martin.petersen@oracle.com</email>
</author>
<published>2015-11-13T21:46:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ca369d51b3e1649be4a72addd6d6a168cfb3f537'/>
<id>ca369d51b3e1649be4a72addd6d6a168cfb3f537</id>
<content type='text'>
Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
had the unfortunate side-effect of removing an implicit clamp to
BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
code. This caused problems for some SMR drives.

Debugging this issue revealed a few problems with the existing
infrastructure since the block layer didn't know how to deal with
device-imposed limits, only limits set by the I/O controller.

 - Introduce a new queue limit, max_dev_sectors, which is used by the
   ULD to signal the maximum sectors for a REQ_TYPE_FS request.

 - Ensure that max_dev_sectors is correctly stacked and taken into
   account when overriding max_sectors through sysfs.

 - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
   values for later processing.

 - In sd_revalidate() set the queue's max_dev_sectors based on the
   MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
   is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
   field size.

 - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
   VPD--if reported and sane--to signal the preferred device transfer
   size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

 - blk_limits_max_hw_sectors() is no longer used and can be removed.

Signed-off-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: sweeneygj@gmx.com
Tested-by: Arzeets &lt;anatol.pomozov@gmail.com&gt;
Tested-by: David Eisner &lt;david.eisner@oriel.oxon.org&gt;
Tested-by: Mario Kicherer &lt;dev@kicherer.org&gt;
Signed-off-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
had the unfortunate side-effect of removing an implicit clamp to
BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
code. This caused problems for some SMR drives.

Debugging this issue revealed a few problems with the existing
infrastructure since the block layer didn't know how to deal with
device-imposed limits, only limits set by the I/O controller.

 - Introduce a new queue limit, max_dev_sectors, which is used by the
   ULD to signal the maximum sectors for a REQ_TYPE_FS request.

 - Ensure that max_dev_sectors is correctly stacked and taken into
   account when overriding max_sectors through sysfs.

 - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
   values for later processing.

 - In sd_revalidate() set the queue's max_dev_sectors based on the
   MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
   is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
   field size.

 - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
   VPD--if reported and sane--to signal the preferred device transfer
   size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

 - blk_limits_max_hw_sectors() is no longer used and can be removed.

Signed-off-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: sweeneygj@gmx.com
Tested-by: Arzeets &lt;anatol.pomozov@gmail.com&gt;
Tested-by: David Eisner &lt;david.eisner@oriel.oxon.org&gt;
Tested-by: Mario Kicherer &lt;dev@kicherer.org&gt;
Signed-off-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: add block polling support</title>
<updated>2015-11-07T17:40:47+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2015-11-05T17:44:55+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=05229beeddf7e75e2e616ddaad4b70e7fca9528d'/>
<id>05229beeddf7e75e2e616ddaad4b70e7fca9528d</id>
<content type='text'>
Add basic support for polling for specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.

This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Acked-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Keith Busch &lt;keith.busch@intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add basic support for polling for specific IO to complete. This uses
the cookie that blk-mq passes back, which enables the block layer
to pass this cookie to the driver to spin for a specific request.

This will be combined with request latency tracking, so we can make
qualified decisions about when to poll and when not to. For now, for
benchmark purposes, we add a sysfs file that controls whether polling
is enabled or not.

Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
Acked-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Keith Busch &lt;keith.busch@intel.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block</title>
<updated>2015-11-05T04:51:48+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2015-11-05T04:51:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=527d1529e38b36fd22e65711b653ab773179d9e8'/>
<id>527d1529e38b36fd22e65711b653ab773179d9e8</id>
<content type='text'>
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull block integrity updates from Jens Axboe:
 ""This is the joint work of Dan and Martin, cleaning up and improving
  the support for block data integrity"

* 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
  block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
  block: blk_flush_integrity() for bio-based drivers
  block: move blk_integrity to request_queue
  block: generic request_queue reference counting
  nvme: suspend i/o during runtime blk_integrity_unregister
  md: suspend i/o during runtime blk_integrity_unregister
  md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
  block: Inline blk_integrity in struct gendisk
  block: Export integrity data interval size in sysfs
  block: Reduce the size of struct blk_integrity
  block: Consolidate static integrity profile properties
  block: Move integrity kobject to struct gendisk
</pre>
</div>
</content>
</entry>
<entry>
<title>block: generic request_queue reference counting</title>
<updated>2015-10-21T20:43:41+00:00</updated>
<author>
<name>Dan Williams</name>
<email>dan.j.williams@intel.com</email>
</author>
<published>2015-10-21T17:20:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=3ef28e83ab15799742e55fd13243a5f678b04242'/>
<id>3ef28e83ab15799742e55fd13243a5f678b04242</id>
<content type='text'>
Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q-&gt;make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [&lt;ffffffff8145e8bf&gt;] ? copy_user_handle_tail+0x5f/0x70
  [&lt;ffffffffa004e1e0&gt;] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [&lt;ffffffffa004e331&gt;] pmem_make_request+0xd1/0x200 [nd_pmem]
  [&lt;ffffffff811c3162&gt;] ? mempool_alloc+0x72/0x1a0
  [&lt;ffffffff8141f8b6&gt;] generic_make_request+0xd6/0x110
  [&lt;ffffffff8141f966&gt;] submit_bio+0x76/0x170
  [&lt;ffffffff81286dff&gt;] submit_bh_wbc+0x12f/0x160
  [&lt;ffffffff81286e62&gt;] submit_bh+0x12/0x20
  [&lt;ffffffff813395bd&gt;] jbd2_write_superblock+0x8d/0x170
  [&lt;ffffffff8133974d&gt;] jbd2_mark_journal_empty+0x5d/0x90
  [&lt;ffffffff813399cb&gt;] jbd2_journal_destroy+0x24b/0x270
  [&lt;ffffffff810bc4ca&gt;] ? put_pwq_unlocked+0x2a/0x30
  [&lt;ffffffff810bc6f5&gt;] ? destroy_workqueue+0x225/0x250
  [&lt;ffffffff81303494&gt;] ext4_put_super+0x64/0x360
  [&lt;ffffffff8124ab1a&gt;] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Suggested-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Allow pmem, and other synchronous/bio-based block drivers, to fallback
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q-&gt;make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [&lt;ffffffff8145e8bf&gt;] ? copy_user_handle_tail+0x5f/0x70
  [&lt;ffffffffa004e1e0&gt;] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [&lt;ffffffffa004e331&gt;] pmem_make_request+0xd1/0x200 [nd_pmem]
  [&lt;ffffffff811c3162&gt;] ? mempool_alloc+0x72/0x1a0
  [&lt;ffffffff8141f8b6&gt;] generic_make_request+0xd6/0x110
  [&lt;ffffffff8141f966&gt;] submit_bio+0x76/0x170
  [&lt;ffffffff81286dff&gt;] submit_bh_wbc+0x12f/0x160
  [&lt;ffffffff81286e62&gt;] submit_bh+0x12/0x20
  [&lt;ffffffff813395bd&gt;] jbd2_write_superblock+0x8d/0x170
  [&lt;ffffffff8133974d&gt;] jbd2_mark_journal_empty+0x5d/0x90
  [&lt;ffffffff813399cb&gt;] jbd2_journal_destroy+0x24b/0x270
  [&lt;ffffffff810bc4ca&gt;] ? put_pwq_unlocked+0x2a/0x30
  [&lt;ffffffff810bc6f5&gt;] ? destroy_workqueue+0x225/0x250
  [&lt;ffffffff81303494&gt;] ext4_put_super+0x64/0x360
  [&lt;ffffffff8124ab1a&gt;] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Suggested-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Tested-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;
Signed-off-by: Dan Williams &lt;dan.j.williams@intel.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: don't release bdi while request_queue has live references</title>
<updated>2015-10-15T15:53:28+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2015-09-08T16:20:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b02176f30cd30acccd3b633ab7d9aed8b5da52ff'/>
<id>b02176f30cd30acccd3b633ab7d9aed8b5da52ff</id>
<content type='text'>
bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.

A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue.  As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine.  The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.

a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation.  Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy().  This means that the root congested state may go away
prematurely while the queue is between bdi_dstroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step.  To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.

While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Reported-and-tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
Reviewed-by: Jan Kara &lt;jack@suse.com&gt;
Reviewed-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.

A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue.  As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine.  The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.

a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation.  Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy().  This means that the root congested state may go away
prematurely while the queue is between bdi_dstroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step.  To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.

While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.

Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Cc: stable@vger.kernel.org # v4.2+
Reported-and-tested-by: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
Reviewed-by: Jan Kara &lt;jack@suse.com&gt;
Reviewed-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: make generic_make_request handle arbitrarily sized bios</title>
<updated>2015-08-13T18:31:33+00:00</updated>
<author>
<name>Kent Overstreet</name>
<email>kent.overstreet@gmail.com</email>
</author>
<published>2015-04-24T05:37:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=54efd50bfd873e2dbf784e0b21a8027ba4299a3e'/>
<id>54efd50bfd873e2dbf784e0b21a8027ba4299a3e</id>
<content type='text'>
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Ming Lei &lt;ming.lei@canonical.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Alasdair Kergon &lt;agk@redhat.com&gt;
Cc: Mike Snitzer &lt;snitzer@redhat.com&gt;
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg &lt;drbd-dev@lists.linbit.com&gt;
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina &lt;jkosina@suse.cz&gt;
Cc: Geoff Levand &lt;geoff@infradead.org&gt;
Cc: Jim Paris &lt;jim@jtan.com&gt;
Cc: Philip Kelleher &lt;pjk1939@linux.vnet.ibm.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Oleg Drokin &lt;oleg.drokin@intel.com&gt;
Cc: Andreas Dilger &lt;andreas.dilger@intel.com&gt;
Acked-by: NeilBrown &lt;neilb@suse.de&gt; (for the 'md/md.c' bits)
Acked-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Reviewed-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@gmail.com&gt;
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park &lt;dpark@posteo.net&gt;
Signed-off-by: Ming Lin &lt;ming.l@ssi.samsung.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.

But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them.  In the future this will
let us delete merge_bvec_fn and a bunch of other code.

We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.

Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:

 * nfhd_make_request (arch/m68k/emu/nfblock.c)
 * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
 * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
 * brd_make_request (ramdisk - drivers/block/brd.c)
 * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
 * loop_make_request
 * null_queue_bio
 * bcache's make_request fns

Some others are almost certainly safe to remove now, but will be left
for future patches.

Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Ming Lei &lt;ming.lei@canonical.com&gt;
Cc: Neil Brown &lt;neilb@suse.de&gt;
Cc: Alasdair Kergon &lt;agk@redhat.com&gt;
Cc: Mike Snitzer &lt;snitzer@redhat.com&gt;
Cc: dm-devel@redhat.com
Cc: Lars Ellenberg &lt;drbd-dev@lists.linbit.com&gt;
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina &lt;jkosina@suse.cz&gt;
Cc: Geoff Levand &lt;geoff@infradead.org&gt;
Cc: Jim Paris &lt;jim@jtan.com&gt;
Cc: Philip Kelleher &lt;pjk1939@linux.vnet.ibm.com&gt;
Cc: Minchan Kim &lt;minchan@kernel.org&gt;
Cc: Nitin Gupta &lt;ngupta@vflare.org&gt;
Cc: Oleg Drokin &lt;oleg.drokin@intel.com&gt;
Cc: Andreas Dilger &lt;andreas.dilger@intel.com&gt;
Acked-by: NeilBrown &lt;neilb@suse.de&gt; (for the 'md/md.c' bits)
Acked-by: Mike Snitzer &lt;snitzer@redhat.com&gt;
Reviewed-by: Martin K. Petersen &lt;martin.petersen@oracle.com&gt;
Signed-off-by: Kent Overstreet &lt;kent.overstreet@gmail.com&gt;
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
Signed-off-by: Dongsu Park &lt;dpark@posteo.net&gt;
Signed-off-by: Ming Lin &lt;ming.l@ssi.samsung.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>block: make /sys/block/&lt;dev&gt;/queue/discard_max_bytes writeable</title>
<updated>2015-07-17T14:41:53+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@fb.com</email>
</author>
<published>2015-07-16T15:14:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0034af036554c39eefd14d835a8ec3496ac46712'/>
<id>0034af036554c39eefd14d835a8ec3496ac46712</id>
<content type='text'>
Lots of devices support huge discard sizes these days. Depending
on how the device handles them internally, huge discards can
introduce massive latencies (hundreds of msec) on the device side.

We have a sysfs file, discard_max_bytes, that advertises the max
hardware supported discard size. Make this writeable, and split
the settings into a soft and hard limit. This can be set from
'discard_granularity' and up to the hardware limit.

Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
set limit.

Reviewed-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Lots of devices support huge discard sizes these days. Depending
on how the device handles them internally, huge discards can
introduce massive latencies (hundreds of msec) on the device side.

We have a sysfs file, discard_max_bytes, that advertises the max
hardware supported discard size. Make this writeable, and split
the settings into a soft and hard limit. This can be set from
'discard_granularity' and up to the hardware limit.

Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
set limit.

Reviewed-by: Jeff Moyer &lt;jmoyer@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
