<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/block/rbd.c, branch v3.16</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>rbd: handle parent_overlap on writes correctly</title>
<updated>2014-06-23T08:55:37+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>ilya.dryomov@inktank.com</email>
</author>
<published>2014-06-10T09:53:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9638556a276125553549fdfe349c464481ec2f39'/>
<id>9638556a276125553549fdfe349c464481ec2f39</id>
<content type='text'>
The following check in rbd_img_obj_request_submit()

    rbd_dev-&gt;parent_overlap &lt;= obj_request-&gt;img_offset

allows a fall-through to the non-layered write case even if both
parent_overlap and obj_request-&gt;img_offset belong to the same RADOS
object.  This leads to data corruption, because the area to the left of
parent_overlap ends up unconditionally zero-filled instead of being
populated with parent data.  Suppose we want to write 1M to offset 6M
of image bar, which is a clone of foo@snap; object_size is 4M,
parent_overlap is 5M:

    rbd_data.&lt;id&gt;.0000000000000001
     ---------------------|----------------------|------------
    | should be copyup'ed | should be zeroed out | write ...
     ---------------------|----------------------|------------
   4M                    5M                     6M
                    parent_overlap    obj_request-&gt;img_offset

4..5M should be copyup'ed from foo, yet it is zero-filled, just like
5..6M is.

Given that the only striping mode the kernel client currently supports
is chunking (i.e. stripe_unit == object_size, stripe_count == 1), round
parent_overlap up to the next object boundary for the purposes of the
overlap check.
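
A small Python model makes the effect of the rounding concrete. It is a
sketch, not the kernel code: round_up(), overlaps_parent_old() and
overlaps_parent_new() are hypothetical helpers, offsets are plain byte
counts, and only the chunking layout described above is modelled.

```python
MiB = 1024 * 1024


def round_up(value, boundary):
    """Round value up to the next multiple of boundary."""
    return (value + boundary - 1) // boundary * boundary


def overlaps_parent_old(img_offset, parent_overlap):
    # Old check: the request takes the layered path only if it starts
    # strictly before parent_overlap.
    return parent_overlap > img_offset


def overlaps_parent_new(img_offset, parent_overlap, object_size):
    # New check, valid only for chunking (stripe_unit == object_size,
    # stripe_count == 1): compare against parent_overlap rounded up to
    # the end of the object it falls into.
    return round_up(parent_overlap, object_size) > img_offset


# Scenario from above: 4M objects, parent_overlap at 5M, write at 6M.
print(overlaps_parent_old(6 * MiB, 5 * MiB))           # False: 4..5M zero-filled
print(overlaps_parent_new(6 * MiB, 5 * MiB, 4 * MiB))  # True: layered (copyup) path
```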

Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</content>
</entry>
<entry>
<title>rbd: only set disk to read-only once</title>
<updated>2014-06-11T01:09:25+00:00</updated>
<author>
<name>Josh Durgin</name>
<email>josh.durgin@inktank.com</email>
</author>
<published>2013-10-01T03:10:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=22001f619f29ddf66582d834223dcff4c0b74595'/>
<id>22001f619f29ddf66582d834223dcff4c0b74595</id>
<content type='text'>
rbd_open(), called every time the device is opened, calls
set_device_ro().  There's no reason to set the device read-only or
read-write every time it is opened. Just do this once during device
setup, using set_disk_ro() instead because the struct block_device
isn't available to us there.

Signed-off-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@linaro.org&gt;
</content>
</entry>
<entry>
<title>rbd: move calls that may sleep out of spin lock range</title>
<updated>2014-06-11T01:09:25+00:00</updated>
<author>
<name>Josh Durgin</name>
<email>josh.durgin@inktank.com</email>
</author>
<published>2013-10-01T00:09:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=77f33c03739697d01c2e730e4c2610424059ceaf'/>
<id>77f33c03739697d01c2e730e4c2610424059ceaf</id>
<content type='text'>
get_user() and set_disk_ro() may allocate memory, leading to a
potential deadlock if they are called while a spin lock is held.

Move the acquisition and release of rbd_dev-&gt;lock from rbd_ioctl()
into rbd_ioctl_set_ro(), so it can occur between get_user() and
set_disk_ro().
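
The resulting ordering can be sketched in Python, with threading.Lock
standing in for the spin lock. Everything here is a hypothetical model:
RbdDevModel is not a kernel structure, get_user() and set_disk_ro() are
stubs named after the kernel calls, and the open_count/EBUSY check is
only an assumed example of state the lock protects.

```python
import errno
import threading


class RbdDevModel:
    """Hypothetical stand-in for the rbd_dev fields the ioctl touches."""

    def __init__(self):
        self.lock = threading.Lock()  # plays the role of rbd_dev->lock
        self.open_count = 1
        self.read_only = False
        self.disk_ro_calls = []       # records every set_disk_ro() call


def get_user(user_buf):
    # Stub: in the kernel this may fault and sleep, so no spin lock held.
    return user_buf[0]


def set_disk_ro(dev, ro):
    # Stub: in the kernel this may allocate memory, so no spin lock held.
    dev.disk_ro_calls.append(ro)


def rbd_ioctl_set_ro(dev, user_buf):
    ro = bool(get_user(user_buf))     # 1. may sleep: called before locking
    changed = False
    with dev.lock:                    # 2. only non-sleeping work inside
        if dev.open_count > 1:
            return -errno.EBUSY
        if dev.read_only != ro:
            dev.read_only = ro
            changed = True
    if changed:                       # 3. may sleep: called after unlocking
        set_disk_ro(dev, ro)
    return 0
```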

Signed-off-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@linaro.org&gt;
</content>
</entry>
<entry>
<title>rbd: add ioctl for rbd</title>
<updated>2014-06-11T01:09:24+00:00</updated>
<author>
<name>Guangliang Zhao</name>
<email>guangliang@unitedstack.com</email>
</author>
<published>2013-09-24T03:25:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=131fd9f6fc89ad2cc993f80664d18ca49d6f8483'/>
<id>131fd9f6fc89ad2cc993f80664d18ca49d6f8483</id>
<content type='text'>
When running the following commands:
    [root@ceph0 mnt]# blockdev --setro /dev/rbd1
    [root@ceph0 mnt]# blockdev --getro /dev/rbd1
    0

blockdev --setro didn't take effect because rbd doesn't
implement the ioctl method of a block driver.

This resolves:
	http://tracker.ceph.com/issues/6265
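
A minimal Python model of the symptom, in which the only thing that
changes is whether the driver handles BLKROSET. RbdDiskModel and
setro_then_getro() are hypothetical, arg is passed by value instead of
as a user-space pointer, and any handling the generic block layer may
do on its own is left out. BLKROSET is the value of _IO(0x12, 93) from
linux/fs.h.

```python
import errno

BLKROSET = 0x125D  # _IO(0x12, 93) from linux/fs.h


class RbdDiskModel:
    """Hypothetical model of an rbd disk's read-only flag."""

    def __init__(self, has_ioctl):
        self.read_only = False
        self.has_ioctl = has_ioctl  # False models before the fix, True after

    def ioctl(self, cmd, arg):
        # Sketch of a driver ioctl handler: BLKROSET is honoured, any other
        # command (or any command without a handler) gets -ENOTTY.
        if self.has_ioctl and cmd == BLKROSET:
            self.read_only = bool(arg)
            return 0
        return -errno.ENOTTY


def setro_then_getro(disk):
    # Mirrors "blockdev --setro" followed by "blockdev --getro".
    disk.ioctl(BLKROSET, 1)
    return int(disk.read_only)


print(setro_then_getro(RbdDiskModel(has_ioctl=False)))  # 0
print(setro_then_getro(RbdDiskModel(has_ioctl=True)))   # 1
```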

Signed-off-by: Guangliang Zhao &lt;guangliang@unitedstack.com&gt;
Reviewed-by: Alex Elder &lt;elder@linaro.org&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</content>
</entry>
<entry>
<title>rbd: fix ida/idr memory leak</title>
<updated>2014-06-06T01:30:00+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>ilya.dryomov@inktank.com</email>
</author>
<published>2014-05-20T11:46:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ffe312cf31c7d8616096616d469eb5f6bb8905c0'/>
<id>ffe312cf31c7d8616096616d469eb5f6bb8905c0</id>
<content type='text'>
ida_destroy() needs to be called on module exit to release ida caches.

Signed-off-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@linaro.org&gt;
</content>
</entry>
<entry>
<title>rbd: use reference counts for image requests</title>
<updated>2014-06-06T01:29:59+00:00</updated>
<author>
<name>Alex Elder</name>
<email>elder@linaro.org</email>
</author>
<published>2014-04-26T10:21:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0f2d5be792b0466b06797f637cfbb0f64dbb408c'/>
<id>0f2d5be792b0466b06797f637cfbb0f64dbb408c</id>
<content type='text'>
Each image request contains a reference count, but to date it has
not actually been used.  (I think this was just an oversight.) A
recent report involving rbd failing an assertion shed light on why
and where we need to use these reference counts.

Every OSD request associated with an object request uses
rbd_osd_req_callback() as its callback function.  That function will
call a helper function (dependent on the type of OSD request) that
will set the object request's "done" flag if appropriate.  If that
"done" flag is set, the object request is
passed to rbd_obj_request_complete().

In rbd_obj_request_complete(), requests are processed in sequential
order.  So if an object request completes before one of its
predecessors in the image request, the completion is deferred.
Otherwise, if it's a completing object's "turn" to be completed, it
is passed to rbd_img_obj_end_request(), which records the result of
the operation, accumulates transferred bytes, and so on.  Next, the
successor to this request is checked and if it is marked "done",
(deferred) completion processing is performed on that request, and
so on.  If the last object request in an image request is completed,
rbd_img_request_complete() is called, which (typically) destroys
the image request.

There is a race here, however.  The instant an object request is
marked "done" it can be provided (by a thread handling completion of
one of its predecessor operations) to rbd_img_obj_end_request(),
which (for the last request) can then lead to the image request
getting torn down.  And this can happen *before* that object has
itself entered rbd_img_obj_end_request().  As a result, once it
*does* enter that function, the image request (and even the object
request itself) may have been freed and become invalid.

All that's necessary to avoid this is to properly count references
to the image requests.  We tear down an image request's object
requests all at once--only when the entire image request has
completed.  So there's no need for an image request to count
references for its object requests.  However, we don't want an
image request to go away until the last of its object requests
has passed through rbd_img_obj_callback().  In other words,
we don't want rbd_img_request_complete() to necessarily
result in the image request being destroyed, because it may
get called before we've finished processing on all of its
object requests.

So the fix is to add a reference to an image request for
each of its object requests.  The reference can be viewed
as representing an object request that has not yet finished
its call to rbd_img_obj_callback().  That is emphasized by
taking the reference right after assigning rbd_img_obj_callback()
as the object request's callback function.  The corresponding release of that
reference is done at the end of rbd_img_obj_callback(), which
every image object request passes through exactly once.
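
The reference scheme can be replayed in a small Python model of the
interleaving described above. ImageRequestModel and
obj1_sees_live_image() are hypothetical; "freed" only marks where the
image request would be destroyed, and the two object requests are
driven by hand rather than by real OSD completions.

```python
class ImageRequestModel:
    """Hypothetical model of an image request's reference count."""

    def __init__(self, obj_count, per_object_refs):
        self.per_object_refs = per_object_refs
        # The creator always holds one reference.  With the fix, each
        # object request holds one more, taken when its callback is set.
        self.refcount = 1 + (obj_count if per_object_refs else 0)
        self.freed = False

    def put(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True  # the image request would be destroyed here

    def complete(self):
        # rbd_img_request_complete(): drops the creator's reference.
        self.put()

    def obj_callback_done(self):
        # End of rbd_img_obj_callback() for one object request.
        if self.per_object_refs:
            self.put()


def obj1_sees_live_image(per_object_refs):
    """Is the image request still valid when object 1 finally enters
    its own completion path?"""
    img = ImageRequestModel(obj_count=2, per_object_refs=per_object_refs)
    # The thread completing object 0 also finishes the already-"done"
    # last object 1, which completes the whole image request.
    img.complete()
    img.obj_callback_done()       # object 0 finishes its callback
    alive = not img.freed         # object 1 only now enters its callback
    img.obj_callback_done()       # object 1 finishes its callback
    return alive


print(obj1_sees_live_image(per_object_refs=False))  # False: use-after-free
print(obj1_sees_live_image(per_object_refs=True))   # True
```

With per_object_refs=False the creator's reference is the only one, so
completing the image request frees it before object 1 gets there.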

Cc: stable@vger.kernel.org
Signed-off-by: Alex Elder &lt;elder@linaro.org&gt;
Reviewed-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
</content>
</entry>
<entry>
<title>rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()</title>
<updated>2014-06-06T01:29:59+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>ilya.dryomov@inktank.com</email>
</author>
<published>2014-05-22T15:28:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=b30a01f2a307f55a505762ba09c0440d906c6711'/>
<id>b30a01f2a307f55a505762ba09c0440d906c6711</id>
<content type='text'>
osd_request and the r_request and r_reply messages attached to it
are leaked in __rbd_dev_header_watch_sync() if the requested image
doesn't exist.  This is because lingering requests are special and get
an extra ref in the reply path.  Fix it by unregistering the linger
request on the error path, and split __rbd_dev_header_watch_sync() into
two functions to make it more maintainable.
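
A simplified Python refcount model of the leak. OsdRequestModel and
watch_error_path() are hypothetical, and the model folds the extra
reference that the reply path takes for a lingering request into
register_linger(); only the reference bookkeeping, not the libceph call
sequence, is meant to match.

```python
class OsdRequestModel:
    """Hypothetical refcount model of a lingering osd_request."""

    def __init__(self):
        self.refcount = 1        # the caller's reference
        self.lingering = False

    def register_linger(self):
        # Lingering requests hold one extra reference on their own behalf.
        self.lingering = True
        self.refcount += 1

    def unregister_linger(self):
        if self.lingering:
            self.lingering = False
            self.refcount -= 1

    def put(self):
        self.refcount -= 1


def watch_error_path(unregister_on_error):
    req = OsdRequestModel()
    req.register_linger()
    # The watch now fails because the image does not exist.
    if unregister_on_error:
        req.unregister_linger()  # the fix: drop the linger reference too
    req.put()                    # caller drops its own reference
    return req.refcount          # above 0 means the request was leaked


print(watch_error_path(unregister_on_error=False))  # 1 (leaked)
print(watch_error_path(unregister_on_error=True))   # 0 (freed)
```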

Signed-off-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
</content>
</entry>
<entry>
<title>rbd: make sure we have latest osdmap on 'rbd map'</title>
<updated>2014-06-06T01:29:58+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>ilya.dryomov@inktank.com</email>
</author>
<published>2014-05-13T07:19:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=30ba1f020221991cf239d905c82984958f29bdfe'/>
<id>30ba1f020221991cf239d905c82984958f29bdfe</id>
<content type='text'>
Given an existing idle mapping (img1), mapping an image (img2) in
a newly created pool (pool2) fails:

    $ ceph osd pool create pool1 8 8
    $ rbd create --size 1000 pool1/img1
    $ sudo rbd map pool1/img1
    $ ceph osd pool create pool2 8 8
    $ rbd create --size 1000 pool2/img2
    $ sudo rbd map pool2/img2
    rbd: sysfs write failed
    rbd: map failed: (2) No such file or directory

This is because client instances are shared by default and we don't
request an osdmap update when bumping a ref on an existing client.  The
fix is to use the mon_get_version request to see if the osdmap we have
is the latest, and block until the requested update is received if it's
not.
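
The "block until the map is current" step can be sketched with a
condition variable. This is a Python model under assumptions, not
libceph code: OsdmapModel is hypothetical, newest stands in for the
epoch a mon_get_version reply would report, and the epoch numbers are
made up.

```python
import threading


class OsdmapModel:
    """Hypothetical model of the osdmap epoch held by a shared client."""

    def __init__(self, epoch):
        self.epoch = epoch
        self.cond = threading.Condition()

    def handle_new_map(self, epoch):
        # Runs when an osdmap update arrives from the monitors.
        with self.cond:
            self.epoch = max(self.epoch, epoch)
            self.cond.notify_all()

    def wait_for_latest(self, newest, timeout=None):
        # Block until the local map is at least as new as `newest`;
        # returns False if the timeout expires first.
        with self.cond:
            return self.cond.wait_for(lambda: self.epoch >= newest, timeout)


client = OsdmapModel(epoch=10)        # existing client, map predates pool2
newest = 12                           # made-up epoch the monitors report
threading.Timer(0.1, client.handle_new_map, args=(newest,)).start()
print(client.wait_for_latest(newest, timeout=5))  # True
print(client.epoch)                               # 12
```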

Fixes: http://tracker.ceph.com/issues/8184

Signed-off-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
Reviewed-by: Sage Weil &lt;sage@inktank.com&gt;
</content>
</entry>
<entry>
<title>rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO</title>
<updated>2014-06-06T01:29:51+00:00</updated>
<author>
<name>Duan Jiong</name>
<email>duanj.fnst@cn.fujitsu.com</email>
</author>
<published>2014-04-11T08:38:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=461f758ac0bad40fe8e0959f415dae38efa16c12'/>
<id>461f758ac0bad40fe8e0959f415dae38efa16c12</id>
<content type='text'>
This patch fixes a coccinelle error regarding the use of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.
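
For reference, the semantics of the three helpers can be modelled in
Python, assuming 64-bit pointers and a MAX_ERRNO of 4095 as in the
kernel's err.h. The uppercase function names mirror the C helpers;
ret_old() is a hypothetical name for the open-coded pattern the patch
replaces, and the "good" address is made up.

```python
import errno

MAX_ERRNO = 4095
PTR_SPACE = 2 ** 64  # assumes 64-bit pointers / unsigned long


def ERR_PTR(error):
    # Encode a negative errno as one of the last MAX_ERRNO pointer values.
    return PTR_SPACE + error


def IS_ERR(ptr):
    return ptr >= PTR_SPACE - MAX_ERRNO


def PTR_ERR(ptr):
    # Reinterpret the unsigned pointer value as a signed long.
    return ptr - PTR_SPACE if ptr >= PTR_SPACE // 2 else ptr


def ret_old(ptr):
    # The open-coded pattern flagged by coccinelle.
    if IS_ERR(ptr):
        return PTR_ERR(ptr)
    return 0


def PTR_ERR_OR_ZERO(ptr):
    # The single helper that replaces it.
    return PTR_ERR(ptr) if IS_ERR(ptr) else 0


good = 0xFFFF888000001000        # an ordinary (made-up) kernel address
bad = ERR_PTR(-errno.ENOMEM)
print(PTR_ERR_OR_ZERO(good), PTR_ERR_OR_ZERO(bad))  # 0 and -ENOMEM
```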

Signed-off-by: Duan Jiong &lt;duanj.fnst@cn.fujitsu.com&gt;
Reviewed-by: Yan, Zheng &lt;zheng.z.yan@intel.com&gt;
</content>
</entry>
<entry>
<title>rbd: prefix rbd writes with CEPH_OSD_OP_SETALLOCHINT osd op</title>
<updated>2014-04-03T02:33:52+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>ilya.dryomov@inktank.com</email>
</author>
<published>2014-02-25T14:22:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=0ccd59266973047770d5160318561c9189b79c93'/>
<id>0ccd59266973047770d5160318561c9189b79c93</id>
<content type='text'>
In an effort to reduce fragmentation, prefix every rbd write with
a CEPH_OSD_OP_SETALLOCHINT osd op with an expected_write_size value set
to the object size (1 &lt;&lt; order).  Backwards compatibility is taken care
of on the libceph/osd side.

"The CEPH_OSD_OP_SETALLOCHINT hint is durable, in that it's enough to
do it once.  The reason every rbd write is prefixed is that rbd doesn't
explicitly create objects and relies on writes creating them
implicitly, so there is no place to stick a single hint op into.  To
get around that we decided to prefix every rbd write with a hint (just
like write and setattr ops, hint op will create an object implicitly if
it doesn't exist)."
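
A sketch of the resulting op vector for one object write, in Python.
build_write_ops() and the tuple/dict encoding are hypothetical; only
the ordering (hint first, then write) and the expected_write_size value
of one object size come from the text above.

```python
def build_write_ops(order, offset, length):
    """Hypothetical op vector for a single rbd object write."""
    object_size = 2 ** order          # same as shifting 1 left by order
    return [
        # Durable allocation hint; repeating it is harmless and, like a
        # write, it implicitly creates the object if it does not exist.
        ("setallochint", {"expected_write_size": object_size}),
        ("write", {"offset": offset, "length": length}),
    ]


# Order 22 gives the default 4 MiB rbd object size.
for op in build_write_ops(order=22, offset=0, length=4096):
    print(op)
```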

Signed-off-by: Ilya Dryomov &lt;ilya.dryomov@inktank.com&gt;
Reviewed-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@linaro.org&gt;
</content>
</entry>
</feed>
