linux-stable.git/drivers/block/rbd.c, branch linux-3.16.y

rbd: don't return 0 on unmap if RBD_DEV_FLAG_REMOVING is set

2019-05-02T20:41:17+00:00

commit 85f5a4d666fd9be73856ed16bb36c5af5b406b29 upstream.

There is a window between when RBD_DEV_FLAG_REMOVING is set and when
the device is removed from rbd_dev_list.  During this window, we set
"already" and return 0.

Returning 0 from write(2) can confuse userspace tools because
0 indicates that nothing was written.  In particular, "rbd unmap"
will retry the write multiple times a second:

  10:28:05.463299 write(4, "0", 1)        = 0
  10:28:05.463509 write(4, "0", 1)        = 0
  10:28:05.463720 write(4, "0", 1)        = 0
  10:28:05.463942 write(4, "0", 1)        = 0
  10:28:05.464155 write(4, "0", 1)        = 0

Signed-off-by: Ilya Dryomov 
Tested-by: Dongsheng Yang 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

rbd: whitelist RBD_FEATURE_OPERATIONS feature bit

2018-06-16T21:22:02+00:00

commit e573427a440fd67d3f522357d7ac901d59281948 upstream.

This feature bit restricts older clients from performing certain
maintenance operations against an image (e.g. clone, snap create).
krbd does not perform maintenance operations.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Jason Dillaman 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

rbd: use GFP_NOIO for parent stat and data requests

2018-01-01T20:52:06+00:00

commit 1e37f2f84680fa7f8394fd444b6928e334495ccc upstream.

rbd_img_obj_exists_submit() and rbd_img_obj_parent_read_full() are on
the writeback path for cloned images -- we attempt a stat on the parent
object to see if it exists and potentially read it in to call copyup.
GFP_NOIO should be used instead of GFP_KERNEL here.

Link: http://tracker.ceph.com/issues/22014
Signed-off-by: Ilya Dryomov 
Reviewed-by: David Disseldorp 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

rbd: fix rbd map vs notify races

2016-06-15T20:29:28+00:00

commit 811c6688774613a78bfa020f64b570b73f6974c8 upstream.

A while ago, commit 9875201e1049 ("rbd: fix use-after free of
rbd_dev->disk") fixed rbd unmap vs notify race by introducing
an exported wrapper for flushing notifies and sticking it into
do_rbd_remove().

A similar problem exists on the rbd map path, though: the watch is
registered in rbd_dev_image_probe(), while the disk is set up quite
a few steps later, in rbd_dev_device_setup().  Nothing prevents
a notify from coming in and crashing on a NULL rbd_dev->disk:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    Call Trace:
     [] rbd_watch_cb+0x34/0x180 [rbd]
     [] do_event_work+0x40/0xb0 [libceph]
     [] process_one_work+0x17b/0x470
     [] worker_thread+0x11b/0x400
     [] ? rescuer_thread+0x400/0x400
     [] kthread+0xcf/0xe0
     [] ? finish_task_switch+0x53/0x170
     [] ? kthread_create_on_node+0x140/0x140
     [] ret_from_fork+0x58/0x90
     [] ? kthread_create_on_node+0x140/0x140
    RIP  [] rbd_dev_refresh+0xfa/0x180 [rbd]

If an error occurs during rbd map, we have to error out, potentially
tearing down a watch.  Just like on rbd unmap, notifies have to be
flushed, otherwise rbd_watch_cb() may end up trying to read in the
image header after rbd_dev_image_release() has run:

    Assertion failure in rbd_dev_header_info() at line 4722:

     rbd_assert(rbd_image_format_valid(rbd_dev->image_format));

    Call Trace:
     [] ? rbd_parent_request_create+0x150/0x150
     [] rbd_dev_refresh+0x59/0x390
     [] rbd_watch_cb+0x69/0x290
     [] do_event_work+0x10f/0x1c0
     [] process_one_work+0x689/0x1a80
     [] ? process_one_work+0x5e7/0x1a80
     [] ? finish_task_switch+0x225/0x640
     [] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
     [] worker_thread+0xd9/0x1320
     [] ? process_one_work+0x1a80/0x1a80
     [] kthread+0x21d/0x2e0
     [] ? kthread_stop+0x550/0x550
     [] ret_from_fork+0x22/0x40
     [] ? kthread_stop+0x550/0x550
    RIP  [] rbd_dev_header_info+0xa19/0x1e30

To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
revalidate_disk(), b) move ceph_osdc_flush_notifies() call into
rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
header read-in into a critical section.  The latter also happens to
take care of rbd map foo@bar vs rbd snap rm foo@bar race.

Fixes: http://tracker.ceph.com/issues/15490

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

rbd: require stable pages if message data CRCs are enabled

2015-11-16T11:27:15+00:00

commit bae818ee1577c27356093901a0ea48f672eda514 upstream.

rbd requires stable pages, as it performs a crc of the page data before
they are send to the OSDs.

But since kernel 3.9 (patch 1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0
"mm: only enforce stable page writes if the backing device requires
it") it is not assumed anymore that block devices require stable pages.

This patch sets the necessary flag to get stable pages back for rbd.

In a ceph installation that provides multiple ext4 formatted rbd
devices "bad crc" messages appeared regularly (ca 1 message every 1-2
minutes on every OSD that provided the data for the rbd) in the
OSD-logs before this patch. After this patch this messages are pretty
much gone (only ca 1-2 / month / OSD).

Signed-off-by: Ronny Hegewald 
[idryomov@gmail.com: require stable pages only in crc case, changelog]
Signed-off-by: Ilya Dryomov 
[idryomov@gmail.com: backport to 3.9-3.17: context]
Signed-off-by: Luis Henriques

rbd: prevent kernel stack blow up on rbd map

2015-11-16T11:27:06+00:00

commit 6d69bb536bac0d403d83db1ca841444981b280cd upstream.

Mapping an image with a long parent chain (e.g. image foo, whose parent
is bar, whose parent is baz, etc) currently leads to a kernel stack
overflow, due to the following recursion in the reply path:

  rbd_osd_req_callback()
    rbd_obj_request_complete()
      rbd_img_obj_callback()
        rbd_img_parent_read_callback()
          rbd_obj_request_complete()
            ...

Limit the parent chain to 16 images, which is ~5K worth of stack.  When
the above recursion is eliminated, this limit can be lifted.

Fixes: http://tracker.ceph.com/issues/12538

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
[idryomov@gmail.com: backport to 3.14: rbd_dev->opts, context]
Signed-off-by: Luis Henriques

rbd: don't leak parent_spec in rbd_dev_probe_parent()

2015-11-16T11:27:05+00:00

commit 1f2c6651f69c14d0d3a9cfbda44ea101b02160ba upstream.

Currently we leak parent_spec and trigger a "parent reference
underflow" warning if rbd_dev_create() in rbd_dev_probe_parent() fails.
The problem is we take the !parent out_err branch and that only drops
refcounts; parent_spec that would've been freed had we called
rbd_dev_unparent() remains and triggers rbd_warn() in
rbd_dev_parent_put() - at that point we have parent_spec != NULL and
parent_ref == 0, so counter ends up being -1 after the decrement.

Redo rbd_dev_probe_parent() to fix this.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
[idryomov@gmail.com: backport to < 4.2: rbd_dev->opts]
Signed-off-by: Luis Henriques

rbd: fix double free on rbd_dev->header_name

2015-11-16T11:26:54+00:00

commit 3ebe138ac642a195c7f2efdb918f464734421fd6 upstream.

If rbd_dev_image_probe() in rbd_dev_probe_parent() fails, header_name
is freed twice: once in rbd_dev_probe_parent() and then in its caller
rbd_dev_image_probe() (rbd_dev_image_probe() is called recursively to
handle parent images).

rbd_dev_probe_parent() is responsible for probing the parent, so it
shouldn't muck with clone's fields.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Luis Henriques

rbd: fix copyup completion race

2015-08-27T11:08:07+00:00

commit 2761713d35e370fd640b5781109f753066b746c4 upstream.

For write/discard obj_requests that involved a copyup method call, the
opcode of the first op is CEPH_OSD_OP_CALL and the ->callback is
rbd_img_obj_copyup_callback().  The latter frees copyup pages, sets
->xferred and delegates to rbd_img_obj_callback(), the "normal" image
object callback, for reporting to block layer and putting refs.

rbd_osd_req_callback() however treats CEPH_OSD_OP_CALL as a trivial op,
which means obj_request is marked done in rbd_osd_trivial_callback(),
*before* ->callback is invoked and rbd_img_obj_copyup_callback() has
a chance to run.  Marking obj_request done essentially means giving
rbd_img_obj_callback() a license to end it at any moment, so if another
obj_request from the same img_request is being completed concurrently,
rbd_img_obj_end_request() may very well be called on such prematurally
marked done request:


handle_reply()
  rbd_osd_req_callback()
    rbd_osd_trivial_callback()
    rbd_obj_request_complete()
    rbd_img_obj_copyup_callback()
    rbd_img_obj_callback()
                                    
                                    handle_reply()
                                      rbd_osd_req_callback()
                                        rbd_osd_trivial_callback()
      for_each_obj_request(obj_request->img_request) {
        rbd_img_obj_end_request(obj_request-1/2)
        rbd_img_obj_end_request(obj_request-2/2) <--
      }

Calling rbd_img_obj_end_request() on such a request leads to trouble,
in particular because its ->xfferred is 0.  We report 0 to the block
layer with blk_update_request(), get back 1 for "this request has more
data in flight" and then trip on

    rbd_assert(more ^ (which == img_request->obj_request_count));

with rhs (which == ...) being 1 because rbd_img_obj_end_request() has
been called for both requests and lhs (more) being 1 because we haven't
got a chance to set ->xfferred in rbd_img_obj_copyup_callback() yet.

To fix this, leverage that rbd wants to call class methods in only two
cases: one is a generic method call wrapper (obj_request is standalone)
and the other is a copyup (obj_request is part of an img_request).  So
make a dedicated handler for CEPH_OSD_OP_CALL and directly invoke
rbd_img_obj_copyup_callback() from it if obj_request is part of an
img_request, similar to how CEPH_OSD_OP_READ handler invokes
rbd_img_obj_request_read_callback().

Since rbd_img_obj_copyup_callback() is now being called from the OSD
request callback (only), it is renamed to rbd_osd_copyup_callback().

Cc: Alex Elder 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
[idryomov@gmail.com: backport to < 3.18: context]
Signed-off-by: Luis Henriques

rbd: use GFP_NOIO in rbd_obj_request_create()

2015-07-15T09:01:02+00:00

commit 5a60e87603c4c533492c515b7f62578189b03c9c upstream.

rbd_obj_request_create() is called on the main I/O path, so we need to
use GFP_NOIO to make sure allocation doesn't blow back on us.  Not all
callers need this, but I'm still hardcoding the flag inside rather than
making it a parameter because a) this is going to stable, and b) those
callers shouldn't really use rbd_obj_request_create() and will be fixed
in the future.

More memory allocation fixes will follow.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Luis Henriques