linux-stable.git/net/ceph, branch v3.16.40

libceph: apply new_state before new_up_client on incrementals

2016-11-20T01:16:57+00:00

commit 930c532869774ebf8af9efe9484c597f896a7d46 upstream.

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

2087    for (i = 0; i < pg->pg_temp.len; i++) {
2088            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089                    if (ceph_can_shift_osds(pi))
2090                            continue;
2091
2092                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed

Fixes: http://tracker.ceph.com/issues/14901

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
Signed-off-by: Ben Hutchings

libceph: set 'exists' flag for newly up osd

2016-11-20T01:16:56+00:00

commit 6dd74e44dc1df85f125982a8d6591bc4a76c9f5d upstream.

Signed-off-by: Yan, Zheng 
Reviewed-by: Sage Weil 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Ben Hutchings

libceph: make authorizer destruction independent of ceph_auth_client

2016-06-15T20:29:27+00:00

commit 6c1ea260f89709e0021d2c59f8fd2a104b5b1123 upstream.

Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[]  [] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [] ? dequeue_entity+0x152/0x690
    [262674.505096]  [] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [] process_one_work+0x14e/0x3d0
    [262674.604072]  [] worker_thread+0x11a/0x470
    [262674.652187]  [] ? rescuer_thread+0x310/0x310
    [262674.699022]  [] kthread+0xd2/0xf0
    [262674.744494]  [] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [] ret_from_fork+0x3f/0x70
    [262674.834094]  [] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
[bwh: Backported to 3.16:
 - Implementation of ceph_x_destroy_authorizer() is different
 - Adjust context]
Signed-off-by: Ben Hutchings

libceph: kfree() in put_osd() shouldn't depend on authorizer

2016-06-15T20:29:27+00:00

commit b28ec2f37e6a2bbd0bdf74b39cb89c74e4ad17f3 upstream.

a255651d4cad ("ceph: ensure auth ops are defined before use") made
kfree() in put_osd() conditional on the authorizer.  A mechanical
mistake most likely - fix it.

Cc: Alex Elder 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Reviewed-by: Alex Elder 
Signed-off-by: Ben Hutchings

libceph: don't bail early from try_read() when skipping a message

2016-03-08T12:15:31+00:00

commit e7a88e82fe380459b864e05b372638aeacb0f52d upstream.

The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.

Reported-by: Varada Kari 
Signed-off-by: Ilya Dryomov 
Tested-by: Varada Kari 
Reviewed-by: Alex Elder 
Signed-off-by: Luis Henriques

crush: fix a bug in tree bucket decode

2015-07-15T09:01:02+00:00

commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream.

struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -> u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.

Fixes: http://tracker.ceph.com/issues/2759

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
Signed-off-by: Luis Henriques

libceph: request a new osdmap if lingering request maps to no osd

2015-06-02T11:08:17+00:00

commit b0494532214bdfbf241e94fabab5dd46f7b82631 upstream.

This commit does two things.  First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch.  Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.

MON=1 OSD=1:

    # cat linger-needmap.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    rbd map dne/dne # obtain a new osdmap as a side effect (!)
    sleep 1
    ceph osd in 0
    rbd resize --size 2 test
    # rbd info test | grep size -> 2M
    # blockdev --getsize $DEV -> 1M

N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.

Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed.  This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists.  This spares us a WARN_ON,
which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Luis Henriques

crush: ensuring at most num-rep osds are selected

2015-05-20T12:26:19+00:00

commit 45002267e8d2699bf9b022315bee3dd13b044843 upstream.

Crush temporary buffers are allocated as per replica size configured
by the user.  When there are more final osds (to be selected as per
rule) than the replicas, buffer overlaps and it causes crash.  Now, it
ensures that at most num-rep osds are selected even if more number of
osds are allowed by the rule.

Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
                          234b066ba04976783d15ff2abc3e81b6cc06fb10.

Signed-off-by: Ilya Dryomov 
Signed-off-by: Luis Henriques

libceph: fix double __remove_osd() problem

2015-03-02T15:04:34+00:00

commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream.

It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            

con_fault_finish()
  osd_reset()
                              
                              ceph_osdc_handle_map()
                                
                                kick_requests()
                                  
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  
                                
    
    
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safe guard around __remove_osd().

Fixes: http://tracker.ceph.com/issues/8087

Cc: Sage Weil 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Reviewed-by: Alex Elder 
Signed-off-by: Luis Henriques

libceph: change from BUG to WARN for __remove_osd() asserts

2015-03-02T15:04:34+00:00

commit cc9f1f518cec079289d11d732efa490306b1ddad upstream.

No reason to use BUG_ON for osd request list assertions.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Luis Henriques