linux-stable.git/net/ceph, branch v3.12.64

libceph: apply new_state before new_up_client on incrementals

2016-08-19T07:50:45+00:00

commit 930c532869774ebf8af9efe9484c597f896a7d46 upstream.

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code:

    for (i = 0; i < pg->pg_temp.len; i++) {
            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
                    if (ceph_can_shift_osds(pi))
                            continue;

                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
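
The ordering problem described above can be reproduced with plain bit
operations.  This is a minimal userspace sketch, not the kernel code: the
helper names are hypothetical and the flag values are assumed to match
CEPH_OSD_EXISTS and CEPH_OSD_UP.

```c
#include <stdint.h>

/* Assumed flag values, mirroring CEPH_OSD_EXISTS / CEPH_OSD_UP. */
#define OSD_EXISTS (1u << 0)
#define OSD_UP     (1u << 1)

/* new_state carries an XOR mask that is applied to the current state. */
static uint8_t apply_new_state(uint8_t state, uint8_t xorstate)
{
	return state ^ xorstate;
}

/* new_up_client marks the OSD as existing and up. */
static uint8_t apply_new_up_client(uint8_t state)
{
	return state | OSD_EXISTS | OSD_UP;
}

/* Old behaviour: encoding order, new_up_client and then new_state. */
static uint8_t apply_in_encoding_order(uint8_t state, uint8_t xorstate)
{
	state = apply_new_up_client(state);
	return apply_new_state(state, xorstate);
}

/* Fixed behaviour: new_state first, then new_up_client. */
static uint8_t apply_state_first(uint8_t state, uint8_t xorstate)
{
	state = apply_new_state(state, xorstate);
	return apply_new_up_client(state);
}
```

Starting from EXISTS with xorstate=EXISTS, apply_in_encoding_order() ends
up with UP only (the "!EXISTS but UP" state), while apply_state_first()
ends up with EXISTS | UP.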

Fixes: http://tracker.ceph.com/issues/14901

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
[idryomov@gmail.com: backport to 3.10-3.14: strip primary-affinity]
Signed-off-by: Jiri Slaby

libceph: set 'exists' flag for newly up osd

2016-08-19T07:50:44+00:00

commit 6dd74e44dc1df85f125982a8d6591bc4a76c9f5d upstream.

Signed-off-by: Yan, Zheng 
Reviewed-by: Sage Weil 
Signed-off-by: Ilya Dryomov 
Cc: Ilya Dryomov 
Signed-off-by: Jiri Slaby

libceph: don't bail early from try_read() when skipping a message

2016-03-03T11:46:05+00:00

commit e7a88e82fe380459b864e05b372638aeacb0f52d upstream.

The contract between try_read() and try_write() is that when called
each processes as much data as possible.  When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more.  try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
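
The difference between the two behaviours can be sketched in userspace as
follows.  This is only an illustration under simplifying assumptions: the
queue of pending messages is a plain array, and struct msg and drain() are
hypothetical names, not messenger code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending incoming message. */
struct msg {
	bool skip;	/* osd_client asked us to discard this one */
};

/*
 * Walk the pending messages.  With bail_on_skip set, return right after
 * discarding the first skipped message (the old behaviour); otherwise
 * keep going until everything pending has been handled.
 *
 * Returns the number of messages consumed; *delivered counts the ones
 * that were not skipped.
 */
static size_t drain(const struct msg *q, size_t n, bool bail_on_skip,
		    size_t *delivered)
{
	size_t i;

	*delivered = 0;
	for (i = 0; i < n; i++) {
		if (q[i].skip) {
			if (bail_on_skip)
				return i + 1;	/* leaves the rest unread */
			continue;		/* discard and check for more */
		}
		(*delivered)++;
	}
	return n;
}
```

With pending messages { skip, data, skip, data }, the bailing variant
consumes one message and delivers none per call, whereas the non-bailing
variant consumes all four and delivers two.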

Reported-by: Varada Kari 
Signed-off-by: Ilya Dryomov 
Tested-by: Varada Kari 
Reviewed-by: Alex Elder 
Signed-off-by: Jiri Slaby

crush: fix a bug in tree bucket decode

2015-08-04T14:52:26+00:00

commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream.

struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -> u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.
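
To illustrate why the cursor advancement matters more than the truncation,
here is a small userspace sketch.  decode_8_safe() and decode_32_safe()
below are simplified, hypothetical stand-ins for the ceph_decode_*_safe()
macros and assume a little-endian wire format.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one byte and advance the cursor by 1.  Returns 0 or -1 if short. */
static int decode_8_safe(const uint8_t **p, const uint8_t *end, uint8_t *v)
{
	if ((size_t)(end - *p) < 1)
		return -1;
	*v = (*p)[0];
	*p += 1;
	return 0;
}

/* Read a little-endian u32 and advance the cursor by 4. */
static int decode_32_safe(const uint8_t **p, const uint8_t *end, uint32_t *v)
{
	if ((size_t)(end - *p) < 4)
		return -1;
	*v = (uint32_t)(*p)[0] |
	     ((uint32_t)(*p)[1] << 8) |
	     ((uint32_t)(*p)[2] << 16) |
	     ((uint32_t)(*p)[3] << 24);
	*p += 4;
	return 0;
}
```

For a buffer starting with the byte 5, both helpers yield 5 once the
result is truncated to u8, but the 32-bit variant leaves the cursor three
bytes too far, so everything decoded afterwards is misaligned.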

Fixes: http://tracker.ceph.com/issues/2759

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
Signed-off-by: Jiri Slaby

libceph: request a new osdmap if lingering request maps to no osd

2015-06-03T09:33:07+00:00

commit b0494532214bdfbf241e94fabab5dd46f7b82631 upstream.

This commit does two things.  First, if there are any homeless
lingering requests, we now request a new osdmap even if the osdmap that
is being processed brought no changes, i.e. if a given lingering
request turned homeless in one of the previous epochs and remained
homeless in the current epoch.  Not doing so leaves us with a stale
osdmap and as a result we may miss our window for reestablishing the
watch and lose notifies.

MON=1 OSD=1:

    # cat linger-needmap.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    rbd map dne/dne # obtain a new osdmap as a side effect (!)
    sleep 1
    ceph osd in 0
    rbd resize --size 2 test
    # rbd info test | grep size -> 2M
    # blockdev --getsize $DEV -> 1M

N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
above is enough to make it miss that resize notify, but that is a
bug^Wlimitation of ceph watch/notify v1.

Second, homeless lingering requests are now kicked just like those
lingering requests whose mapping has changed.  This is mainly to
recognize that a homeless lingering request makes no sense and to
preserve the invariant that a registered lingering request is not
sitting on any of r_req_lru_item lists.  This spares us a WARN_ON,
which commit ba9d114ec557 ("libceph: clear r_req_lru_item in
__unregister_linger_request()") tried to fix the _wrong_ way.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Jiri Slaby

libceph: fix double __remove_osd() problem

2015-03-05T14:36:59+00:00

commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream.

It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            

con_fault_finish()
  osd_reset()
                              
                              ceph_osdc_handle_map()
                                
                                kick_requests()
                                  
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  
                                
    
    
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safeguard around __remove_osd().
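
The general shape of such a guard can be sketched in userspace with a
doubly linked list instead of an rbtree.  All names below are hypothetical
and the "unlinked" test is an assumption made for illustration; the point
is only that removal becomes a no-op once the node has been unlinked.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
};

/* A node that points at itself is treated as "not on any list". */
static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

static bool node_unlinked(const struct node *n)
{
	return n->next == n;
}

/* Insert n right after head. */
static void node_add(struct node *head, struct node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/*
 * Guarded removal: unlink only if the node is still linked, then reset
 * it so that a second call is harmless.  Returns true if it was removed.
 */
static bool node_remove_once(struct node *n)
{
	if (node_unlinked(n))
		return false;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
	return true;
}
```

Calling node_remove_once() twice on the same node removes it the first
time and does nothing the second time, leaving the list intact.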

Fixes: http://tracker.ceph.com/issues/8087

Cc: Sage Weil 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Reviewed-by: Alex Elder 
Signed-off-by: Jiri Slaby

libceph: change from BUG to WARN for __remove_osd() asserts

2015-03-05T14:36:58+00:00

commit cc9f1f518cec079289d11d732efa490306b1ddad upstream.

No reason to use BUG_ON for osd request list assertions.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Jiri Slaby

libceph: assert both regular and lingering lists in __remove_osd()

2015-03-05T14:36:57+00:00

commit 7c6e6fc53e7335570ed82f77656cedce1502744e upstream.

It is important that both regular and lingering requests lists are
empty when the OSD is removed.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder 
Signed-off-by: Jiri Slaby

libceph: do not crash on large auth tickets

2014-11-19T17:38:17+00:00

commit aaef31703a0cf6a733e651885bfb49edc3ac6774 upstream.

Large (greater than 32k, i.e. PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER with
4k pages) auth tickets will have their buffers vmalloc'ed, which leads
to the following crash in crypto:

[   28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0
[   28.686032] IP: scatterwalk_pagedone+0x22/0x80
[   28.686032] PGD 0
[   28.688088] Oops: 0000 [#1] PREEMPT SMP
[   28.688088] Modules linked in:
[   28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305
[   28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   28.688088] Workqueue: ceph-msgr con_work
[   28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000
[   28.688088] RIP: 0010: scatterwalk_pagedone+0x22/0x80
[   28.688088] RSP: 0018:ffff8800d903f688  EFLAGS: 00010286
[   28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0
[   28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750
[   28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880
[   28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010
[   28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000
[   28.688088] FS:  00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[   28.688088] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0
[   28.688088] Stack:
[   28.688088]  ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32
[   28.688088]  ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020
[   28.688088]  ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010
[   28.688088] Call Trace:
[   28.688088]  scatterwalk_done+0x38/0x40
[   28.688088]  scatterwalk_done+0x38/0x40
[   28.688088]  blkcipher_walk_done+0x182/0x220
[   28.688088]  crypto_cbc_encrypt+0x15f/0x180
[   28.688088]  ? crypto_aes_set_key+0x30/0x30
[   28.688088]  ceph_aes_encrypt2+0x29c/0x2e0
[   28.688088]  ceph_encrypt2+0x93/0xb0
[   28.688088]  ceph_x_encrypt+0x4a/0x60
[   28.688088]  ? ceph_buffer_new+0x5d/0xf0
[   28.688088]  ceph_x_build_authorizer.isra.6+0x297/0x360
[   28.688088]  ? kmem_cache_alloc_trace+0x11b/0x1c0
[   28.688088]  ? ceph_auth_create_authorizer+0x36/0x80
[   28.688088]  ceph_x_create_authorizer+0x63/0xd0
[   28.688088]  ceph_auth_create_authorizer+0x54/0x80
[   28.688088]  get_authorizer+0x80/0xd0
[   28.688088]  prepare_write_connect+0x18b/0x2b0
[   28.688088]  try_read+0x1e59/0x1f10

This is because we set up crypto scatterlists as if all buffers were
kmalloc'ed.  Fix it.
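
The underlying idea is that a vmalloc'ed buffer is only virtually
contiguous, so it has to be described one page at a time, whereas a
kmalloc'ed buffer can be described with a single entry.  The following
userspace sketch shows just the splitting arithmetic; it is not the
actual patch, the struct and function names are hypothetical, and a 4k
page size is assumed.  In the kernel, each chunk's page would be looked
up with vmalloc_to_page().

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* One per-page piece of a buffer, as a scatterlist entry would need. */
struct chunk {
	size_t page_index;	/* addr / page size */
	size_t offset;		/* offset within that page */
	size_t len;		/* bytes covered in that page */
};

/* Number of pages spanned by [addr, addr + len). */
static size_t chunk_count(uintptr_t addr, size_t len)
{
	size_t off = addr % SKETCH_PAGE_SIZE;

	if (len == 0)
		return 0;
	return (off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Fill out[] with up to max per-page chunks; returns how many were used. */
static size_t split_into_chunks(uintptr_t addr, size_t len,
				struct chunk *out, size_t max)
{
	size_t off = addr % SKETCH_PAGE_SIZE;
	size_t page = addr / SKETCH_PAGE_SIZE;
	size_t n = 0;

	while (len > 0 && n < max) {
		size_t l = SKETCH_PAGE_SIZE - off;

		if (l > len)
			l = len;
		out[n].page_index = page;
		out[n].offset = off;
		out[n].len = l;
		n++;
		page++;
		off = 0;	/* every later chunk starts at a page boundary */
		len -= l;
	}
	return n;
}
```

An 8192-byte buffer starting 100 bytes into a page spans three pages and
is split into chunks of 3996, 4096 and 100 bytes.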

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Jiri Slaby

libceph: ceph-msgr workqueue needs a resque worker

2014-10-31T14:11:32+00:00

commit f9865f06f7f18c6661c88d0511f05c48612319cc upstream.

Commit f363e45fd118 ("net/ceph: make ceph_msgr_wq non-reentrant")
effectively removed WQ_MEM_RECLAIM flag from ceph_msgr_wq.  This is
wrong - libceph is very much a memory reclaim path, so restore it.

Cc: stable@vger.kernel.org # needs backporting for < 3.12
Signed-off-by: Ilya Dryomov 
Tested-by: Micha Krause 
Reviewed-by: Sage Weil 
Signed-off-by: Jiri Slaby