linux-stable.git/net/ceph, branch v3.18.26

crush: ensuring at most num-rep osds are selected

2015-06-15T18:29:41+00:00

[ Upstream commit 45002267e8d2699bf9b022315bee3dd13b044843 ]

Crush temporary buffers are sized according to the replica count
configured by the user.  When a rule allows more final osds to be
selected than there are replicas, the buffers overlap, which causes
a crash.  Ensure that at most num-rep osds are selected, even if the
rule allows more.
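
The clamp described above can be illustrated with a small userspace
sketch.  This is not the kernel's crush code: the helper name
clamp_selected() and its parameters are hypothetical, and the only
assumption carried over from this message is that the result buffer
holds exactly num-rep entries.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy at most num_rep of the candidate osd ids
 * produced by a rule into a result buffer sized for num_rep entries.
 * Returns the number of ids actually stored, so the caller never
 * writes past the end of the buffer.
 */
static size_t clamp_selected(const int *candidates, size_t num_candidates,
                             int *result, size_t num_rep)
{
    size_t n = num_candidates < num_rep ? num_candidates : num_rep;
    size_t i;

    for (i = 0; i < n; i++)
        result[i] = candidates[i];
    return n;
}
```

With five candidates and num_rep == 3, only the first three ids are
stored and the remaining two are ignored.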

Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0,
                          234b066ba04976783d15ff2abc3e81b6cc06fb10.

Signed-off-by: Ilya Dryomov 
Signed-off-by: Sasha Levin

Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"

2015-06-09T17:43:54+00:00

[ Upstream commit 521a04d06a729e5971cdee7f84080387ed320527 ]

This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies.  In retrospect it's pretty obvious that
r_req_lru_item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().

The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.

Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Sasha Levin

Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"

2015-06-09T17:43:53+00:00

[ Upstream commit 521a04d06a729e5971cdee7f84080387ed320527 ]

This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64.

.. which introduced a regression that prevented all lingering requests
requeued in kick_requests() from ever being sent to the OSDs, resulting
in a lot of missed notifies.  In retrospect it's pretty obvious that
r_req_lru_item in the case of lingering requests can be used not
only for notarget, but also for unsent linkage due to how tightly
actual map and enqueue operations are coupled in __map_request().

The assertion that was being silenced is taken care of in the previous
("libceph: request a new osdmap if lingering request maps to no osd")
commit: by always kicking homeless lingering requests we ensure that
none of them ends up on the notarget list outside of the critical
section guarded by request_mutex.

Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd"
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Sasha Levin

Revert "libceph: use memalloc flags for net IO"

2015-04-17T00:13:14+00:00

[ Upstream commit 6d7fdb0ab351b33d4c12d53fe44be030b90fc9d4 ]

This reverts commit 89baaa570ab0b476db09408d209578cfed700e9f.

Dirty page throttling should be sufficient for us in the general case
so there is no need to use __GFP_MEMALLOC - it would be needed only in
the swap-over-rbd case, which we currently don't support.  (It would
probably take approximately the commit that is being reverted to add
that support, but we would also need the "swap" option to distinguish
from the general case and make sure swap ceph_client-s aren't shared
with anything else.)  See ceph-devel threads [1] and [2] for the
details of why enabling pfmemalloc reserves for all cases is a bad
thing.

On top of potential system lockups related to drained emergency
reserves, this turned out to cause ceph lockups in case peers are on
the same host and communicating via loopback due to sk_filter()
dropping pfmemalloc skbs on the receiving side because the receiving
loopback socket is not tagged with SOCK_MEMALLOC.
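
The receive-side drop described above can be modelled in a few lines
of userspace C.  This is only a toy model of the condition: the types
toy_skb and toy_sock and the function toy_receive_allowed() are
hypothetical stand-ins, not kernel structures or APIs.

```c
#include <stdbool.h>

/* Toy stand-ins: one flag each, mirroring the two properties at play. */
struct toy_skb  { bool pfmemalloc; };  /* buffer came from emergency reserves */
struct toy_sock { bool memalloc; };    /* socket is tagged SOCK_MEMALLOC */

/*
 * A buffer allocated from emergency reserves is accepted only by a
 * socket that is itself tagged as memalloc; otherwise it is dropped.
 */
static bool toy_receive_allowed(const struct toy_sock *sk,
                                const struct toy_skb *skb)
{
    if (skb->pfmemalloc && !sk->memalloc)
        return false;
    return true;
}
```

In the loopback case the sender allocates with memalloc flags but the
receiving socket is untagged, which is the combination that returns
false here.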

[1] "SOCK_MEMALLOC vs loopback"
    http://www.spinics.net/lists/ceph-devel/msg22998.html
[2] "[PATCH] libceph: don't set memalloc flags in loopback case"
    http://www.spinics.net/lists/ceph-devel/msg23392.html

Conflicts:
	net/ceph/messenger.c [ context: tcp_nodelay option ]

Cc: Mike Christie 
Cc: Mel Gorman 
Cc: Sage Weil 
Cc: stable@vger.kernel.org # 3.18+, needs backporting
Signed-off-by: Ilya Dryomov 
Acked-by: Mike Christie 
Acked-by: Mel Gorman 
[idryomov@gmail.com: backport to 3.18, 3.19: context]
Signed-off-by: Sasha Levin

libceph: fix double __remove_osd() problem

2015-03-06T22:53:05+00:00

commit 7eb71e0351fbb1b242ae70abb7bb17107fe2f792 upstream.

It turns out it's possible to get __remove_osd() called twice on the
same OSD.  That doesn't sit well with rb_erase() - depending on the
shape of the tree we can get a NULL dereference, a soft lockup or
a random crash at some point in the future as we end up touching freed
memory.  One scenario that I was able to reproduce is as follows:

            

con_fault_finish()
  osd_reset()
                              
                              ceph_osdc_handle_map()
                                
                                kick_requests()
                                  
                                  reset_changed_osds()
                                    __reset_osd()
                                      __remove_osd()
                                  
                                
    
    
    __kick_osd_requests()
      __reset_osd()
        __remove_osd() <-- !!!

A case can be made that osd refcounting is imperfect and reworking it
would be a proper resolution, but for now Sage and I decided to fix
this by adding a safeguard around __remove_osd().
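
The idea of such a guard can be sketched in userspace C.  This is a
toy model, not the actual patch: it uses a doubly linked list instead
of an rbtree, and struct toy_osd, toy_insert() and toy_remove() are
hypothetical names.  In the kernel, one plausible shape for the same
check would be the RB_EMPTY_NODE() and RB_CLEAR_NODE() helpers from
linux/rbtree.h, but that is an assumption here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an osd linked into a container (a list, not an rbtree). */
struct toy_osd {
    struct toy_osd *prev, *next;
    bool linked;    /* plays the role of "node is currently in the tree" */
};

/* Link osd right after head; head must be a circular list head. */
static void toy_insert(struct toy_osd *head, struct toy_osd *osd)
{
    osd->next = head->next;
    osd->prev = head;
    head->next->prev = osd;
    head->next = osd;
    osd->linked = true;
}

/*
 * Guarded removal: returns true if this call unlinked the node and
 * false if the node had already been removed, so a second call is a
 * harmless no-op instead of touching stale pointers.
 */
static bool toy_remove(struct toy_osd *osd)
{
    if (!osd->linked)
        return false;
    osd->prev->next = osd->next;
    osd->next->prev = osd->prev;
    osd->prev = osd->next = NULL;
    osd->linked = false;
    return true;
}
```

Calling toy_remove() twice on the same node unlinks it once; the second
call only reports that there was nothing left to do.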

Fixes: http://tracker.ceph.com/issues/8087

Cc: Sage Weil 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Reviewed-by: Alex Elder 
Signed-off-by: Greg Kroah-Hartman

libceph: change from BUG to WARN for __remove_osd() asserts

2014-11-13T19:26:34+00:00

No reason to use BUG_ON for osd request list assertions.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder

libceph: clear r_req_lru_item in __unregister_linger_request()

2014-11-13T19:21:14+00:00

kick_requests() can put linger requests on the notarget list.  This
means we need to clear the much-overloaded req->r_req_lru_item in
__unregister_linger_request() as well, or we get an assertion failure
in ceph_osdc_release_request() - !list_empty(&req->r_req_lru_item).

AFAICT the assumption was that registered linger requests cannot be on
any of req->r_req_lru_item lists, but that's clearly not the case.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder

libceph: unlink from o_linger_requests when clearing r_osd

2014-11-13T19:21:13+00:00

Requests have to be unlinked from both osd->o_requests (normal
requests) and osd->o_linger_requests (linger requests) lists when
clearing req->r_osd.  Otherwise __unregister_linger_request() gets
confused and we trip over a !list_empty(&osd->o_linger_requests)
assert in __remove_osd().

MON=1 OSD=1:

    # cat remove-osd.sh
    #!/bin/bash
    rbd create --size 1 test
    DEV=$(rbd map test)
    ceph osd out 0
    sleep 3
    rbd map dne/dne # obtain a new osdmap as a side effect
    rbd unmap $DEV & # will block
    sleep 3
    ceph osd in 0

Signed-off-by: Ilya Dryomov 
Reviewed-by: Alex Elder

libceph: do not crash on large auth tickets

2014-11-13T19:21:12+00:00

Large (greater than 32k, the allocation size that corresponds to
PAGE_ALLOC_COSTLY_ORDER with 4k pages) auth tickets will have their
buffers vmalloc'ed, which leads to the following crash in crypto:

[   28.685082] BUG: unable to handle kernel paging request at ffffeb04000032c0
[   28.686032] IP: [] scatterwalk_pagedone+0x22/0x80
[   28.686032] PGD 0
[   28.688088] Oops: 0000 [#1] PREEMPT SMP
[   28.688088] Modules linked in:
[   28.688088] CPU: 0 PID: 878 Comm: kworker/0:2 Not tainted 3.17.0-vm+ #305
[   28.688088] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[   28.688088] Workqueue: ceph-msgr con_work
[   28.688088] task: ffff88011a7f9030 ti: ffff8800d903c000 task.ti: ffff8800d903c000
[   28.688088] RIP: 0010:[]  [] scatterwalk_pagedone+0x22/0x80
[   28.688088] RSP: 0018:ffff8800d903f688  EFLAGS: 00010286
[   28.688088] RAX: ffffeb04000032c0 RBX: ffff8800d903f718 RCX: ffffeb04000032c0
[   28.688088] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8800d903f750
[   28.688088] RBP: ffff8800d903f688 R08: 00000000000007de R09: ffff8800d903f880
[   28.688088] R10: 18df467c72d6257b R11: 0000000000000000 R12: 0000000000000010
[   28.688088] R13: ffff8800d903f750 R14: ffff8800d903f8a0 R15: 0000000000000000
[   28.688088] FS:  00007f50a41c7700(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
[   28.688088] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   28.688088] CR2: ffffeb04000032c0 CR3: 00000000da3f3000 CR4: 00000000000006b0
[   28.688088] Stack:
[   28.688088]  ffff8800d903f698 ffffffff81392ca8 ffff8800d903f6e8 ffffffff81395d32
[   28.688088]  ffff8800dac96000 ffff880000000000 ffff8800d903f980 ffff880119b7e020
[   28.688088]  ffff880119b7e010 0000000000000000 0000000000000010 0000000000000010
[   28.688088] Call Trace:
[   28.688088]  [] scatterwalk_done+0x38/0x40
[   28.688088]  [] scatterwalk_done+0x38/0x40
[   28.688088]  [] blkcipher_walk_done+0x182/0x220
[   28.688088]  [] crypto_cbc_encrypt+0x15f/0x180
[   28.688088]  [] ? crypto_aes_set_key+0x30/0x30
[   28.688088]  [] ceph_aes_encrypt2+0x29c/0x2e0
[   28.688088]  [] ceph_encrypt2+0x93/0xb0
[   28.688088]  [] ceph_x_encrypt+0x4a/0x60
[   28.688088]  [] ? ceph_buffer_new+0x5d/0xf0
[   28.688088]  [] ceph_x_build_authorizer.isra.6+0x297/0x360
[   28.688088]  [] ? kmem_cache_alloc_trace+0x11b/0x1c0
[   28.688088]  [] ? ceph_auth_create_authorizer+0x36/0x80
[   28.688088]  [] ceph_x_create_authorizer+0x63/0xd0
[   28.688088]  [] ceph_auth_create_authorizer+0x54/0x80
[   28.688088]  [] get_authorizer+0x80/0xd0
[   28.688088]  [] prepare_write_connect+0x18b/0x2b0
[   28.688088]  [] try_read+0x1e59/0x1f10

This is because we set up crypto scatterlists as if all buffers were
kmalloc'ed.  Fix it.
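
A vmalloc'ed buffer is contiguous only in virtual address space, so it
cannot be described by a single scatterlist entry the way a kmalloc'ed
buffer can; each page needs an entry of its own.  The page-splitting
arithmetic can be sketched in userspace C as follows.  This is an
illustration only and not the fix itself: struct toy_chunk,
toy_split_by_page() and TOY_PAGE_SIZE are hypothetical, and a 4k page
size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u    /* assumed page size for the illustration */

struct toy_chunk {
    size_t offset;    /* offset of the chunk within its page */
    size_t len;       /* bytes of the buffer that live in that page */
};

/*
 * Split [addr, addr + len) at page boundaries.  Stores up to
 * max_chunks entries in out and returns the number of chunks the
 * range needs, which may exceed max_chunks.
 */
static size_t toy_split_by_page(uintptr_t addr, size_t len,
                                struct toy_chunk *out, size_t max_chunks)
{
    size_t n = 0;

    while (len > 0) {
        size_t off = (size_t)(addr % TOY_PAGE_SIZE);
        size_t chunk = TOY_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        if (n < max_chunks) {
            out[n].offset = off;
            out[n].len = chunk;
        }
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

An 8192-byte buffer that starts 100 bytes into a page needs three
entries: 3996 bytes, then a full page, then the trailing 100 bytes.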

Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil

libceph: eliminate unnecessary allocation in process_one_ticket()

2014-10-31T20:43:08+00:00

Commit c27a3e4d667f ("libceph: do not hard code max auth ticket len"),
while fixing a buffer overflow, tried to keep as much of the
surrounding code unchanged as possible and introduced an unnecessary
kmalloc() in the unencrypted ticket path.  It is likely to fail on
huge tickets, so get rid of it.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil