linux-stable.git/net/ceph, branch linux-3.16.y

libceph: handle an empty authorize reply

2020-01-11T02:04:41+00:00

commit 0fd3fd0a9bb0b02b6435bb7070e9f7b82a23f068 upstream.

The authorize reply can be empty, for example when the ticket used to
build the authorizer is too old and TAG_BADAUTHORIZER is returned from
the service.  Calling ->verify_authorizer_reply() results in an attempt
to decrypt and validate (somewhat) random data in au->buf (most likely
the signature block from calc_signature()), which fails and ends up in
con_fault_finish() with !con->auth_retry.  The ticket isn't invalidated
and the connection is retried again and again until a new ticket is
obtained from the monitor:

  libceph: osd2 192.168.122.1:6809 bad authorize reply
  libceph: osd2 192.168.122.1:6809 bad authorize reply
  libceph: osd2 192.168.122.1:6809 bad authorize reply
  libceph: osd2 192.168.122.1:6809 bad authorize reply

Let the TAG_BADAUTHORIZER handler kick in and increment con->auth_retry.
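The control flow after this fix can be modelled in user space as below. This is a minimal sketch, not the messenger code. The names conn_model, handle_connect_reply(), verify_reply() and the check_len switch are hypothetical, and verify_reply() is a toy that rejects only empty replies. The sketch shows why skipping verification of an empty reply lets the TAG_BADAUTHORIZER path bump auth_retry instead of faulting on leftover bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the connect-reply tags. */
enum connect_tag { TAG_READY, TAG_BADAUTHORIZER };

/* Hypothetical model of the two connection fields that matter here. */
struct conn_model {
	int auth_retry;		/* mirrors con->auth_retry */
	bool faulted;		/* "bad authorize reply" fault path taken */
};

/* Toy stand-in for ->verify_authorizer_reply(): an empty reply can never
 * verify (there is nothing valid to decrypt); non-empty replies are
 * treated as valid in this model. */
static int verify_reply(size_t reply_len)
{
	return reply_len > 0 ? 0 : -1;
}

/* check_len selects the fixed behaviour (skip verification when the peer
 * sent no authorize reply) versus the old one (always verify). */
static void handle_connect_reply(struct conn_model *con, enum connect_tag tag,
				 size_t reply_len, bool check_len)
{
	bool want_verify = check_len ? reply_len > 0 : true;

	if (want_verify && verify_reply(reply_len) < 0) {
		con->faulted = true;	/* fault without bumping auth_retry */
		return;
	}
	if (tag == TAG_BADAUTHORIZER)
		con->auth_retry++;	/* lets the retry path get a fresh ticket */
}
```

With check_len false (the old behaviour), the empty reply faults the connection and auth_retry stays at 0, which matches the endless "bad authorize reply" loop in the log above.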

Fixes: 5c056fdc5b47 ("libceph: verify authorize reply on connect")
Link: https://tracker.ceph.com/issues/20164
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
[idryomov@gmail.com: backport to 4.4: extra arg, no CEPHX_V2]
Signed-off-by: Ben Hutchings

libceph: validate con->state at the top of try_write()

2018-10-21T07:45:46+00:00

commit 9c55ad1c214d9f8c4594ac2c3fa392c1c32431a7 upstream.

ceph_con_workfn() validates con->state before calling try_read() and
then try_write().  However, try_read() temporarily releases con->mutex,
notably in process_message() and ceph_con_in_msg_alloc(), opening the
window for ceph_con_close() to sneak in, close the connection and
release con->sock.  When try_write() is called on the assumption that
con->state is still valid (i.e. not STANDBY or CLOSED), a NULL sock
gets passed to the networking stack:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
  IP: selinux_socket_sendmsg+0x5/0x20

Make sure con->state is valid at the top of try_write() and add an
explicit BUG_ON for this, similar to try_read().
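A user-space sketch of the check at the top of try_write(), assuming a hypothetical con_model that has only the fields that matter here. In this sketch assert() stands in for BUG_ON(). The state is re-read after the caller may have dropped the lock, so a connection closed in the meantime is skipped before its (now NULL) sock is touched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of connection states; CLOSED and STANDBY are the
 * ones the commit message names as invalid for writing. */
enum con_state { CON_CLOSED, CON_STANDBY, CON_OPEN };

struct sock_model { int fd; };

struct con_model {
	enum con_state state;
	struct sock_model *sock;	/* released (NULL) once closed */
};

/* Stand-in for handing data to the networking stack: it dereferences
 * sock, which is where the NULL pointer oops happened. */
static int send_bytes(struct sock_model *sock)
{
	return sock->fd;
}

/* Returns 0 if there is nothing to do, otherwise the send result. */
static int try_write(struct con_model *con)
{
	/* The mutex may have been dropped in try_read(), so re-check the
	 * state here rather than trusting what the caller saw earlier. */
	if (con->state == CON_CLOSED || con->state == CON_STANDBY)
		return 0;

	/* assert() plays the role of the kernel's BUG_ON(). */
	assert(con->sock != NULL);
	return send_bytes(con->sock);
}
```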

Link: https://tracker.ceph.com/issues/23706
Signed-off-by: Ilya Dryomov 
Reviewed-by: Jason Dillaman 
Signed-off-by: Ben Hutchings

libceph: potential NULL dereference in ceph_msg_data_create()

2017-11-11T13:32:58+00:00

commit 7c40b22f6f84c98a1d36e6d0a4346e58f05e45d8 upstream.

If kmem_cache_zalloc() returns NULL, then INIT_LIST_HEAD(&data->links)
will Oops.  The callers aren't really prepared for NULL returns, so it
doesn't make a lot of difference in real life.
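The shape of the fix can be sketched in user space, with calloc() standing in for kmem_cache_zalloc() and a hand-rolled list_head. The names msg_data_model and msg_data_create() are hypothetical. The only change that matters is returning before the NULL pointer is dereferenced.

```c
#include <stdlib.h>

/* Minimal user-space versions of the kernel's list_head helpers. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

struct msg_data_model {
	struct list_head links;
	int type;
};

/* calloc() stands in for kmem_cache_zalloc(): both may return NULL. */
static struct msg_data_model *msg_data_create(int type)
{
	struct msg_data_model *data = calloc(1, sizeof(*data));

	if (data == NULL)
		return NULL;	/* bail out before touching data->links */

	data->type = type;
	INIT_LIST_HEAD(&data->links);
	return data;
}
```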

Fixes: 5240d9f95dfe ("libceph: replace message data pointer with list")
Signed-off-by: Dan Carpenter 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Ben Hutchings

libceph: NULL deref on crush_decode() error path

2017-09-15T17:29:49+00:00

commit 293dffaad8d500e1a5336eeb90d544cf40d4fbd8 upstream.

If there is not enough space then ceph_decode_32_safe() does a goto bad.
We need to return an error code in that situation.  The current code
returns ERR_PTR(0) which is NULL.  The callers are not expecting that
and it results in a NULL dereference.
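User-space copies of the kernel's ERR_PTR(), IS_ERR() and PTR_ERR() helpers show why ERR_PTR(0) is dangerous. decode_map() and crush_map_model are hypothetical stand-ins for crush_decode(), and the sketch assumes the usual MAX_ERRNO of 4095. A caller that checks only IS_ERR() treats NULL as success and then dereferences it, so the error path has to carry a real error code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space copies of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical decoded object. */
struct crush_map_model {
	uint32_t max_buckets;
};

/* Hypothetical decoder: a short buffer jumps to "bad", the way
 * ceph_decode_32_safe() does.  err must hold a real error code by then,
 * otherwise "bad" returns ERR_PTR(0), i.e. NULL. */
static struct crush_map_model *decode_map(const unsigned char *p, size_t len)
{
	struct crush_map_model *c;
	int err = -EINVAL;

	c = calloc(1, sizeof(*c));
	if (c == NULL)
		return ERR_PTR(-ENOMEM);

	if (len < 4)
		goto bad;
	c->max_buckets = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
			 (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
	return c;

bad:
	free(c);
	return ERR_PTR(err);
}
```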

Fixes: f24e9980eb86 ("ceph: OSD client")
Signed-off-by: Dan Carpenter 
Reviewed-by: Ilya Dryomov 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Ben Hutchings

libceph: force GFP_NOIO for socket allocations

2017-07-18T17:40:21+00:00

commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [] schedule+0x29/0x70
    [] schedule_timeout+0x1bd/0x200
    [] ? ttwu_do_wakeup+0x2c/0x120
    [] ? ttwu_do_activate.constprop.135+0x66/0x70
    [] wait_for_completion+0xbf/0x180
    [] ? try_to_wake_up+0x390/0x390
    [] flush_work+0x165/0x250
    [] ? worker_detach_from_pool+0xd0/0xd0
    [] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [] ? __slab_free+0xee/0x234
    [] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [] ? lookup_page_cgroup_used+0xe/0x30
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [] ? wake_atomic_t_function+0x40/0x40
    [] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [] super_cache_scan+0x178/0x180
    [] shrink_slab_node+0x14e/0x340
    [] ? mem_cgroup_iter+0x16b/0x450
    [] shrink_slab+0x100/0x140
    [] do_try_to_free_pages+0x335/0x490
    [] try_to_free_pages+0xb9/0x1f0
    [] ? __alloc_pages_direct_compact+0x69/0x1be
    [] __alloc_pages_nodemask+0x69a/0xb40
    [] alloc_pages_current+0x9e/0x110
    [] new_slab+0x2c5/0x390
    [] __slab_alloc+0x33b/0x459
    [] ? sock_alloc_inode+0x2d/0xd0
    [] ? inet_sendmsg+0x71/0xc0
    [] ? sock_alloc_inode+0x2d/0xd0
    [] kmem_cache_alloc+0x1a2/0x1b0
    [] sock_alloc_inode+0x2d/0xd0
    [] alloc_inode+0x26/0xa0
    [] new_inode_pseudo+0x1a/0x70
    [] sock_alloc+0x1e/0x80
    [] __sock_create+0x95/0x220
    [] sock_create_kern+0x24/0x30
    [] con_work+0xef9/0x2050 [libceph]
    [] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [] process_one_work+0x159/0x4f0
    [] worker_thread+0x11b/0x530
    [] ? create_worker+0x1d0/0x1d0
    [] kthread+0xc9/0xe0
    [] ? flush_kthread_worker+0x90/0x90
    [] ret_from_fork+0x58/0x90
    [] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.
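The save/restore pattern can be modelled in user space as follows. Everything here is hypothetical: the M_* flag values, task_flags, noio_save(), noio_restore() and effective_gfp(). The model only mirrors the idea that, while the per-task NOIO flag is set, allocations lose their I/O and FS reclaim bits. This holds even when the caller asked for GFP_KERNEL, as sock_alloc_inode() does.

```c
/* Hypothetical flag values for illustration only; the kernel's real
 * __GFP_* and PF_* bit assignments are different. */
#define M_GFP_IO		0x1u	/* reclaim may start I/O */
#define M_GFP_FS		0x2u	/* reclaim may call into filesystems */
#define M_GFP_WAIT		0x4u	/* allocation may sleep */
#define M_GFP_KERNEL		(M_GFP_WAIT | M_GFP_IO | M_GFP_FS)
#define M_GFP_NOIO		(M_GFP_WAIT)
#define M_PF_MEMALLOC_NOIO	0x1u

/* Stand-in for current->flags; one task only, so a plain static. */
static unsigned int task_flags;

/* Enter a NOIO section and return the previous setting. */
static unsigned int noio_save(void)
{
	unsigned int old = task_flags & M_PF_MEMALLOC_NOIO;

	task_flags |= M_PF_MEMALLOC_NOIO;
	return old;
}

/* Leave the section, restoring whatever was in force before. */
static void noio_restore(unsigned int old)
{
	task_flags = (task_flags & ~M_PF_MEMALLOC_NOIO) | old;
}

/* Flags the allocator would actually honour: inside a NOIO section,
 * I/O and FS reclaim are masked off regardless of what was asked for. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (task_flags & M_PF_MEMALLOC_NOIO)
		gfp &= ~(M_GFP_IO | M_GFP_FS);
	return gfp;
}

/* Model of socket creation in con_work(): sock_alloc_inode() requests
 * GFP_KERNEL, but the save/restore bracket turns it into GFP_NOIO. */
static unsigned int create_socket_gfp(void)
{
	unsigned int noio_flag = noio_save();
	unsigned int used = effective_gfp(M_GFP_KERNEL);

	noio_restore(noio_flag);
	return used;
}
```

The save call returns the previous value, and restore puts that value back instead of clearing the flag unconditionally. This keeps nested NOIO sections correct.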

Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Jeff Layton 
[bwh: Backported to 3.16:
 - memalloc_noio_{save,restore}() are declared in 
 - Adjust context]
Signed-off-by: Ben Hutchings

libceph: don't set weight to IN when OSD is destroyed

2017-07-18T17:40:07+00:00

commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Ben Hutchings

libceph: verify authorize reply on connect

2017-03-16T02:26:34+00:00

commit 5c056fdc5b474329037f2aa18401bd73033e0ce0 upstream.

After sending an authorizer (ceph_x_authorize_a + ceph_x_authorize_b),
the client gets back a ceph_x_authorize_reply, which it is supposed to
verify to ensure the authenticity and protect against replay attacks.
The code for doing this is there (ceph_x_verify_authorizer_reply(),
ceph_auth_verify_authorizer_reply() + plumbing), but it is never
invoked by the messenger.

AFAICT this goes back to 2009, when ceph authentication protocols
support was added to the kernel client in 4e7a5dcd1bba ("ceph:
negotiate authentication protocol; implement AUTH_NONE protocol").

The second param of ceph_connection_operations::verify_authorizer_reply
is unused all the way down.  Pass 0 to facilitate backporting, and kill
it in the next commit.

Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
Signed-off-by: Ben Hutchings

libceph: apply new_state before new_up_client on incrementals

2016-11-20T01:16:57+00:00

commit 930c532869774ebf8af9efe9484c597f896a7d46 upstream.

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like this:

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

    for (i = 0; i < pg->pg_temp.len; i++) {
            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
                    if (ceph_can_shift_osds(pi))
                            continue;

                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
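The osd6 example can be replayed with a small user-space model of the two updates. The flag values follow CEPH_OSD_EXISTS and CEPH_OSD_UP, but the functions are hypothetical and capture only the ordering change. The model leaves out new_weight handling and the extra clean-up done for a destroyed OSD. As in the mapping code, it treats a non-existent OSD as down.

```c
#include <stdbool.h>

/* Bit values modelled on CEPH_OSD_EXISTS and CEPH_OSD_UP. */
#define OSD_EXISTS	(1u << 0)
#define OSD_UP		(1u << 1)

/* A non-existent OSD counts as down, whatever its UP bit says. */
static bool osd_is_up(unsigned int state)
{
	return (state & OSD_EXISTS) && (state & OSD_UP);
}

/* new_up_client: the OSD exists and is up (addr update omitted). */
static void apply_new_up_client(unsigned int *state)
{
	*state |= OSD_EXISTS | OSD_UP;
}

/* new_state: XOR the given bits into osd_state. */
static void apply_new_state(unsigned int *state, unsigned int xorstate)
{
	*state ^= xorstate;
}

/* Old behaviour: fields applied in encoding order. */
static unsigned int apply_encoding_order(unsigned int state,
					 unsigned int xorstate)
{
	apply_new_up_client(&state);
	apply_new_state(&state, xorstate);
	return state;
}

/* Fixed behaviour: new_state first, then new_up_client. */
static unsigned int apply_fixed_order(unsigned int state,
				      unsigned int xorstate)
{
	apply_new_state(&state, xorstate);
	apply_new_up_client(&state);
	return state;
}
```

Starting from EXISTS (osd6 down), the encoding order ends in "UP without EXISTS", which osd_is_up() reports as down. The fixed order ends in EXISTS | UP.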

Fixes: http://tracker.ceph.com/issues/14901

Signed-off-by: Ilya Dryomov 
Reviewed-by: Josh Durgin 
Signed-off-by: Ben Hutchings

libceph: set 'exists' flag for newly up osd

2016-11-20T01:16:56+00:00

commit 6dd74e44dc1df85f125982a8d6591bc4a76c9f5d upstream.

Signed-off-by: Yan, Zheng 
Reviewed-by: Sage Weil 
Signed-off-by: Ilya Dryomov 
Signed-off-by: Ben Hutchings

libceph: make authorizer destruction independent of ceph_auth_client

2016-06-15T20:29:27+00:00

commit 6c1ea260f89709e0021d2c59f8fd2a104b5b1123 upstream.

Starting the kernel client with cephx disabled and then enabling cephx
and restarting userspace daemons can result in a crash:

    [262671.478162] BUG: unable to handle kernel paging request at ffffebe000000000
    [262671.531460] IP: [] kfree+0x5a/0x130
    [262671.584334] PGD 0
    [262671.635847] Oops: 0000 [#1] SMP
    [262672.055841] CPU: 22 PID: 2961272 Comm: kworker/22:2 Not tainted 4.2.0-34-generic #39~14.04.1-Ubuntu
    [262672.162338] Hardware name: Dell Inc. PowerEdge R720/068CDY, BIOS 2.4.3 07/09/2014
    [262672.268937] Workqueue: ceph-msgr con_work [libceph]
    [262672.322290] task: ffff88081c2d0dc0 ti: ffff880149ae8000 task.ti: ffff880149ae8000
    [262672.428330] RIP: 0010:[]  [] kfree+0x5a/0x130
    [262672.535880] RSP: 0018:ffff880149aeba58  EFLAGS: 00010286
    [262672.589486] RAX: 000001e000000000 RBX: 0000000000000012 RCX: ffff8807e7461018
    [262672.695980] RDX: 000077ff80000000 RSI: ffff88081af2be04 RDI: 0000000000000012
    [262672.803668] RBP: ffff880149aeba78 R08: 0000000000000000 R09: 0000000000000000
    [262672.912299] R10: ffffebe000000000 R11: ffff880819a60e78 R12: ffff8800aec8df40
    [262673.021769] R13: ffffffffc035f70f R14: ffff8807e5b138e0 R15: ffff880da9785840
    [262673.131722] FS:  0000000000000000(0000) GS:ffff88081fac0000(0000) knlGS:0000000000000000
    [262673.245377] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [262673.303281] CR2: ffffebe000000000 CR3: 0000000001c0d000 CR4: 00000000001406e0
    [262673.417556] Stack:
    [262673.472943]  ffff880149aeba88 ffff88081af2be04 ffff8800aec8df40 ffff88081af2be04
    [262673.583767]  ffff880149aeba98 ffffffffc035f70f ffff880149aebac8 ffff8800aec8df00
    [262673.694546]  ffff880149aebac8 ffffffffc035c89e ffff8807e5b138e0 ffff8805b047f800
    [262673.805230] Call Trace:
    [262673.859116]  [] ceph_x_destroy_authorizer+0x1f/0x50 [libceph]
    [262673.968705]  [] ceph_auth_destroy_authorizer+0x3e/0x60 [libceph]
    [262674.078852]  [] put_osd+0x45/0x80 [libceph]
    [262674.134249]  [] remove_osd+0xae/0x140 [libceph]
    [262674.189124]  [] __reset_osd+0x103/0x150 [libceph]
    [262674.243749]  [] kick_requests+0x223/0x460 [libceph]
    [262674.297485]  [] ceph_osdc_handle_map+0x282/0x5e0 [libceph]
    [262674.350813]  [] dispatch+0x4e/0x720 [libceph]
    [262674.403312]  [] try_read+0x3d1/0x1090 [libceph]
    [262674.454712]  [] ? dequeue_entity+0x152/0x690
    [262674.505096]  [] con_work+0xcb/0x1300 [libceph]
    [262674.555104]  [] process_one_work+0x14e/0x3d0
    [262674.604072]  [] worker_thread+0x11a/0x470
    [262674.652187]  [] ? rescuer_thread+0x310/0x310
    [262674.699022]  [] kthread+0xd2/0xf0
    [262674.744494]  [] ? kthread_create_on_node+0x1c0/0x1c0
    [262674.789543]  [] ret_from_fork+0x3f/0x70
    [262674.834094]  [] ? kthread_create_on_node+0x1c0/0x1c0

What happens is the following:

    (1) new MON session is established
    (2) old "none" ac is destroyed
    (3) new "cephx" ac is constructed
    ...
    (4) old OSD session (w/ "none" authorizer) is put
          ceph_auth_destroy_authorizer(ac, osd->o_auth.authorizer)

osd->o_auth.authorizer in the "none" case is just a bare pointer into
ac, which contains a single static copy for all services.  By the time
we get to (4), "none" ac, freed in (2), is long gone.  On top of that,
a new vtable installed in (3) points us at ceph_x_destroy_authorizer(),
so we end up trying to destroy a "none" authorizer with a "cephx"
destructor operating on invalid memory!

To fix this, decouple authorizer destruction from ac and do away with
a single static "none" authorizer by making a copy for each OSD or MDS
session.  Authorizers themselves are independent of ac and so there is
no reason for destroy_authorizer() to be an ac op.  Make it an op on
the authorizer itself by turning ceph_authorizer into a real struct.
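A minimal user-space sketch of the new ownership model: the authorizer carries its own destroy op, and each session gets its own heap copy of the "none" authorizer. The struct and function names here (authorizer, none_authorizer, none_create_authorizer(), destroy_authorizer()) are illustrative and may not match the kernel's exactly. The counter exists only to make the destructor calls observable.

```c
#include <stdlib.h>
#include <string.h>

/* The authorizer carries its own destructor, so freeing it no longer
 * goes through whichever auth client (ac) happens to be current. */
struct authorizer {
	void (*destroy)(struct authorizer *a);
};

/* Hypothetical "none" authorizer: one heap copy per OSD or MDS session
 * instead of a pointer into a single static copy inside ac. */
struct none_authorizer {
	struct authorizer base;	/* first member, so the cast below is valid */
	char buf[128];
	size_t buf_len;
};

static int none_destroyed;	/* counts destructor calls, for illustration */

static void none_destroy(struct authorizer *a)
{
	free((struct none_authorizer *)a);
	none_destroyed++;
}

static struct authorizer *none_create_authorizer(const char *payload)
{
	struct none_authorizer *au = calloc(1, sizeof(*au));

	if (au == NULL)
		return NULL;
	au->base.destroy = none_destroy;
	au->buf_len = strlen(payload);
	if (au->buf_len > sizeof(au->buf))
		au->buf_len = sizeof(au->buf);
	memcpy(au->buf, payload, au->buf_len);
	return &au->base;
}

/* Session teardown: no ac argument, so a replaced or freed ac is moot. */
static void destroy_authorizer(struct authorizer *a)
{
	if (a != NULL)
		a->destroy(a);
}
```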

Fixes: http://tracker.ceph.com/issues/15447

Reported-by: Alan Zhang 
Signed-off-by: Ilya Dryomov 
Reviewed-by: Sage Weil 
[bwh: Backported to 3.16:
 - Implementation of ceph_x_destroy_authorizer() is different
 - Adjust context]
Signed-off-by: Ben Hutchings