<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/net/ceph/osdmap.c, branch linux-3.10.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>libceph: don't set weight to IN when OSD is destroyed</title>
<updated>2017-06-07T22:46:46+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2017-03-01T16:33:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=dd7d4be1c6a153eca651da013914e3fa54ef93d9'/>
<id>dd7d4be1c6a153eca651da013914e3fa54ef93d9</id>
<content type='text'>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Sage Weil &lt;sage@redhat.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Sage Weil &lt;sage@redhat.com&gt;
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: apply new_state before new_up_client on incrementals</title>
<updated>2016-08-21T21:22:37+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2016-07-24T16:32:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1196c36fd53c3b1615eb02f986cb727b1dfc1047'/>
<id>1196c36fd53c3b1615eb02f986cb727b1dfc1047</id>
<content type='text'>
commit 930c532869774ebf8af9efe9484c597f896a7d46 upstream.

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

2087    for (i = 0; i &lt; pg-&gt;pg_temp.len; i++) {
2088            if (ceph_osd_is_down(osdmap, pg-&gt;pg_temp.osds[i])) {
2089                    if (ceph_can_shift_osds(pi))
2090                            continue;
2091
2092                    temp-&gt;osds[temp-&gt;size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed

Fixes: http://tracker.ceph.com/issues/14901

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Josh Durgin &lt;jdurgin@redhat.com&gt;
[idryomov@gmail.com: backport to 3.10-3.14: strip primary-affinity]
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 930c532869774ebf8af9efe9484c597f896a7d46 upstream.

Currently, osd_weight and osd_state fields are updated in the encoding
order.  This is wrong, because an incremental map may look like e.g.

    new_up_client: { osd=6, addr=... } # set osd_state and addr
    new_state: { osd=6, xorstate=EXISTS } # clear osd_state

Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state.  A non-existent OSD is considered down by the
mapping code

2087    for (i = 0; i &lt; pg-&gt;pg_temp.len; i++) {
2088            if (ceph_osd_is_down(osdmap, pg-&gt;pg_temp.osds[i])) {
2089                    if (ceph_can_shift_osds(pi))
2090                            continue;
2091
2092                    temp-&gt;osds[temp-&gt;size++] = CRUSH_ITEM_NONE;

and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:

[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680

and hung rbds on the client:

[  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[  493.566805] rbd: rbd0:   result -6 xferred 400000
[  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688

The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed

Fixes: http://tracker.ceph.com/issues/14901

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Josh Durgin &lt;jdurgin@redhat.com&gt;
[idryomov@gmail.com: backport to 3.10-3.14: strip primary-affinity]
Signed-off-by: Willy Tarreau &lt;w@1wt.eu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>crush: fix a bug in tree bucket decode</title>
<updated>2015-08-03T16:29:46+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2015-06-29T16:30:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7546f8cb2dff62ad20c50bf3d6170ad8dadb9cc8'/>
<id>7546f8cb2dff62ad20c50bf3d6170ad8dadb9cc8</id>
<content type='text'>
commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream.

struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -&gt; u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.

Fixes: http://tracker.ceph.com/issues/2759

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Josh Durgin &lt;jdurgin@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 82cd003a77173c91b9acad8033fb7931dac8d751 upstream.

struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe()
should be used.  -Wconversion catches this, but I guess it went
unnoticed in all the noise it spews.  The actual problem (at least for
common crushmaps) isn't the u32 -&gt; u8 truncation though - it's the
advancement by 4 bytes instead of 1 in the crushmap buffer.

Fixes: http://tracker.ceph.com/issues/2759

Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Josh Durgin &lt;jdurgin@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: use pg_num_mask instead of pgp_num_mask for pg.seed calc</title>
<updated>2013-09-27T00:18:29+00:00</updated>
<author>
<name>Sage Weil</name>
<email>sage@inktank.com</email>
</author>
<published>2013-08-29T00:17:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=fd5e2dea537bbf0bfb09f79a8b34c148bb502735'/>
<id>fd5e2dea537bbf0bfb09f79a8b34c148bb502735</id>
<content type='text'>
commit 9542cf0bf9b1a3adcc2ef271edbcbdba03abf345 upstream.

Fix a typo that used the wrong bitmask for the pg.seed calculation.  This
is normally unnoticed because in most cases pg_num == pgp_num.  It is, however,
a bug that is easily corrected.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;alex.elder@linary.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 9542cf0bf9b1a3adcc2ef271edbcbdba03abf345 upstream.

Fix a typo that used the wrong bitmask for the pg.seed calculation.  This
is normally unnoticed because in most cases pg_num == pgp_num.  It is, however,
a bug that is easily corrected.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;alex.elder@linary.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: define ceph_decode_pgid() only once</title>
<updated>2013-05-02T04:17:52+00:00</updated>
<author>
<name>Alex Elder</name>
<email>elder@inktank.com</email>
</author>
<published>2013-04-01T23:58:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ef4859d6479d19bcc65c3156cf3b7dd747355c29'/>
<id>ef4859d6479d19bcc65c3156cf3b7dd747355c29</id>
<content type='text'>
There are two basically identical definitions of __decode_pgid()
in libceph, one in "net/ceph/osdmap.c" and the other in
"net/ceph/osd_client.c".  Get rid of both, and instead define
a single inline version in "include/linux/ceph/osdmap.h".

Signed-off-by: Alex Elder &lt;elder@inktank.com&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There are two basically identical definitions of __decode_pgid()
in libceph, one in "net/ceph/osdmap.c" and the other in
"net/ceph/osd_client.c".  Get rid of both, and instead define
a single inline version in "include/linux/ceph/osdmap.h".

Signed-off-by: Alex Elder &lt;elder@inktank.com&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: rename ceph_calc_object_layout()</title>
<updated>2013-05-02T04:16:17+00:00</updated>
<author>
<name>Alex Elder</name>
<email>elder@inktank.com</email>
</author>
<published>2013-03-02T00:00:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=41766f87f54cc8bef023b4b0550f48753959345a'/>
<id>41766f87f54cc8bef023b4b0550f48753959345a</id>
<content type='text'>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.

Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.

Change the function so it takes a pool number rather than the full
file layout structure.  Only update the ceph_pg if the pool is found
in the osd map.  Get rid of few useless lines of code from the
function while there.

Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().

Signed-off-by: Alex Elder &lt;elder@inktank.com&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The purpose of ceph_calc_object_layout() is to fill in the pool
number and seed for a ceph_pg structure provided, based on a given
osd map and target object id.

Currently that function takes a file layout parameter, but the only
thing used out of that is its pool number.

Change the function so it takes a pool number rather than the full
file layout structure.  Only update the ceph_pg if the pool is found
in the osd map.  Get rid of few useless lines of code from the
function while there.

Since the function now very clearly just fills in the ceph_pg
structure it's provided, rename it ceph_calc_ceph_pg().

Signed-off-by: Alex Elder &lt;elder@inktank.com&gt;
Reviewed-by: Josh Durgin &lt;josh.durgin@inktank.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: fix decoding of pgids</title>
<updated>2013-03-11T21:31:00+00:00</updated>
<author>
<name>Sage Weil</name>
<email>sage@inktank.com</email>
</author>
<published>2013-03-06T22:57:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d6c0dd6b0c196979fa7b34c1d99432fcb1b7e1df'/>
<id>d6c0dd6b0c196979fa7b34c1d99432fcb1b7e1df</id>
<content type='text'>
In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support
for the legacy encoding for the OSDMap and incremental.  However, we didn't
fix the decoding for the pgid.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Yehuda Sadeh &lt;yehuda@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support
for the legacy encoding for the OSDMap and incremental.  However, we didn't
fix the decoding for the pgid.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Yehuda Sadeh &lt;yehuda@inktank.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: add support for HASHPSPOOL pool flag</title>
<updated>2013-02-26T23:03:06+00:00</updated>
<author>
<name>Sage Weil</name>
<email>sage@inktank.com</email>
</author>
<published>2013-02-26T18:39:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=83ca14fdd35821554058e5fd4fa7b118ee504a33'/>
<id>83ca14fdd35821554058e5fd4fa7b118ee504a33</id>
<content type='text'>
The legacy behavior adds the pgid seed and pool together as the input for
CRUSH.  That is problematic because each pool's PGs end up mapping to the
same OSDs: 1.5 == 2.4 == 3.3 == ...

Instead, if the HASHPSPOOL flag is set, we has the ps and pool together and
feed that into CRUSH.  This ensures that two adjacent pools will map to
an independent pseudorandom set of OSDs.

Advertise our support for this via a protocol feature flag.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The legacy behavior adds the pgid seed and pool together as the input for
CRUSH.  That is problematic because each pool's PGs end up mapping to the
same OSDs: 1.5 == 2.4 == 3.3 == ...

Instead, if the HASHPSPOOL flag is set, we has the ps and pool together and
feed that into CRUSH.  This ensures that two adjacent pools will map to
an independent pseudorandom set of OSDs.

Advertise our support for this via a protocol feature flag.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: calculate placement based on the internal data types</title>
<updated>2013-02-26T23:02:37+00:00</updated>
<author>
<name>Sage Weil</name>
<email>sage@inktank.com</email>
</author>
<published>2013-02-26T00:13:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2169aea649c08374bec7d220a3b8f64712275356'/>
<id>2169aea649c08374bec7d220a3b8f64712275356</id>
<content type='text'>
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type.  This allows us to
pass the full 32-bit precision of the pgid.seed to the callers.  It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Instead of using the old ceph_object_layout struct, update our internal
ceph_calc_object_layout method to use the ceph_pg type.  This allows us to
pass the full 32-bit precision of the pgid.seed to the callers.  It also
allows some callers to avoid reaching into the request structures for the
struct ceph_object_layout fields.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ceph: update support for PGID64, PGPOOL3, OSDENC protocol features</title>
<updated>2013-02-26T23:02:25+00:00</updated>
<author>
<name>Sage Weil</name>
<email>sage@inktank.com</email>
</author>
<published>2013-02-23T18:41:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4f6a7e5ee1393ec4b243b39dac9f36992d161540'/>
<id>4f6a7e5ee1393ec4b243b39dac9f36992d161540</id>
<content type='text'>
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012.  Require these
features to simplify support; nobody is running older userspace.

Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features.
These have been present in ceph.git since v0.42, Feb 2012.  Require these
features to simplify support; nobody is running older userspace.

Note that the new request and reply encoding is still not in place, so the new
code is not yet functional.

Signed-off-by: Sage Weil &lt;sage@inktank.com&gt;
Reviewed-by: Alex Elder &lt;elder@inktank.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
