linux-stable.git/fs/exofs, branch v3.2.78

ore: Fix wrong math in allocation of per device BIO

2014-04-01T23:58:45+00:00

commit aad560b7f63b495f48a7232fd086c5913a676e6f upstream.

At IO preparation we calculate the max pages at each device and
allocate a BIO per device of that size. The calculation was wrong
for some unaligned corner-case offset/length combinations and would
make prepare return -ENOMEM. This is bad for pnfs-objects, which
would in that case do IO through the MDS, and fatal for exofs, where
it would fail writes with EIO.

Fix it by doing the proper math, which works in all cases. (I
ran a test with all possible offset/length combinations this time
round.)

Also when reading we do not need to allocate for the parity units
since we jump over them.

Also lower the max_io_length to take the parity pages into account,
so as not to allocate BIOs bigger than PAGE_SIZE.
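
To illustrate the kind of corner case described above, here is a minimal
sketch. It is not the ORE math: it assumes plain round-robin striping
with no parity, a 4096-byte page, and a stripe unit that is a multiple
of the page size. All names (MODEL_PAGE_SIZE, naive_pages_per_dev,
max_pages_per_dev) are hypothetical. The second function walks every
page the range touches, which is the sort of reference answer an
exhaustive offset/length test can compare against.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumed page size */
#define MODEL_MAX_DEVS  16u	/* arbitrary cap for this sketch */

/* Naive estimate: split the length evenly, then round up to pages. */
static unsigned naive_pages_per_dev(unsigned long long length, unsigned width)
{
	unsigned long long per_dev = (length + width - 1) / width;

	return (unsigned)((per_dev + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE);
}

/* Reference answer: walk every page the range touches, count per device. */
static unsigned max_pages_per_dev(unsigned long long offset,
				  unsigned long long length,
				  unsigned long long stripe_unit,
				  unsigned width)
{
	unsigned long long pages[MODEL_MAX_DEVS] = { 0 };
	unsigned long long end = offset + length;
	unsigned long long pos, max = 0;
	unsigned d;

	assert(width > 0 && width <= MODEL_MAX_DEVS);
	assert(stripe_unit > 0 && stripe_unit % MODEL_PAGE_SIZE == 0);

	/* Start at the page containing offset: a partial page costs a page. */
	for (pos = offset - offset % MODEL_PAGE_SIZE; pos < end;
	     pos += MODEL_PAGE_SIZE) {
		d = (unsigned)((pos / stripe_unit) % width);
		pages[d]++;
	}
	for (d = 0; d < width; d++)
		if (pages[d] > max)
			max = pages[d];
	return (unsigned)max;
}
```

In this model, offset 4095 with length 4098 over two devices and a
4096-byte stripe unit needs two pages on the first device, while the
naive estimate allows only one.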

Signed-off-by: Boaz Harrosh 
Signed-off-by: Ben Hutchings

block: Add bio_for_each_segment_all()

2013-09-10T00:57:27+00:00

commit d74c6d514fe314b8bdab58b487b25992291577ec upstream.

__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bi_idx.  Currently, its only use is to walk all the
bvecs after the bio has been advanced, by specifying index 0.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.
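
The difference between the two iterators can be sketched in userspace as
follows. This is a simplified model, not the kernel macros: struct
bio_model and struct bvec_model are hypothetical stand-ins that keep only
the fields needed here, and the model_* macro names are invented for the
example.

```c
struct bvec_model {
	unsigned bv_len;
};

struct bio_model {
	struct bvec_model *bi_io_vec;	/* the whole vector array */
	unsigned short bi_vcnt;		/* number of entries in bi_io_vec */
	unsigned short bi_idx;		/* current index, moves as IO advances */
};

/* Iterate from a caller-chosen start index. */
#define model_for_each_segment_from(bv, bio, i, start)			\
	for ((i) = (start), (bv) = &(bio)->bi_io_vec[(i)];		\
	     (i) < (bio)->bi_vcnt; (bv)++, (i)++)

/* Walk only what is left, starting at the current index. */
#define model_for_each_segment(bv, bio, i)				\
	model_for_each_segment_from(bv, bio, i, (bio)->bi_idx)

/* Walk every bvec regardless of how far the bio has advanced. */
#define model_for_each_segment_all(bv, bio, i)				\
	model_for_each_segment_from(bv, bio, i, 0)

static unsigned remaining_bytes(struct bio_model *bio)
{
	struct bvec_model *bv;
	unsigned i, sum = 0;

	model_for_each_segment(bv, bio, i)
		sum += bv->bv_len;
	return sum;
}

static unsigned total_bytes(struct bio_model *bio)
{
	struct bvec_model *bv;
	unsigned i, sum = 0;

	model_for_each_segment_all(bv, bio, i)
		sum += bv->bv_len;
	return sum;
}
```

With three bvecs of 100, 200 and 300 bytes and bi_idx advanced to 1, the
first helper sees 500 bytes and the second sees all 600.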

Signed-off-by: Kent Overstreet 
CC: Jens Axboe 
CC: Neil Brown 
CC: Boaz Harrosh 
[bwh: Backported to 3.2: drop inapplicable change to drivers/block/rbd.c.
 This is a prerequisite for commit 35dc248383bb 'sg: Fix user memory
 corruption when SG_IO is interrupted by a signal']
Signed-off-by: Ben Hutchings

ore: Fix out-of-bounds access in _ios_obj()

2012-08-09T23:25:12+00:00

commit 9e62bb4458ad2cf28bd701aa5fab380b846db326 upstream.

_ios_obj() is indexed by group index, not by device_table index.

The oc->comps array holds only one group's worth of devices at a
time; it is not like ore_comp_dev(), which is indexed by a global
device_table index.

This did not BUG until now because exofs only uses a single
COMP for all devices. But with other FSs like PanFS this is
not true.

This bug was only in the write path; all other users were
using it correctly.
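
The indexing rule can be sketched as below. This is a userspace model
under assumptions, not the ORE structures: struct components_model, its
single_comp flag and both helper names are hypothetical. It shows why a
single shared component hides a wrong index, and why a device_table
index has to be made group-relative before it is used on comps[].

```c
#include <stddef.h>

struct comp_model {
	unsigned long long obj_id;
};

struct components_model {
	unsigned first_dev;	/* device_table index of the group's first device */
	unsigned numdevs;	/* devices in this group */
	int single_comp;	/* nonzero: comps holds one entry shared by all */
	struct comp_model *comps;
};

/* Look up by group-relative index, the only index valid for comps[]. */
static struct comp_model *comp_by_group_index(struct components_model *oc,
					      unsigned group_index)
{
	if (oc->single_comp)
		return &oc->comps[0];	/* every index maps to the shared entry */
	if (group_index >= oc->numdevs)
		return NULL;		/* would be an out-of-bounds access */
	return &oc->comps[group_index];
}

/* Convert a global device_table index before looking up. */
static struct comp_model *comp_by_dev_index(struct components_model *oc,
					    unsigned dev_index)
{
	if (dev_index < oc->first_dev)
		return NULL;
	return comp_by_group_index(oc, dev_index - oc->first_dev);
}
```

For a group starting at device 4 with two components, device 5 maps to
comps[1]. Passing 5 directly as a group index is out of range, unless a
single shared component masks the mistake.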

[This is a bug since 3.2 Kernel]

Signed-off-by: Boaz Harrosh 
Signed-off-by: Ben Hutchings

ore: Remove support of partial IO request (NFS crash)

2012-07-25T03:11:30+00:00

commit 62b62ad873f2accad9222a4d7ffbe1e93f6714c1 upstream.

Due to OOM situations the ORE might fail to allocate all resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.

Since this crashes NFS-core, and exofs is just fine without it,
just remove this contraption and fail.

TODO:
	Support real forward progress with some reserved allocations
	of resources, such as mem pools and/or bio_sets

[Bug since 3.2 Kernel]
CC: Benny Halevy 
Signed-off-by: Boaz Harrosh 
Signed-off-by: Ben Hutchings

ore: Fix NFS crash by supporting any unaligned RAID IO

2012-07-25T03:11:29+00:00

commit 9ff19309a9623f2963ac5a136782ea4d8b5d67fb upstream.

In RAID_5/6 we used to not permit an IO whose end byte is not
stripe_size aligned and which spans more than one stripe,
i.e. the caller had to check after submission whether the actual
number of transferred bytes was shorter, and would need to resubmit
a new IO with the remainder.

Exofs supports this, and NFS was supposed to support this
as well with its short write mechanism. But late testing has
exposed a CRASH when this is used with non-RPC layout-drivers.

The change at NFS is deep and risky; in its place, the fix
at ORE to lift the limitation is actually clean and simple.
So here it is below.

The principle here is that in the case of IO that is unaligned on
both ends, beginning and end, we will send two read requests:
one like the old code, before the calculation of the first stripe,
and one at a new site, before the calculation of the last stripe.
If any "boundary" is aligned, or the complete IO is within a single
stripe, we do a single read like before.

The code is clean and simple by splitting the old _read_4_write
into 3 even parts:
1. _read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute

And calling 1+3 at the same place as before, 2+3 before the last
stripe, and in the case of all in a single stripe, 1+2+3
is performed additively.
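
The decision described above can be sketched as a small planning
function. This is a simplified model under assumptions, not the ORE
code: it looks only at stripe alignment of the two boundaries, ignores
page-level detail, and struct r4w_plan_model and plan_read_4_write()
are hypothetical names.

```c
struct r4w_plan_model {
	int read_first;		/* step 1 needed: start is not stripe aligned */
	int read_last;		/* step 2 needed: end is not stripe aligned */
	int submissions;	/* how many times step 3 (execute) runs */
};

static struct r4w_plan_model plan_read_4_write(unsigned long long offset,
					       unsigned long long length,
					       unsigned long long stripe_size)
{
	struct r4w_plan_model p = { 0, 0, 0 };
	unsigned long long end = offset + length;
	int single_stripe;

	if (length == 0 || stripe_size == 0)
		return p;

	single_stripe = offset / stripe_size == (end - 1) / stripe_size;
	p.read_first = offset % stripe_size != 0;
	p.read_last = end % stripe_size != 0;

	if (single_stripe)
		/* Both prepare steps feed one execute (the 1+2+3 case). */
		p.submissions = (p.read_first || p.read_last) ? 1 : 0;
	else
		/* One execute per unaligned boundary (1+3 and/or 2+3). */
		p.submissions = p.read_first + p.read_last;
	return p;
}
```

With an 8192-byte stripe, an IO at offset 100 of length 20000 plans two
reads, the same length at offset 0 plans one, and an IO that sits
inside one stripe plans a single combined read.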

Why did I not think of it before? Well, I had a stroke of
genius, because I have stared at this code for 2 years and did
not find this simple solution until today. Not that I did not try.

This solution is much better for NFS than the previous supposed
solution, because the short write was dealt with out-of-band after
IO_done, which would cause a seeky IO pattern, whereas here we
execute in order. In both solutions we do 2 separate reads, only
here we do it within a single IO request. (And actually combine two
writes into a single submission.)

NFS/exofs code need not change, since the ORE API communicates the new
shorter length on return; what will happen is that this case will not
occur anymore.

hurray!!

[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
Signed-off-by: Boaz Harrosh 
Signed-off-by: Ben Hutchings

exofs: Fix CRASH on very early IO errors.

2012-06-10T13:41:33+00:00

commit 6abe4a87f7bc7978705c386dbba0ca0c7790b3ec upstream.

If exofs_fill_super() terminated early due to any error,
such as an IO error while reading the super-block,
we would crash inside exofs_free_sbi().

This is because sbi->oc.numdevs was set to 1, before
we actually have a device table at all.

Fix it by moving the sbi->oc.numdevs = 1 to after the
allocation of the device table.
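
The ordering issue can be sketched in userspace like this. It is a
model under assumptions, not the exofs code: struct sbi_model and both
functions are hypothetical, and the teardown loop simply stands in for
any cleanup that walks numdevs entries of the device table.

```c
#include <errno.h>
#include <stdlib.h>

struct sbi_model {
	unsigned numdevs;
	int **ods;		/* device table; NULL until allocated */
};

static int fill_super_model(struct sbi_model *sbi, int fail_early)
{
	sbi->numdevs = 0;
	sbi->ods = NULL;

	if (fail_early)
		return -EIO;	/* e.g. the super-block read failed */

	sbi->ods = calloc(1, sizeof(*sbi->ods));
	if (!sbi->ods)
		return -ENOMEM;
	sbi->numdevs = 1;	/* only now may teardown look at ods[0] */
	return 0;
}

/* Returns how many table slots were visited during teardown. */
static unsigned free_sbi_model(struct sbi_model *sbi)
{
	unsigned i, visited = 0;

	for (i = 0; i < sbi->numdevs; i++) {
		/* With numdevs set too early, ods would be NULL here. */
		free(sbi->ods[i]);
		sbi->ods[i] = NULL;
		visited++;
	}
	free(sbi->ods);
	sbi->ods = NULL;
	sbi->numdevs = 0;
	return visited;
}
```

Because numdevs stays 0 until the table exists, the early-failure path
reaches the teardown loop with nothing to walk.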

Reported-by: Johannes Schild 

Stable: This is a bug since v3.2.0
Signed-off-by: Boaz Harrosh 
Signed-off-by: Ben Hutchings

ore: FIX breakage when MISC_FILESYSTEMS is not set

2012-01-12T19:29:29+00:00

commit 831c2dc5f47c1dc79c32229d75065ada1dcc66e1 upstream.

As Reported by Randy Dunlap

When MISC_FILESYSTEMS is not enabled and NFS4.1 is:

fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...

When MISC_FILESYSTEMS, which is more of a GUI thing than anything else,
is not selected, exofs/Kconfig is never examined during Kconfig,
and it cannot do its magic stuff to automatically select everything
needed.

We must split exofs/Kconfig in two. The ore one is always included,
and the exofs one is left in its old place in the menu.

Reported-by: Randy Dunlap 
Signed-off-by: Boaz Harrosh 
Signed-off-by: Greg Kroah-Hartman

ore: Must support none-PAGE-aligned IO

2012-01-12T19:29:29+00:00

commit 724577ca355795b0a25c93ccbeee927871ca1a77 upstream.

NFS might send us offsets that are not PAGE aligned. So
we must read in the remainder of the first/last pages, in cases
where we need it for parity calculations.

We only add an sg segment to read the partial page. But
we don't mark it as read=true because it is a lock-for-write
page.
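
As a rough illustration of which bytes are meant by the remainder of
the first/last pages, the sketch below computes the two gaps for a
given offset and length. It assumes a 4096-byte page, and
MODEL_PAGE_SIZE, struct partial_model and partial_page_reads() are
hypothetical names. It does not decide whether parity actually needs
the data.

```c
#define MODEL_PAGE_SIZE 4096ULL	/* assumed page size */

struct partial_model {
	unsigned long long head;	/* bytes before offset in the first page */
	unsigned long long tail;	/* bytes after the end in the last page */
};

static struct partial_model partial_page_reads(unsigned long long offset,
					       unsigned long long length)
{
	struct partial_model p;
	unsigned long long end = offset + length;

	p.head = offset % MODEL_PAGE_SIZE;
	/* The outer modulo turns a full-page gap into zero when end is aligned. */
	p.tail = (MODEL_PAGE_SIZE - end % MODEL_PAGE_SIZE) % MODEL_PAGE_SIZE;
	return p;
}
```

An IO at offset 100 of length 5000 leaves 100 bytes to read at the
front of its first page and 3092 bytes at the back of its last page; a
page-aligned IO leaves none.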

TODO: In some cases (IO spans a single unit) we can just
adjust the raid_unit offset/length, but this is left for
later Kernels.

Signed-off-by: Boaz Harrosh 
Signed-off-by: Greg Kroah-Hartman

ore: fix BUG_ON, too few sgs when reading

2012-01-12T19:29:28+00:00

commit 361aba569f55dd159b850489a3538253afbb3973 upstream.

When reading RAID5 files, in rare cases, we calculated too
few sg segments. There should be two extra for the beginning
and end partial units.

Also "too few sg segments" should not be a BUG_ON; all the
mechanics are in place to handle it as a short read.
So just return -ENOMEM and the rest of the code will gracefully
split the IO.
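
Both points can be sketched in a flat userspace model. It is not the
ORE code and ignores per-device layout: units_touched(), sg_budget(),
struct sg_list_model and add_sg() are hypothetical names. The first two
functions show why a range can touch two more stripe units than
length / stripe_unit suggests; the last shows running out of segments
reported as an error rather than a crash.

```c
#include <errno.h>

/* Count the stripe units that [offset, offset + length) touches. */
static unsigned long long units_touched(unsigned long long offset,
					unsigned long long length,
					unsigned long long stripe_unit)
{
	if (length == 0)
		return 0;
	return (offset + length - 1) / stripe_unit - offset / stripe_unit + 1;
}

/* Budget: whole units plus one partial unit at each end. */
static unsigned long long sg_budget(unsigned long long length,
				    unsigned long long stripe_unit)
{
	return length / stripe_unit + 2;
}

struct sg_list_model {
	unsigned long long used;
	unsigned long long max;
};

static int add_sg(struct sg_list_model *sgl)
{
	if (sgl->used >= sgl->max)
		return -ENOMEM;	/* caller shortens the IO instead of crashing */
	sgl->used++;
	return 0;
}
```

With a 4096-byte unit, offset 4095 and length 4098 touch three units
even though the length covers only one whole unit, which is exactly
what the budget of one plus two allows.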

Signed-off-by: Boaz Harrosh 
Signed-off-by: Greg Kroah-Hartman

ore: Fix crash in case of an IO error.

2012-01-12T19:29:27+00:00

commit ffefb8eaa367e8a5c14f779233d9da1fbc23d164 upstream.

The users of ore_check_io() expect the reported device
(in case of error) to be indexed relative to the passed-in
ore_components table, and not by the logical dev index.

Reporting the logical dev index causes a crash inside
objlayoutdriver in case of an IO error.

Signed-off-by: Boaz Harrosh 
Signed-off-by: Greg Kroah-Hartman