linux-stable.git/drivers/md, branch v3.2.54

dm bufio: initialize read-only module parameters

2014-01-03T04:33:30+00:00

commit 4cb57ab4a2e61978f3a9b7d4f53988f30d61c27f upstream.

Some module parameters in dm-bufio are read-only. These parameters
inform the user about memory consumption. They are not supposed to be
changed by the user.

However, despite being read-only, these parameters can be set on
modprobe or insmod command line, for example:
modprobe dm-bufio current_allocated_bytes=12345

The kernel doesn't expect that these variables can be non-zero at module
initialization and if the user sets them, it results in BUG.

This patch initializes the variables in the module init routine, so that
user-supplied values are ignored.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

dm table: fail dm_table_create on dm_round_up overflow

2014-01-03T04:33:29+00:00

commit 5b2d06576c5410c10d95adfd5c4d8b24de861d87 upstream.

The dm_round_up function may overflow to zero.  In this case,
dm_table_create() must fail rather than go on to allocate an empty array
with alloc_targets().

This fixes a possible memory corruption that could be caused by passing
too large a number in "param->target_count".

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

dm snapshot: avoid snapshot space leak on crash

2014-01-03T04:33:29+00:00

commit 230c83afdd9cd384348475bea1e14b80b3b6b1b8 upstream.

There is a possible leak of snapshot space in case of crash.

The reason for space leaking is that chunks in the snapshot device are
allocated sequentially, but they are finished (and stored in the metadata)
out of order, depending on the order in which copying finished.

For example, supposed that the metadata contains the following records
SUPERBLOCK
METADATA (blocks 0 ... 250)
DATA 0
DATA 1
DATA 2
...
DATA 250

Now suppose that you allocate 10 new data blocks 251-260. Suppose that
copying of these blocks finish out of order (block 260 finished first
and the block 251 finished last). Now, the snapshot device looks like
this:
SUPERBLOCK
METADATA (blocks 0 ... 250, 260, 259, 258, 257, 256)
DATA 0
DATA 1
DATA 2
...
DATA 250
DATA 251
DATA 252
DATA 253
DATA 254
DATA 255
METADATA (blocks 255, 254, 253, 252, 251)
DATA 256
DATA 257
DATA 258
DATA 259
DATA 260

Now, if the machine crashes after writing the first metadata block but
before writing the second metadata block, the space for areas DATA 250-255
is leaked, it contains no valid data and it will never be used in the
future.

This patch makes dm-snapshot complete exceptions in the same order they
were allocated, thus fixing this bug.

Note: when backporting this patch to the stable kernel, change the version
field in the following way:
* if version in the stable kernel is {1, 11, 1}, change it to {1, 12, 0}
* if version in the stable kernel is {1, 10, 0} or {1, 10, 1}, change it
  to {1, 10, 2}
Userspace reads the version to determine if the bug was fixed, so the
version change is needed.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

dm delay: fix a possible deadlock due to shared workqueue

2014-01-03T04:33:22+00:00

commit 718822c1c112dc99e0c72c8968ee1db9d9d910f0 upstream.

The dm-delay target uses a shared workqueue for multiple instances.  This
can cause deadlock if two or more dm-delay targets are stacked on the top
of each other.

This patch changes dm-delay to use a per-instance workqueue.


Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

dm mpath: fix race condition between multipath_dtr and pg_init_done

2014-01-03T04:33:16+00:00

commit 954a73d5d3073df2231820c718fdd2f18b0fe4c9 upstream.

Whenever multipath_dtr() is happening we must prevent queueing any
further path activation work.  Implement this by adding a new
'pg_init_disabled' flag to the multipath structure that denotes future
path activation work should be skipped if it is set.  By disabling
pg_init and then re-enabling in flush_multipath_work() we also avoid the
potential for pg_init to be initiated while suspending an mpath device.

Without this patch a race condition exists that may result in a kernel
panic:

1) If after pg_init_done() decrements pg_init_in_progress to 0, a call
   to wait_for_pg_init_completion() assumes there are no more pending path
   management commands.
2) If pg_init_required is set by pg_init_done(), due to retryable
   mode_select errors, then process_queued_ios() will again queue the
   path activation work.
3) If free_multipath() completes before activate_path() work is called a
   NULL pointer dereference like the following can be seen when
   accessing members of the recently destructed multipath:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
RIP: 0010:[]  [] activate_path+0x1b/0x30 [dm_multipath]
[] worker_thread+0x170/0x2a0
[] ? autoremove_wake_function+0x0/0x40

[switch to disabling pg_init in flush_multipath_work & header edits by Mike Snitzer]
Signed-off-by: Shiva Krishna Merla 
Reviewed-by: Krishnasamy Somasundaram 
Tested-by: Speagle Andy 
Acked-by: Junichi Nomura 
Signed-off-by: Mike Snitzer 
[bwh: Backported to 3.2:
 - Adjust context
 - Bump version to 1.3.2 not 1.6.0]
Signed-off-by: Ben Hutchings

dm: allocate buffer for messages with small number of arguments using GFP_NOIO

2014-01-03T04:33:16+00:00

commit f36afb3957353d2529cb2b00f78fdccd14fc5e9c upstream.

dm-mpath and dm-thin must process messages even if some device is
suspended, so we allocate argv buffer with GFP_NOIO. These messages have
a small fixed number of arguments.

On the other hand, dm-switch needs to process bulk data using messages
so excessive use of GFP_NOIO could cause trouble.

The patch also lowers the default number of arguments from 64 to 8, so
that there is smaller load on GFP_NOIO allocations.

Signed-off-by: Mikulas Patocka 
Acked-by: Alasdair G Kergon 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

dm snapshot: fix data corruption

2013-11-28T14:02:04+00:00

commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream.

This patch fixes a particular type of data corruption that has been
encountered when loading a snapshot's metadata from disk.

When we allocate a new chunk in persistent_prepare, we increment
ps->next_free and we make sure that it doesn't point to a metadata area
by further incrementing it if necessary.

When we load metadata from disk on device activation, ps->next_free is
positioned after the last used data chunk. However, if this last used
data chunk is followed by a metadata area, ps->next_free is positioned
erroneously to the metadata area. A newly-allocated chunk is placed at
the same location as the metadata area, resulting in data or metadata
corruption.

This patch changes the code so that ps->next_free skips the metadata
area when metadata are loaded in function read_exceptions.

The patch also moves a piece of code from persistent_prepare_exception
to a separate function skip_metadata to avoid code duplication.

CVE-2013-4299

Signed-off-by: Mikulas Patocka 
Cc: Mike Snitzer 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Ben Hutchings

dm-snapshot: fix performance degradation due to small hash size

2013-10-26T20:06:05+00:00

commit 60e356f381954d79088d0455e357db48cfdd6857 upstream.

LVM2, since version 2.02.96, creates origin with zero size, then loads
the snapshot driver and then loads the origin.  Consequently, the
snapshot driver sees the origin size zero and sets the hash size to the
lower bound 64.  Such small hash table causes performance degradation.

This patch changes it so that the hash size is determined by the size of
snapshot volume, not minimum of origin and snapshot size.  It doesn't
make sense to set the snapshot size significantly larger than the origin
size, so we do not need to take origin size into account when
calculating the hash size.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Signed-off-by: Ben Hutchings

block: Add bio_for_each_segment_all()

2013-09-10T00:57:27+00:00

commit d74c6d514fe314b8bdab58b487b25992291577ec upstream.

__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx.  Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.

Signed-off-by: Kent Overstreet 
CC: Jens Axboe 
CC: Neil Brown 
CC: Boaz Harrosh 
[bwh: Backported to 3.2: drop inapplicable change to drivers/block/rbd.c.
 This is a prerequisite for commit 35dc248383bb 'sg: Fix user memory
 corruption when SG_IO is interrupted by a signal']
Signed-off-by: Ben Hutchings

md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.

2013-06-19T01:17:02+00:00

commit 3056e3aec8d8ba61a0710fb78b2d562600aa2ea7 upstream.

Without that fix, the following scenario could happen:

- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io()  calls call_bio_endio()
- As a result, in call_bio_endio():
        if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.

So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.

[neilb - applied identical change to raid10 as well]

This bug can result in lost data, so it is suitable for any
-stable kernel.

Signed-off-by: Alex Lyakas 
Signed-off-by: NeilBrown 
[bwh: Backported to 3.2: for raid10, s/rdev/conf->mirrors[dev].rdev/]
Signed-off-by: Ben Hutchings