linux-stable.git/drivers/md/md.c, branch v3.5

md: avoid crash when stopping md array races with closing other open fds.

2012-07-19T05:59:18+00:00

md will refuse to stop an array if any other fd (or mounted fs) is
using it.
When any fs is unmounted of when the last open fd is closed all
pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put)
so there will be no pending IO to worry about when the array is
stopped.

However in order to send the STOP_ARRAY ioctl to stop the array one
must first get and open fd on the block device.
If some fd is being used to write to the block device and it is closed
after mdadm open the block device, but before mdadm issues the
STOP_ARRAY ioctl, then there will be no last-close on the md device so
__blkdev_put will not call sync_blockdev.

If this happens, then IO can still be in-flight while md tears down
the array and bad things can happen (use-after-free and subsequent
havoc).

So in the case where do_md_stop is being called from an open file
descriptor, call sync_block after taking the mutex to ensure there
will be no new openers.

This is needed when setting a read-write device to read-only too.

Cc: stable@vger.kernel.org
Reported-by: majianpeng 
Signed-off-by: NeilBrown

md: fix bug in handling of new_data_offset

2012-07-19T05:59:18+00:00

commit c6563a8c38fde3c1c7fc925a10bde3ca20799301
    md: add possibility to change data-offset for devices.

introduced a 'new_data_offset' attribute which should normally
be the same as 'data_offset', but can be explicitly set to a different
value to allow a reshape operation to move the data.

Unfortunately when the 'data_offset' is explicitly set through
sysfs, the new_data_offset is not also set, so the two would become
out-of-sync incorrectly.

One result of this is that trying to set the 'size' after the
'data_offset' would fail because it is not permitted to set the size
when the 'data_offset' and 'new_data_offset' are different - as that
can be confusing.
Consequently when mdadm tried to do this while assembling an IMSM
array it would fail.

This bug was introduced in 3.5-rc1.

Reported-by: Brian Downing 
Bisected-by: Brian Downing 
Tested-by: Brian Downing 
Signed-off-by: NeilBrown

md: support re-add of recovering devices.

2012-07-03T05:59:06+00:00

We currently only allow a device to be re-added if it appear to be
in-sync.  This is overly restrictive as it may be desirable to re-add
a device that is in the middle of recovery.

So remove the test for "InSync" - the test on rdev->raid_disk is
sufficient to ensure that the re-add will succeed.

Reported-by: Alexander Lyakas 
Tested-by: Alexander Lyakas 
Signed-off-by: NeilBrown

md: make 'name' arg to md_register_thread non-optional.

2012-07-03T05:56:52+00:00

Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.

So make it non-optional and always explicitly pass the name
of the level that the array will be.

Reported-by: majianpeng 
Signed-off-by: NeilBrown

md:Add blk_plug in sync_thread.

2012-07-03T02:12:26+00:00

Add blk_plug in sync_thread will increase the performance of sync.
Because sync_thread did not blk_plug,so when raid sync, the bio merge
not well.

Testing environment:
SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI
Controller.
OS:Linux xxx 3.5.0-rc2+ #340 SMP Tue Jun 12 09:00:25 CST 2012
x86_64 x86_64 x86_64 GNU/Linux.
RAID5: four ST31000524NS disk.

Without blk_plug:recovery speed about 63M/Sec;
Add blk_plug:recovery speed about 120M/Sec.

Using blktrace:
blktrace -d /dev/sdb -w 60  -o -|blkparse -i -

without blk_plug:
Total (8,16):
 Reads Queued:      309811,     1239MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:   283583,     1189MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:   273351,     1149MiB	 Writes Completed:        0,        0KiB
 Read Merges:        23533,    94132KiB	 Write Merges:            0,        0KiB
 IO unplugs:             0        	 Timer unplugs:           0

add blk_plug:
Total (8,16):
 Reads Queued:      428697,     1714MiB	 Writes Queued:           0,        0KiB
 Read Dispatches:     3954,     1714MiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:     3956,     1715MiB	 Writes Completed:        0,        0KiB
 Read Merges:       424743,     1698MiB	 Write Merges:            0,        0KiB
 IO unplugs:             0        	 Timer unplugs:        3384

The ratio of merge will be markedly increased.

Signed-off-by: majianpeng 
Signed-off-by: NeilBrown

md: check the return of mddev_find()

2012-05-22T03:55:32+00:00

Check the return of mddev_find(), since it may fail due to out of
memeory or out of usable minor number.

The reason I chose -ENODEV instead of -ENOMEM or something else is
md_alloc() function chose that ;)

Signed-off-by: Yuanhan Liu 
Signed-off-by: NeilBrown

DM RAID: Set recovery flags on resume

2012-05-22T03:55:29+00:00

Properly initialize MD recovery flags when resuming device-mapper devices.

When a device-mapper device is suspended, all I/O must stop.  This is done by
calling 'md_stop_writes' and 'mddev_suspend'.  These calls in-turn manipulate
the recovery flags - including setting 'MD_RECOVERY_FROZEN'.  The DM device
may have been suspended while recovery was not yet complete, so the process
needs to pick-up where it left off.  Since 'mddev_resume' does not unset
'MD_RECOVERY_FROZEN' and set 'MD_RECOVERY_NEEDED', we must do it ourselves.
'MD_RECOVERY_NEEDED' can safely be set in 'mddev_resume', but 'MD_RECOVERY_FROZEN'
must be set outside of 'mddev_resume' due to how MD handles RAID reshaping.
(e.g.  It is possible for a user to delay reshaping a RAID5->RAID6 by purposefully
setting 'MD_RECOVERY_FROZEN'.  Clearing it in 'mddev_resume' would override the
desired behavior.)

Because 'mddev_resume' already unconditionally calls 'md_wakeup_thread(mddev->thread)'
there is no need to make this call from 'raid_resume' since it calls 'mddev_resume'.

Also clean up where  level_store calls mddev_resume() - it current
duplicates some of the funcitons of that call. - NB

Signed-off-by: Jonathan Brassow 
Signed-off-by: NeilBrown

md: allow array to be resized while bitmap is present.

2012-05-22T03:55:27+00:00

Now that bitmaps can be resized, we can allow an array to be resized
while the bitmap is present.

This only covers resizing that involves changing the effective size
of member devices, not resizing that changes the number of devices.

Signed-off-by: NeilBrown

md/bitmap: move some fields of 'struct bitmap' into a 'storage' substruct.

2012-05-22T03:55:10+00:00

This new 'struct bitmap_storage' reflects the external storage of the
bitmap.
Having this clearly defined will make it easier to change the storage
used while the array is active.

Signed-off-by: NeilBrown

md/bitmap: allow a bitmap with no backing storage.

2012-05-22T03:55:08+00:00

An md bitmap comprises two parts
 - internal counting of active writes per 'chunk'.
 - external storage of whether there are any active writes on
   each chunk

The second requires the first, but the first doesn't require the
second.

Not having backing storage means that the bitmap cannot expedite
resync after a crash, but it still allows us to expedite the recovery
of a recently-removed device.

So: allow a bitmap to exist even if there is no backing device.
In that case we default to 128M chunks.

A particular value of this is that we can remove and re-add a bitmap
(possibly of a different granularity) on a degraded array, and not
lose the information needed to fast-recover the missing device.

We don't actually activate these bitmaps yet - that will come
in a later patch.

Signed-off-by: NeilBrown