linux.git/drivers/md/md.c, branch v2.6.29

md: fix deadlock when stopping arrays

2009-03-04T07:57:25+00:00

Resolve a deadlock when stopping redundant arrays, i.e. ones that
require a call to sysfs_remove_group when shutdown.  The deadlock is
summarized below:

Thread1                Thread2
-------                -------
read sysfs attribute   stop array
                       take mddev lock
                       sysfs_remove_group
sysfs_get_active
wait for mddev lock
                       wait for active

Sysrq-w:
--------
mdmon         S 00000017  2212  4163      1
  f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
  c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
  00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
Call Trace:
  [] ? debug_mutex_add_waiter+0x1d/0x58
  [] __mutex_lock_common+0x1d9/0x338
  [] ? __mutex_lock_common+0x1d9/0x338
  [] mutex_lock_interruptible_nested+0x33/0x3a
  [] ? mddev_lock+0x14/0x16
  [] mddev_lock+0x14/0x16
  [] md_attr_show+0x2a/0x49
  [] sysfs_read_file+0x93/0xf9
mdadm         D 00000017  2812  4177      1
  f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
  c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
  00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
Call Trace:
  [] schedule_timeout+0x1b/0x95
  [] ? schedule_timeout+0x1b/0x95
  [] ? wait_for_common+0x34/0xdc
  [] ? trace_hardirqs_on_caller+0x18/0x145
  [] ? trace_hardirqs_on+0xb/0xd
  [] wait_for_common+0xa0/0xdc
  [] ? default_wake_function+0x0/0x12
  [] wait_for_completion+0x17/0x19
  [] sysfs_addrm_finish+0x19f/0x1d1
  [] sysfs_hash_and_remove+0x42/0x55
  [] sysfs_remove_group+0x57/0x86
  [] do_md_stop+0x13a/0x499

This has been there for a while, but is easier to trigger now that mdmon
is closely watching sysfs.

Cc: 
Reported-by: Jacek Danecki 
Signed-off-by: Dan Williams

block: fix bad definition of BIO_RW_SYNC

2009-02-18T09:32:00+00:00

We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fec62ef4c3675621b9364a667954d4dd.

Signed-off-by: Jens Axboe

md: Ensure an md array never has too many devices.

2009-02-06T07:02:46+00:00

Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk, and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Cc: stable@kernel.org

Signed-off-by: NeilBrown

md: don't retry recovery of raid1 that fails due to error on source drive.

2009-01-08T21:31:11+00:00

If a raid1 has only one working drive and it has a sector which
gives an error on read, then an attempt to recover onto a spare will
fail, but as the single remaining drive is not removed from the
array, the recovery will be immediately re-attempted, resulting
in an infinite recovery loop.

So detect this situation and don't retry recovery once an error
on the lone remaining drive is detected.

Allow recovery to be retried once every time a spare is added
in case the problem wasn't actually a media error.

Signed-off-by: NeilBrown

md: Allow md devices to be created by name.

2009-01-08T21:31:10+00:00

Using sequential numbers to identify md devices is somewhat artificial.
Using names can be a lot more user-friendly.

Also, creating md devices by opening the device special file is a bit
awkward.

So this patch provides a new option for creating and naming devices.

Writing a name such as "md_home" to
    /sys/modules/md_mod/parameters/new_array
will cause an array with that name to be created.  It will appear in
/sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
It will have an arbitrary minor number allocated.

md devices that a created by an open are destroyed on the last
close when the device is inactive.
For named md devices, they will not be destroyed until the array
is explicitly stopped, either with the STOP_ARRAY ioctl or by
writing 'clear' to /sys/block/md_XXXX/md/array_state.

The name of the array must start 'md_' to avoid conflict with
other devices.

Signed-off-by: NeilBrown

md: make devices disappear when they are no longer needed.

2009-01-08T21:31:10+00:00

Currently md devices, once created, never disappear until the module
is unloaded.  This is essentially because the gendisk holds a
reference to the mddev, and the mddev holds a reference to the
gendisk, this a circular reference.

If we drop the reference from mddev to gendisk, then we need to ensure
that the mddev is destroyed when the gendisk is destroyed.  However it
is not possible to hook into the gendisk destruction process to enable
this.

So we drop the reference from the gendisk to the mddev and destroy the
gendisk when the mddev gets destroyed.  However this has a
complication.
Between the call
   __blkdev_get->get_gendisk->kobj_lookup->md_probe
and the call
   __blkdev_get->md_open

there is no obvious way to hold a reference on the mddev any more, so
unless something is done, it will disappear and gendisk will be
destroyed prematurely.

Also, once we decide to destroy the mddev, there will be an unlockable
moment before the gendisk is unlinked (blk_unregister_region) during
which a new reference to the gendisk can be created.  We need to
ensure that this reference can not be used.  i.e. the ->open must
fail.

So:
 1/  in md_probe we set a flag in the mddev (hold_active) which
     indicates that the array should be treated as active, even
     though there are no references, and no appearance of activity.
     This is cleared by md_release when the device is closed if it
     is no longer needed.
     This ensures that the gendisk will survive between md_probe and
     md_open.

 2/  In md_open we check if the mddev we expect to open matches
     the gendisk that we did open.
     If there is a mismatch we return -ERESTARTSYS and modify
     __blkdev_get to retry from the top in that case.
     In the -ERESTARTSYS sys case we make sure to wait until
     the old gendisk (that we succeeded in opening) is really gone so
     we loop at most once.

Some udev configurations will always open an md device when it first
appears.   If we allow an md device that was just created by an open
to disappear on an immediate close, then this can race with such udev
configurations and result in an infinite loop the device being opened
and closed, then re-open due to the 'ADD' even from the first open,
and then close and so on.
So we make sure an md device, once created by an open, remains active
at least until some md 'ioctl' has been made on it.  This means that
all normal usage of md devices will allow them to disappear promptly
when not needed, but the worst that an incorrect usage will do it
cause an inactive md device to be left in existence (it can easily be
removed).

As an array can be stopped by writing to a sysfs attribute
  echo clear > /sys/block/mdXXX/md/array_state
we need to use scheduled work for deleting the gendisk and other
kobjects.  This allows us to wait for any pending gendisk deletion to
complete by simply calling flush_scheduled_work().



Signed-off-by: NeilBrown

md: centralise all freeing of an 'mddev' in 'md_free'

2009-01-08T21:31:09+00:00

md_free is the .release handler for the md kobj_type.
So it makes sense to release all the objects referenced by
the mddev in there, rather than just prior to calling kobject_put
for what we think is the last time.

Signed-off-by: NeilBrown

md: move allocation of ->queue from mddev_find to md_probe

2009-01-08T21:31:08+00:00

It is more balanced to just do simple initialisation in mddev_find,
which allocates and links a new md device, and leave all the
more sophisticated allocation to md_probe (which calls mddev_find).
md_probe already allocated the gendisk.  It should allocate the
queue too.

Signed-off-by: NeilBrown

md: need another print_sb for mdp_superblock_1

2009-01-08T21:31:08+00:00

md_print_devices is called in two code path: MD_BUG(...), and md_ioctl
with PRINT_RAID_DEBUG.  it will dump out all in use md devices
information;

However, it wrongly processed two types of superblock in one:

The header file  has defined two types of superblock,
struct mdp_superblock_s (typedefed with mdp_super_t) according to md with
metadata 0.90, and struct mdp_superblock_1 according to md with metadata
1.0 and later,

These two types of superblock are very different,

The md_print_devices code processed them both in mdp_super_t, that would
lead to wrong informaton dump like:

	[ 6742.345877]
	[ 6742.345887] md:	**********************************
	[ 6742.345890] md:	*  *
	[ 6742.345892] md:	**********************************
	[ 6742.345896] md1: 
	[ 6742.345907] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
	[ 6742.345909] md: rdev superblock:
	[ 6742.345914] md:  SB: (V:0.90.0) ID:<42ef13c7.598c059a.5f9f1645.801e9ee6> CT:4919856d
	[ 6742.345918] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
	[ 6742.345922] md:     UT:4919856d ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:b7992907 E:00000001
	[ 6742.345924]      D  0:  DISK
	[ 6742.345930]      D  1:  DISK
	[ 6742.345933]      D  2:  DISK
	[ 6742.345937]      D  3:  DISK
	[ 6742.345942] md:     THIS:  DISK
	...
	[ 6742.346058] md0: 
	[ 6742.346067] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
	[ 6742.346070] md: rdev superblock:
	[ 6742.346073] md:  SB: (V:1.0.0) ID:<369aad81.00000000.00000000.00000000> CT:9a322a9c
	[ 6742.346077] md:     L-1507699579 S976570180 ND:48 RD:0 md0 LO:65536 CS:196610
	[ 6742.346081] md:     UT:00000018 ST:0 AD:131048 WD:0 FD:8 SD:0 CSUM:00000000 E:00000000
	[ 6742.346084]      D  0:  DISK
	[ 6742.346089]      D  1:  DISK
	[ 6742.346092]      D  2:  DISK
	[ 6742.346096]      D  3:  DISK
	[ 6742.346102] md:     THIS:  DISK
	...
	[ 6742.346219] md:	**********************************
	[ 6742.346221]

Here md1 is metadata 0.90.0, and md0 is metadata 1.2

After some more code to distinguish these two types of superblock, in this patch,

it will generate dump information like:

	[ 7906.755790]
	[ 7906.755799] md:	**********************************
	[ 7906.755802] md:	*  *
	[ 7906.755804] md:	**********************************
	[ 7906.755808] md1: 
	[ 7906.755819] md: rdev ram7, SZ:00065472 F:0 S:1 DN:3
	[ 7906.755821] md: rdev superblock (MJ:0):
	[ 7906.755826] md:  SB: (V:0.90.0) ID:<3fca7a0d.a612bfed.5f9f1645.801e9ee6> CT:491989f3
	[ 7906.755830] md:     L5 S00065472 ND:4 RD:4 md1 LO:2 CS:65536
	[ 7906.755834] md:     UT:491989f3 ST:1 AD:4 WD:4 FD:0 SD:0 CSUM:00fb52ad E:00000001
	[ 7906.755836]      D  0:  DISK
	[ 7906.755842]      D  1:  DISK
	[ 7906.755845]      D  2:  DISK
	[ 7906.755849]      D  3:  DISK
	[ 7906.755855] md:     THIS:  DISK
	...
	[ 7906.755972] md0: 
	[ 7906.755981] md: rdev ram3, SZ:00065472 F:0 S:1 DN:3
	[ 7906.755984] md: rdev superblock (MJ:1):
	[ 7906.755989] md:  SB: (V:1) (F:0) Array-ID:<5fbcf158:55aa:5fbe:9a79:1e939880dcbd>
	[ 7906.755990] md:    Name: "DG5:0" CT:1226410480
	[ 7906.755998] md:       L5 SZ130944 RD:4 LO:2 CS:128 DO:24 DS:131048 SO:8 RO:0
	[ 7906.755999] md:     Dev:00000003 UUID: 9194d744:87f7:a448:85f2:7497b84ce30a
	[ 7906.756001] md:       (F:0) UT:1226410480 Events:0 ResyncOffset:-1 CSUM:0dbcd829
	[ 7906.756003] md:         (MaxDev:384)
	...
	[ 7906.756113] md:	**********************************
	[ 7906.756116]

this md0 (metadata 1.2) information dumping is exactly according to struct
mdp_superblock_1.

Signed-off-by: Cheng Renquan 
Cc: Neil Brown 
Cc: Dan Williams 
Signed-off-by: Andrew Morton 
Signed-off-by: NeilBrown

md: use list_for_each_entry macro directly

2009-01-08T21:31:08+00:00

The rdev_for_each macro defined in  is identical to
list_for_each_entry_safe, from , it should be defined to
use list_for_each_entry_safe, instead of reinventing the wheel.

But some calls to each_entry_safe don't really need a safe version,
just a direct list_for_each_entry is enough, this could save a temp
variable (tmp) in every function that used rdev_for_each.

In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
totally save many tmp vars; and only in the other situations that will call
list_del to delete an entry, the safe version is used.

Signed-off-by: Cheng Renquan 
Signed-off-by: NeilBrown