linux-stable.git/drivers/md, branch linux-2.6.28.y

md: fix deadlock when stopping arrays

2009-05-02T17:57:17+00:00

[backport of 5fd3a17ed456637a224cf4ca82b9ad9d005bc8d4]

Resolve a deadlock when stopping redundant arrays, i.e. ones that
require a call to sysfs_remove_group when shut down.  The deadlock is
summarized below:

Thread1                Thread2
-------                -------
read sysfs attribute   stop array
                       take mddev lock
                       sysfs_remove_group
sysfs_get_active
wait for mddev lock
                       wait for active

Sysrq-w:
  --------
mdmon         S 00000017  2212  4163      1
  f1982ea8 00000046 2dcf6b85 00000017 c0b23100 f2f83ed0 c0b23100 f2f8413c
  c0b23100 c0b23100 c0b1fb98 f2f8413c 00000000 f2f8413c c0b23100 f2291ecc
  00000002 c0b23100 00000000 00000017 f2f83ed0 f1982eac 00000046 c044d9dd
Call Trace:
  [] ? debug_mutex_add_waiter+0x1d/0x58
  [] __mutex_lock_common+0x1d9/0x338
  [] ? __mutex_lock_common+0x1d9/0x338
  [] mutex_lock_interruptible_nested+0x33/0x3a
  [] ? mddev_lock+0x14/0x16
  [] mddev_lock+0x14/0x16
  [] md_attr_show+0x2a/0x49
  [] sysfs_read_file+0x93/0xf9
mdadm         D 00000017  2812  4177      1
  f0401d78 00000046 430456f8 00000017 f0401d58 f0401d20 c0b23100 f2da2c4c
  c0b23100 c0b23100 c0b1fb98 f2da2c4c 0a10fc36 00000000 c0b23100 f0401d70
  00000003 c0b23100 00000000 00000017 f2da29e0 00000001 00000002 00000000
Call Trace:
  [] schedule_timeout+0x1b/0x95
  [] ? schedule_timeout+0x1b/0x95
  [] ? wait_for_common+0x34/0xdc
  [] ? trace_hardirqs_on_caller+0x18/0x145
  [] ? trace_hardirqs_on+0xb/0xd
  [] wait_for_common+0xa0/0xdc
  [] ? default_wake_function+0x0/0x12
  [] wait_for_completion+0x17/0x19
  [] sysfs_addrm_finish+0x19f/0x1d1
  [] sysfs_hash_and_remove+0x42/0x55
  [] sysfs_remove_group+0x57/0x86
  [] do_md_stop+0x13a/0x499

This has been there for a while, but is easier to trigger now that mdmon
is closely watching sysfs.

Cc: Neil Brown 
Reported-by: Jacek Danecki 
Signed-off-by: Dan Williams 
Signed-off-by: Greg Kroah-Hartman

dm crypt: wait for endio to complete before destruction

2009-03-23T21:55:26+00:00

commit b35f8caa0890169000fec22902290d9a15274cbd upstream.

The following oops has been reported when dm-crypt runs over a loop device.

...
[   70.381058] Process loop0 (pid: 4268, ti=cf3b2000 task=cf1cc1f0 task.ti=cf3b2000)
...
[   70.381058] Call Trace:
[   70.381058]  [] ? crypt_dec_pending+0x5e/0x62 [dm_crypt]
[   70.381058]  [] ? crypt_endio+0xa2/0xaa [dm_crypt]
[   70.381058]  [] ? crypt_endio+0x0/0xaa [dm_crypt]
[   70.381058]  [] ? bio_endio+0x2b/0x2e
[   70.381058]  [] ? dec_pending+0x224/0x23b [dm_mod]
[   70.381058]  [] ? clone_endio+0x79/0xa4 [dm_mod]
[   70.381058]  [] ? clone_endio+0x0/0xa4 [dm_mod]
[   70.381058]  [] ? bio_endio+0x2b/0x2e
[   70.381058]  [] ? loop_thread+0x380/0x3b7
[   70.381058]  [] ? do_lo_send_aops+0x0/0x165
[   70.381058]  [] ? autoremove_wake_function+0x0/0x33
[   70.381058]  [] ? loop_thread+0x0/0x3b7

When a table is being replaced, it waits for I/O to complete
before destroying the mempool, but the endio function doesn't
call mempool_free() until after completing the bio.

Fix it by swapping the order of those two operations.

The same problem occurs in dm.c with md referenced after dec_pending.
Again, we swap the order.
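
The ordering problem described above can be illustrated with a small
userspace model.  This is only a sketch under stated assumptions: the
names io_pool, io_ctx, teardown_pool, endio_buggy and endio_fixed are
hypothetical and are not the dm-crypt code.  The completion callback
stands in for completing the bio, and in this model it triggers pool
teardown immediately, which is the worst case for the race.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a mempool: only counts outstanding objects. */
struct io_pool {
	int outstanding;          /* allocated and not yet freed */
	int destroyed_while_busy; /* set if teardown saw outstanding objects */
};

struct io_ctx {
	struct io_pool *pool;
	void (*complete)(struct io_pool *pool); /* stands in for completing the bio */
};

/* Models the table-replacement path, which runs once I/O has completed. */
static void teardown_pool(struct io_pool *pool)
{
	if (pool->outstanding != 0)
		pool->destroyed_while_busy = 1;
}

static void pool_free(struct io_pool *pool)
{
	assert(pool->outstanding > 0);
	pool->outstanding--;
}

/* Buggy order: completion is signalled while the object is still allocated. */
static void endio_buggy(struct io_ctx *io)
{
	struct io_pool *pool = io->pool;

	io->complete(pool);
	pool_free(pool);
}

/* Fixed order: return the object to the pool, then signal completion. */
static void endio_fixed(struct io_ctx *io)
{
	struct io_pool *pool = io->pool;

	pool_free(pool);
	io->complete(pool);
}
```

In the model, endio_buggy lets teardown observe an object that is still
outstanding, while endio_fixed does not.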

Signed-off-by: Milan Broz 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Greg Kroah-Hartman

dm crypt: fix kcryptd_async_done parameter

2009-03-23T21:55:26+00:00

commit b2174eebd1fadb76454dad09a1dacbc17081e6b0 upstream.

In the async encryption-complete function (kcryptd_async_done), the
crypto_async_request passed in may be different from the one passed to
crypto_ablkcipher_encrypt/decrypt.  Only crypto_async_request->data is
guaranteed to be the same as the one passed in.  The current
kcryptd_async_done uses the passed-in crypto_async_request directly
which may cause the AES-NI-based AES algorithm implementation to panic.

This patch fixes this bug by only using crypto_async_request->data,
which points to dm_crypt_request, the crypto_async_request passed in.
The original data (convert_context) is gotten from dm_crypt_request.
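
As a rough illustration of the pattern, the sketch below models a
completion callback that trusts only the ->data pointer of the request
it is handed.  The struct and function names (async_request,
crypt_request, convert_ctx, async_done) are simplified stand-ins chosen
for this example, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-in for the per-conversion state. */
struct convert_ctx {
	int pending; /* requests still in flight */
	int error;   /* first error seen, 0 if none */
};

/* Simplified stand-in for the per-request wrapper. */
struct crypt_request {
	struct convert_ctx *ctx;
};

/* Simplified stand-in for the crypto layer's request object. */
struct async_request {
	void *data; /* the only field the callback may rely on */
};

/*
 * Completion callback.  The request passed in may be a different object
 * from the one originally submitted, so nothing is derived from its
 * address; everything is reached through req->data.
 */
static void async_done(struct async_request *req, int error)
{
	struct crypt_request *creq = req->data;
	struct convert_ctx *ctx = creq->ctx;

	if (error && !ctx->error)
		ctx->error = error;
	ctx->pending--;
}
```

Calling async_done with a request object other than the submitted one
still reaches the right context, as long as ->data is the same.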

[mbroz@redhat.com: reworked]
Signed-off-by: Huang Ying 
Cc: Herbert Xu 
Signed-off-by: Milan Broz 
Signed-off-by: Andrew Morton 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Greg Kroah-Hartman

dm io: respect BIO_MAX_PAGES limit

2009-03-23T21:55:26+00:00

commit d659e6cc98766a1a61d6bdd283f95d149abd7719 upstream.

dm-io calls bio_get_nr_vecs to get the maximum number of pages to use
for a given device.  It allocates one additional bio_vec to use
internally but failed to respect BIO_MAX_PAGES, so fix this.

This was the likely cause of:
  https://bugzilla.redhat.com/show_bug.cgi?id=173153
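
The clamp can be sketched as follows.  This is a minimal model, not the
dm-io code: clamp_bvecs is a hypothetical helper, and the value 256 for
MODEL_BIO_MAX_PAGES is an assumption made for the example.

```c
/* Assumed limit for this sketch; the kernel's value may differ. */
#define MODEL_BIO_MAX_PAGES 256u

/*
 * device_nr_vecs models the per-device maximum reported for a bio.
 * One extra bio_vec is reserved for internal use, and the total is
 * then clamped so it never exceeds the global limit.
 */
static unsigned int clamp_bvecs(unsigned int device_nr_vecs)
{
	unsigned int num_bvecs = device_nr_vecs + 1;

	if (num_bvecs > MODEL_BIO_MAX_PAGES)
		num_bvecs = MODEL_BIO_MAX_PAGES;
	return num_bvecs;
}
```

Without the clamp, a device that already reports the maximum would
yield a request for one bio_vec more than the limit.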

Signed-off-by: Mikulas Patocka 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Greg Kroah-Hartman

dm ioctl: validate name length when renaming

2009-03-23T21:55:25+00:00

commit bc0fd67feba2e0770aad85393500ba77c6489f1c upstream.

When renaming a mapped device, validate the length of the new name.

The rename ioctl accepted any correctly-terminated string enclosed
within the data passed from userspace.  The other ioctls enforce a
size limit of DM_NAME_LEN.  If the name is changed and becomes longer
than that, the device can no longer be addressed by name.

Fix it by properly checking for device name length (including
terminating zero).
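
A check of this kind might look like the sketch below.  It is a
userspace model only: new_name_is_valid is a hypothetical helper, and
the value 128 for MODEL_DM_NAME_LEN is an assumption made for the
example.

```c
#include <stddef.h>
#include <string.h>

/* Assumed limit for this sketch, counting the terminating zero. */
#define MODEL_DM_NAME_LEN 128

/*
 * new_name points into a buffer of buf_len bytes supplied by userspace.
 * The name is accepted only if it is terminated inside the buffer and
 * fits in MODEL_DM_NAME_LEN bytes including the terminating zero.
 */
static int new_name_is_valid(const char *new_name, size_t buf_len)
{
	const char *end = memchr(new_name, '\0', buf_len);

	if (!end)
		return 0; /* not terminated within the supplied data */
	return (size_t)(end - new_name) + 1 <= MODEL_DM_NAME_LEN;
}
```

A 127-character name passes and a 128-character name is rejected, which
matches the "including terminating zero" wording above.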

Signed-off-by: Milan Broz 
Reviewed-by: Jonathan Brassow 
Reviewed-by: Alasdair G Kergon 
Signed-off-by: Alasdair G Kergon 
Signed-off-by: Greg Kroah-Hartman

md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery.

2009-03-17T00:32:09+00:00

commit 09b4068a7fe442efc40e9dcbcf5ff37c3338ab15 upstream.

When doing recovery on a raid10 with a write-intent bitmap, we only
need to recover chunks that are flagged in the bitmap.

However if we choose to skip a chunk as it isn't flagged, the code
currently skips the whole raid10-chunk, thus it might not recover
some blocks that need recovering.

This patch fixes it.

In case that is confusing, it might help to understand that there
is a 'raid10 chunk size' which guides how data is distributed across
the devices, and a 'bitmap chunk size' which says how much data
corresponds to a single bit in the bitmap.

This bug only affects cases where the bitmap chunk size is smaller
than the raid10 chunk size.
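
The relationship between the two chunk sizes can be shown with a small
arithmetic model.  This is a sketch under simplifying assumptions, not
the raid10 code: it treats both chunk sizes as counts of sectors in a
single address space, and the helper names are hypothetical.

```c
/* Sectors from 'sector' to the end of the chunk that contains it. */
static unsigned long long sectors_to_chunk_end(unsigned long long sector,
					       unsigned long long chunk_sectors)
{
	return chunk_sectors - (sector % chunk_sectors);
}

/*
 * How far recovery may skip from 'sector' when the current bitmap bit
 * is clear: never past the end of the current bitmap chunk, and never
 * past the end of the current raid10 chunk.
 */
static unsigned long long skip_len(unsigned long long sector,
				   unsigned long long raid_chunk,
				   unsigned long long bitmap_chunk)
{
	unsigned long long to_raid = sectors_to_chunk_end(sector, raid_chunk);
	unsigned long long to_bitmap = sectors_to_chunk_end(sector, bitmap_chunk);

	return to_bitmap < to_raid ? to_bitmap : to_raid;
}
```

With a 128-sector raid10 chunk and a 32-sector bitmap chunk, a skip
from sector 0 covers 32 sectors rather than 128, so the next bitmap
bit is still examined.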

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery.

2009-03-17T00:32:09+00:00

commit 78200d45cde2a79c0d0ae0407883bb264caa3c18 upstream.

For raid1/4/5/6, resync (fixing inconsistencies between devices) is
very similar to recovery (rebuilding a failed device onto a spare).
They both walk through the device addresses in order.

For raid10 it can be quite different.  Resync follows the 'array'
address, and makes sure all copies are the same.  Recovery walks
through 'device' addresses and recreates each missing block.

The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
(when present) to be updated to reflect a partially completed resync.
It makes assumptions which mean that it does not work correctly for
raid10 recovery at all.

In particular, it can cause bitmap-directed recovery of a raid10 to
not recover some of the blocks that need to be recovered.

So move the call to bitmap_cond_end_sync into the resync path, rather
than being in the common "resync or recovery" path.

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: avoid races when stopping resync.

2009-03-17T00:32:08+00:00

commit 73d5c38a9536142e062c35997b044e89166e063b upstream.

There has been a race in raid10 and raid1 for a long time
which has only recently started showing up due to a scheduler change.

When a sync_read request finishes, as soon as reschedule_retry
is called, another thread can mark the resync request as having
completed, so md_do_sync can finish, ->stop can be called, and
->conf can be freed.  So using conf after reschedule_retry is not
safe.

Similarly, when finishing a sync_write, calling md_done_sync must be
the last thing we do, as it allows a chain of events which will free
conf and other data structures.

The first of these requires action in raid10.c
The second requires action in raid1.c and raid10.c

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: Fix a bug in linear.c causing which_dev() to return the wrong device.

2009-02-12T17:50:25+00:00

commit 852c8bf484a0e17ee27f413ef26e87f522af5607 upstream.

ab5bd5cbc8d4b868378d062eed3d4240930fbb86 introduced the following
bug in linear software raid for large arrays on 32 bit machines:

which_dev() computes the device holding a given sector by shifting
down the sector number to a 32 bit range, dividing by the array
spacing and looking up the resulting index in the hash table of
the array.

Because the computed index might be slightly too small, a loop at
the end of which_dev() increases the index until the given sector
actually falls into the range of the device associated with that index.

The changes of the above mentioned commit caused this loop to check
whether the _index_ rather than the sector number is small enough,
effectively bypassing the loop and thus possibly returning the wrong
device.

As reported by Simon Kirby, this leads to errors such as

	linear_make_request: Sector 2340486136 out of bounds on dev sdi: 156301312 sectors, offset 2109870464

Fix this bug by introducing a local variable for the index so that
the variable containing the passed sector is left unchanged.
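
The shape of the fix can be sketched in a simplified userspace model.
The types and field names below (dev_info, linear_conf, spacing) are
assumptions for this example, and the shift to a 32 bit range is left
out; only the "local index, untouched sector" idea is shown.

```c
#include <stddef.h>

/* Simplified device descriptor: a range of sectors within the array. */
struct dev_info {
	unsigned long long start; /* first array sector on this device */
	unsigned long long size;  /* number of sectors on this device */
};

/* Simplified array configuration. */
struct linear_conf {
	struct dev_info **hash_table; /* slot -> first candidate device */
	unsigned long long spacing;   /* sectors covered by one slot */
};

/*
 * Return the device holding 'sector'.  The hash index is kept in its
 * own local variable so that the loop below still compares the real
 * sector number against each device's range.
 */
static struct dev_info *which_dev(const struct linear_conf *conf,
				  unsigned long long sector)
{
	unsigned long long idx = sector / conf->spacing;
	struct dev_info *hash = conf->hash_table[idx];

	/* The slot may point slightly too early; advance as needed. */
	while (sector >= hash->start + hash->size)
		hash++;
	return hash;
}
```

If the division result overwrote the sector variable instead, the loop
condition would compare a small index against device ranges and would
almost never advance.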

Signed-off-by: Andre Noll 
Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman

md: Ensure an md array never has too many devices.

2009-02-12T17:50:24+00:00

commit de01dfadf25bf83cfe3d85c163005c4320532658 upstream.

Each different metadata format supported by md supports a
different maximum number of devices.
We really should be enforcing this maximum in the kernel, but
we aren't quite doing that properly.

We currently only enforce it at the 'hot_add' point, which is an
older interface which is not used by current userspace.

We need to also enforce it at 'add_new_disk' time for active arrays
and at 'do_md_run' time when starting a new array.

So move the test from 'hot_add' into 'bind_rdev_to_array' which is
called from both 'hot_add' and 'add_new_disk', and add a new
test in 'analyse_sbs' which is called from 'do_md_run'.
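
The restructuring can be pictured with the model below.  It is a sketch
only: model_array and the function names are hypothetical stand-ins for
the paths named above, and the error codes are chosen for the example
rather than taken from the patch.

```c
#include <errno.h>

/* Hypothetical model of an array and its metadata-imposed limit. */
struct model_array {
	int nr_devices;  /* devices currently bound */
	int max_devices; /* limit of the metadata format */
};

/* Shared bind step: the limit is enforced here for every caller. */
static int bind_dev(struct model_array *a)
{
	if (a->nr_devices >= a->max_devices)
		return -EBUSY;
	a->nr_devices++;
	return 0;
}

/* Both add paths go through the shared bind step. */
static int model_hot_add(struct model_array *a)
{
	return bind_dev(a);
}

static int model_add_new_disk(struct model_array *a)
{
	return bind_dev(a);
}

/* Start-time check, for arrays assembled before the limit was known. */
static int model_check_at_start(const struct model_array *a)
{
	return a->nr_devices > a->max_devices ? -EINVAL : 0;
}
```

Placing the test in the shared step means a new add path cannot forget
it, while the start-time check covers arrays that are already over the
limit when they are run.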

This bug (or missing feature) has been around "forever" and so
the patch is suitable for any -stable that is currently maintained.

Signed-off-by: NeilBrown 
Signed-off-by: Greg Kroah-Hartman