linux-stable.git/drivers/md/raid5.c, branch linux-4.5.y

md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list

2016-04-12T14:33:39+00:00

commit 550da24f8d62fe81f3c13e3ec27602d6e44d43dc upstream.

break_stripe_batch_list breaks up a batch and copies some flags from
the batch head to the members, preserving others.

It doesn't preserve or copy STRIPE_PREREAD_ACTIVE. This is not
normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
stripe_head is added to a batch, and is not set on stripe_heads
already in a batch.

However there is no locking to ensure one thread doesn't set the flag
after it has just been cleared in another. This does occasionally happen.

md/raid5 maintains a count of the number of stripe_heads with
STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes. When
break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
this could becomes incorrect and will never again return to zero.

md/raid5 delays the handling of some stripe_heads until
preread_active_stripes becomes zero. So when the above mention race
happens, those stripe_heads become blocked and never progress,
resulting is write to the array handing.

So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
in the members of a batch.

URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
Reported-by: Martin Svec (and others)
Tested-by: Tom Weber
Fixes: 1b956f7a8f9a ("md/raid5: be more selective about distributing flags across batch.")
Signed-off-by: NeilBrown
Signed-off-by: Shaohua Li
Signed-off-by: Greg Kroah-Hartman

RAID5: revert e9e4c377e2f563 to fix a livelock

2016-04-12T14:33:38+00:00

commit 6ab2a4b806ae21b6c3e47c5ff1285ec06d505325 upstream.

Revert commit
e9e4c377e2f563(md/raid5: per hash value and exclusive wait_for_stripe)

The problem is raid5_get_active_stripe waits on
conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
in this order:
- release all stripes with hash 0
- raid5_get_active_stripe still sleeps since active_stripes >
  max_nr_stripes * 3 / 4
- release all stripes with hash other than 0. active_stripes becomes 0
- raid5_get_active_stripe still sleeps, since nobody wakes up
  wait_for_stripe[0]
The system live locks. The problem is active_stripes isn't a per-hash
count. Revert the patch makes the live lock go away.

Cc: Yuanhan Liu 
Cc: NeilBrown 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman

RAID5: check_reshape() shouldn't call mddev_suspend

2016-04-12T14:33:38+00:00

commit 27a353c026a879a1001e5eac4bda75b16262c44a upstream.

check_reshape() is called from raid5d thread. raid5d thread shouldn't
call mddev_suspend(), because mddev_suspend() waits for all IO finish
but IO is handled in raid5d thread, we could easily deadlock here.

This issue is introduced by
738a273 ("md/raid5: fix allocation of 'scribble' array.")

Reported-and-tested-by: Artur Paszkiewicz 
Reviewed-by: NeilBrown 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman

md/raid5: Compare apples to apples (or sectors to sectors)

2016-04-12T14:33:38+00:00

commit e7597e69dec59b65c5525db1626b9d34afdfa678 upstream.

'max_discard_sectors' is in sectors, while 'stripe' is in bytes.

This fixes the problem where DISCARD would get disabled on some larger
RAID5 configurations (6 or more drives in my testing), while it worked
as expected with smaller configurations.

Fixes: 620125f2bf8 ("MD: raid5 trim support")
Signed-off-by: Jes Sorensen 
Signed-off-by: Shaohua Li 
Signed-off-by: Greg Kroah-Hartman

MD: rename some functions

2016-01-20T21:52:20+00:00

These short function names are hard to search. Rename them to make vim happy.

Signed-off-by: Shaohua Li

raid5-cache: add journal hot add/remove support

2016-01-06T00:39:57+00:00

Add support for journal disk hot add/remove. Mostly trival checks in md
part. The raid5 part is a little tricky. For hot-remove, we can't wait
pending write as it's called from raid5d. The wait will cause deadlock.
We simplily fail the hot-remove. A hot-remove retry can success
eventually since if journal disk is faulty all pending write will be
failed and finish. For hot-add, since an array supporting journal but
without journal disk will be marked read-only, we are safe to hot add
journal without stopping IO (should be read IO, while journal only
handles write IO).

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown

md/raid5: remove redundant check in stripe_add_to_batch_list()

2016-01-06T00:38:22+00:00

The stripe_add_to_batch_list() function is called only if
stripe_can_batch() returned true, so there is no need for double check.

Signed-off-by: Roman Gushchin 
Cc: Neil Brown 
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown

Merge tag 'md/4.4' of git://neil.brown.name/md

2015-11-05T05:12:47+00:00

Pull md updates from Neil Brown:
 "Two major components to this update.

   1) The clustered-raid1 support from SUSE is nearly complete.  There
      are a few outstanding issues being worked on.  Maybe half a dozen
      patches will bring this to a usable state.

   2) The first stage of journalled-raid5 support from Facebook makes an
      appearance.  With a journal device configured (typically NVRAM or
      SSD), the "RAID5 write hole" should be closed - a crash during
      degraded operations cannot result in data corruption.

      The next stage will be to use the journal as a write-behind cache
      so that latency can be reduced and in some cases throughput
      increased by performing more full-stripe writes.

* tag 'md/4.4' of git://neil.brown.name/md: (66 commits)
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  MD: set journal disk ->raid_disk
  MD: kick out journal disk if it's not fresh
  raid5-cache: start raid5 readonly if journal is missing
  MD: add new bit to indicate raid array with journal
  raid5-cache: IO error handling
  raid5: journal disk can't be removed
  raid5-cache: add trim support for log
  MD: fix info output for journal disk
  raid5-cache: use bio chaining
  raid5-cache: small log->seq cleanup
  raid5-cache: new helper: r5_reserve_log_entry
  raid5-cache: inline r5l_alloc_io_unit into r5l_new_meta
  raid5-cache: take rdev->data_offset into account early on
  raid5-cache: refactor bio allocation
  raid5-cache: clean up r5l_get_meta
  raid5-cache: simplify state machine when caches flushes are not needed
  raid5-cache: factor out a helper to run all stripes for an I/O unit
  raid5-cache: rename flushed_ios to finished_ios
  raid5-cache: free I/O units earlier
  ...

MD: set journal disk ->raid_disk

2015-11-01T02:48:29+00:00

Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown

raid5-cache: start raid5 readonly if journal is missing

2015-11-01T02:48:29+00:00

If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown