linux.git/drivers/md, branch v3.15-rc2

Merge tag '3.15-fixes' of git://neil.brown.name/md

2014-04-17T17:51:01+00:00

Pull md bugfix from Neil Brown:
 "One BUG fix for md for recent commit"

* tag '3.15-fixes' of git://neil.brown.name/md:
  raid5: fix a race of stripe count check

raid5: fix a race of stripe count check

2014-04-17T07:05:28+00:00

I hit another BUG_ON with e240c1839d11152b0355442. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.

V2: keeps the BUG_ON suggested by Neil.

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown

Merge tag 'md/3.15' of git://neil.brown.name/md

2014-04-12T00:20:38+00:00

Pull md updates from Neil Brown:
 "Just a few md patches for the 3.15 merge window.

  Not much happening in md/raid at the moment.  Just a few bug fixes
  (one for -stable) and a couple of performance tweaks"

* tag 'md/3.15' of git://neil.brown.name/md:
  raid5: get_active_stripe avoids device_lock
  raid5: make_request does less prepare wait
  md: avoid oops on unload if some process is in poll or select.
  md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.
  md/bitmap: don't abuse i_writecount for bitmap files.

raid5: get_active_stripe avoids device_lock

2014-04-09T04:42:42+00:00

For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown

raid5: make_request does less prepare wait

2014-04-09T04:42:38+00:00

In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe.  This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.

Signed-off-by: Shaohua Li 
Signed-off-by: NeilBrown

md: avoid oops on unload if some process is in poll or select.

2014-04-09T04:42:34+00:00

If md-mod is unloaded while some process is in poll() or select(),
then that process maintains a pointer to md_event_waiters, and when
the try to unlink from that list, they will oops.

The procfs infrastructure ensures that ->poll won't be called after
remove_proc_entry, but doesn't provide a wait_queue_head for us to
use, and the waitqueue code doesn't provide a way to remove all
listeners from a waitqueue.

So we need to:
 1/ make sure no further references to md_event_waiters are taken (by
    setting md_unloading)
 2/ wake up all processes currently waiting, and
 3/ wait until all those processes have disconnected from our
    wait_queue_head.

Reported-by: "majianpeng" 
Signed-off-by: NeilBrown

md/raid1: r1buf_pool_alloc: free allocate pages when subsequent allocation fails.

2014-04-09T04:42:23+00:00

When performing a user-request check/repair (MD_RECOVERY_REQUEST is set)
on a raid1, we allocate multiple bios each with their own set of pages.

If the page allocations for one bio fails, we currently do *not* free
the pages allocated for the previous bios, nor do we free the bio itself.

This patch frees all the already-allocate pages, and makes sure that
all the bios are freed as well.

This bug can cause a memory leak which can ultimately OOM a machine.
It was introduced in 3.10-rc1.

Fixes: a07876064a0b73ab5ef1ebcf14b1cf0231c07858
Cc: Kent Overstreet 
Cc: stable@vger.kernel.org (3.10+)
Reported-by: Russell King - ARM Linux 
Signed-off-by: NeilBrown

md/bitmap: don't abuse i_writecount for bitmap files.

2014-04-09T02:26:59+00:00

md bitmap code currently tries to use i_writecount to stop any other
process from writing to out bitmap file.  But that is really an abuse
and has bit-rotted so locking is all wrong.

So discard that - root should be allowed to shoot self in foot.

Still use it in a much less intrusive way to stop the same file being
used as bitmap on two different array, and apply other checks to
ensure the file is at least vaguely usable for bitmap storage
(is regular, is open for write.  Support for ->bmap is already checked
elsewhere).

Reported-by: Al Viro 
Signed-off-by: NeilBrown

Merge tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

2014-04-06T01:49:31+00:00

Pull device mapper changes from Mike Snitzer:

 - Fix dm-cache corruption caused by discard_block_size > cache_block_size

 - Fix a lock-inversion detected by LOCKDEP in dm-cache

 - Fix a dangling bio bug in the dm-thinp target's process_deferred_bios
   error path

 - Fix corruption due to non-atomic transaction commit which allowed a
   metadata superblock to be written before all other metadata was
   successfully written -- this is common to all targets that use the
   persistent-data library's transaction manager (dm-thinp, dm-cache and
   dm-era).

 - Various small cleanups in the DM core

 - Add the dm-era target which is useful for keeping track of which
   blocks were written within a user defined period of time called an
   'era'.  Use cases include tracking changed blocks for backup
   software, and partially invalidating the contents of a cache to
   restore cache coherency after rolling back a vendor snapshot.

 - Improve the on-disk layout of multithreaded writes to the
   dm-thin-pool by splitting the pool's deferred bio list to be a
   per-thin device list and then sorting that list using an rb_tree.
   The subsequent read throughput of the data written via multiple
   threads improved by ~70%.

 - Simplify the multipath target's handling of queuing IO by pushing
   requests back to the request queue rather than queueing the IO
   internally.

* tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
  dm cache: fix a lock-inversion
  dm thin: sort the per thin deferred bios using an rb_tree
  dm thin: use per thin device deferred bio lists
  dm thin: simplify pool_is_congested
  dm thin: fix dangling bio in process_deferred_bios error path
  dm mpath: print more useful warnings in multipath_message()
  dm-mpath: do not activate failed paths
  dm mpath: remove extra nesting in map function
  dm mpath: remove map_io()
  dm mpath: reduce memory pressure when requeuing
  dm mpath: remove process_queued_ios()
  dm mpath: push back requests instead of queueing
  dm table: add dm_table_run_md_queue_async
  dm mpath: do not call pg_init when it is already running
  dm: use RCU_INIT_POINTER instead of rcu_assign_pointer in __unbind
  dm: stop using bi_private
  dm: remove dm_get_mapinfo
  dm: make dm_table_alloc_md_mempools static
  dm: take care to copy the space map roots before locking the superblock
  dm transaction manager: fix corruption due to non-atomic transaction commit
  ...

dm cache: fix a lock-inversion

2014-04-04T18:53:05+00:00

When suspending a cache the policy is walked and the individual policy
hints written to the metadata via sync_metadata().  This led to this
lock order:

      policy->lock
        cache_metadata->root_lock

When loading the cache target the policy is populated while the metadata
lock is held:

      cache_metadata->root_lock
         policy->lock

Fix this potential lock-inversion (ABBA) deadlock in sync_metadata() by
ensuring the cache_metadata root_lock is held whilst all the hints are
written, rather than being repeatedly locked while policy->lock is held
(as was the case with each callout that policy_walk_mappings() made to
the old save_hint() method).

Found by turning on the CONFIG_PROVE_LOCKING ("Lock debugging: prove
locking correctness") build option.  However, it is not clear how the
LOCKDEP reported paths can lead to a deadlock since the two paths,
suspending a target and loading a target, never occur at the same time.
But that doesn't mean the same lock-inversion couldn't have occurred
elsewhere.

Reported-by: Marian Csontos 
Signed-off-by: Joe Thornber 
Signed-off-by: Mike Snitzer 
Cc: stable@vger.kernel.org