linux-stable.git/drivers/md/dm-snap.c, branch linux-4.5.y

dm snapshot: disallow the COW and origin devices from being identical

2016-04-12T14:33:19+00:00

commit 4df2bf466a9c9c92f40d27c4aa9120f4e8227bfc upstream.

Otherwise loading a "snapshot" table using the same device for the
origin and COW devices, e.g.:

echo "0 20971520 snapshot 253:3 253:3 P 8" | dmsetup create snap

will trigger:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 1958.979934] IP: [] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1958.989655] PGD 0
[ 1958.991903] Oops: 0000 [#1] SMP
...
[ 1959.059647] CPU: 9 PID: 3556 Comm: dmsetup Tainted: G          IO    4.5.0-rc5.snitm+ #150
...
[ 1959.083517] task: ffff8800b9660c80 ti: ffff88032a954000 task.ti: ffff88032a954000
[ 1959.091865] RIP: 0010:[]  [] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.104295] RSP: 0018:ffff88032a957b30  EFLAGS: 00010246
[ 1959.110219] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000001
[ 1959.118180] RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff880329334a00
[ 1959.126141] RBP: ffff88032a957b50 R08: 0000000000000000 R09: 0000000000000001
[ 1959.134102] R10: 000000000000000a R11: f000000000000000 R12: ffff880330884d80
[ 1959.142061] R13: 0000000000000008 R14: ffffc90001c13088 R15: ffff880330884d80
[ 1959.150021] FS:  00007f8926ba3840(0000) GS:ffff880333440000(0000) knlGS:0000000000000000
[ 1959.159047] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1959.165456] CR2: 0000000000000098 CR3: 000000032f48b000 CR4: 00000000000006e0
[ 1959.173415] Stack:
[ 1959.175656]  ffffc90001c13040 ffff880329334a00 ffff880330884ed0 ffff88032a957bdc
[ 1959.183946]  ffff88032a957bb8 ffffffffa040f225 ffff880329334a30 ffff880300000000
[ 1959.192233]  ffffffffa04133e0 ffff880329334b30 0000000830884d58 00000000569c58cf
[ 1959.200521] Call Trace:
[ 1959.203248]  [] dm_exception_store_create+0x1d5/0x240 [dm_snapshot]
[ 1959.211986]  [] snapshot_ctr+0x140/0x630 [dm_snapshot]
[ 1959.219469]  [] ? dm_split_args+0x64/0x150 [dm_mod]
[ 1959.226656]  [] dm_table_add_target+0x177/0x440 [dm_mod]
[ 1959.234328]  [] table_load+0x143/0x370 [dm_mod]
[ 1959.241129]  [] ? retrieve_status+0x1b0/0x1b0 [dm_mod]
[ 1959.248607]  [] ctl_ioctl+0x255/0x4d0 [dm_mod]
[ 1959.255307]  [] ? memzero_explicit+0x12/0x20
[ 1959.261816]  [] dm_ctl_ioctl+0x13/0x20 [dm_mod]
[ 1959.268615]  [] do_vfs_ioctl+0xa6/0x5c0
[ 1959.274637]  [] ? __audit_syscall_entry+0xaf/0x100
[ 1959.281726]  [] ? do_audit_syscall_entry+0x66/0x70
[ 1959.288814]  [] SyS_ioctl+0x79/0x90
[ 1959.294450]  [] entry_SYSCALL_64_fastpath+0x12/0x71
...
[ 1959.323277] RIP  [] dm_exception_store_set_chunk_size+0x7a/0x110 [dm_snapshot]
[ 1959.333090]  RSP 
[ 1959.336978] CR2: 0000000000000098
[ 1959.344121] ---[ end trace b049991ccad1169e ]---

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1195899
Signed-off-by: Ding Xiang 
Signed-off-by: Mike Snitzer 
Signed-off-by: Greg Kroah-Hartman

dm snapshot: fix hung bios when copy error occurs

2016-01-09T01:03:05+00:00

When there is an error copying a chunk dm-snapshot can incorrectly hold
associated bios indefinitely, resulting in hung IO.

The function copy_callback sets pe->error if there was error copying the
chunk, and then calls complete_exception.  complete_exception calls
pending_complete on error, otherwise it calls commit_exception with
commit_callback (and commit_callback calls complete_exception).

The persistent exception store (dm-snap-persistent.c) assumes that calls
to prepare_exception and commit_exception are paired.
persistent_prepare_exception increases ps->pending_count and
persistent_commit_exception decreases it.

If there is a copy error, persistent_prepare_exception is called but
persistent_commit_exception is not.  This results in the variable
ps->pending_count never returning to zero and that causes some pending
exceptions (and their associated bios) to be held forever.

Fix this by unconditionally calling commit_exception regardless of
whether the copy was successful.  A new "valid" parameter is added to
commit_exception -- when the copy fails this parameter is set to zero so
that the chunk that failed to copy (and all following chunks) is not
recorded in the snapshot store.  Also, remove commit_callback now that
it is merely a wrapper around pending_complete.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
Cc: stable@vger.kernel.org

dm: don't save and restore bi_private

2015-12-10T15:38:56+00:00

Device mapper used the field bi_private to point to dm_target_io. However,
since kernel 3.15, the bi_private field is unused, and so the targets do
not need to save and restore this field.

This patch removes code that saves and restores bi_private from dm-cache,
dm-snapshot and dm-verity.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm snapshot: add new persistent store option to support overflow

2015-10-09T20:57:03+00:00

Commit 76c44f6d80 introduced the possibly for "Overflow" to be reported
by the snapshot device's status.  Older userspace (e.g. lvm2) does not
handle the "Overflow" status response.

Fix this incompatibility by requiring newer userspace code, that can
cope with "Overflow", request the persistent store with overflow support
by using "PO" (Persistent with Overflow) for the snapshot store type.

Reported-by: Zdenek Kabelac 
Fixes: 76c44f6d80 ("dm snapshot: don't invalidate on-disk image on snapshot write overflow")
Reviewed-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

Merge tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

2015-09-02T23:35:26+00:00

Pull device mapper update from Mike Snitzer:

 - a couple small cleanups in dm-cache, dm-verity, persistent-data's
   dm-btree, and DM core.

 - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio
   prison cells

 - a 4.2-stable fix that adds feature reporting for the dm-stats
   features added in 4.2

 - improve DM-snapshot to not invalidate the on-disk snapshot if
   snapshot device write overflow occurs; but a write overflow triggered
   through the origin device will still invalidate the snapshot.

 - optimize DM-thinp's async discard submission a bit now that late bio
   splitting has been included in block core.

 - switch DM-cache's SMQ policy lock from using a mutex to a spinlock;
   improves performance on very low latency devices (eg. NVMe SSD).

 - document DM RAID 4/5/6's discard support

[ I did not pull the slab changes, which weren't appropriate for this
  tree, and weren't obviously the right thing to do anyway.  At the very
  least they need some discussion and explanation before getting merged.

  Because not pulling the actual tagged commit but doing a partial pull
  instead, this merge commit thus also obviously is missing the git
  signature from the original tag ]

* tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm cache: fix use after freeing migrations
  dm cache: small cleanups related to deferred prison cell cleanup
  dm cache: fix leaking of deferred bio prison cells
  dm raid: document RAID 4/5/6 discard support
  dm stats: report precise_timestamps and histogram in @stats_list output
  dm thin: optimize async discard submission
  dm snapshot: don't invalidate on-disk image on snapshot write overflow
  dm: remove unlikely() before IS_ERR()
  dm: do not override error code returned from dm_get_device()
  dm: test return value for DM_MAPIO_SUBMITTED
  dm verity: remove unused mempool
  dm cache: move wake_waker() from free_migrations() to where it is needed
  dm btree remove: remove unused function get_nr_entries()
  dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
  dm cache policy smq: change the mutex to a spinlock

block: kill merge_bvec_fn() completely

2015-08-13T18:31:57+00:00

As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.

Cc: Jens Axboe 
Cc: Lars Ellenberg 
Cc: drbd-user@lists.linbit.com
Cc: Jiri Kosina 
Cc: Yehuda Sadeh 
Cc: Sage Weil 
Cc: Alex Elder 
Cc: ceph-devel@vger.kernel.org
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Neil Brown 
Cc: linux-raid@vger.kernel.org
Cc: Christoph Hellwig 
Cc: "Martin K. Petersen" 
Acked-by: NeilBrown  (for the 'md' bits)
Acked-by: Mike Snitzer 
Signed-off-by: Kent Overstreet 
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
 dm-era-target, and resolve merge conflicts]
Signed-off-by: Dongsu Park 
Signed-off-by: Ming Lin 
Signed-off-by: Jens Axboe

dm snapshot: don't invalidate on-disk image on snapshot write overflow

2015-08-12T20:22:55+00:00

When the snapshot overflows because of a write to the origin, the on-disk
image has to be invalidated.  However, when the snapshot overflows because
of a write to the snapshot, the on-disk image doesn't have to be
invalidated.  Change the behavior so that the on-disk image is not
invalidated in this case.

When the snapshot overflows, the variable snapshot_overflowed is set.
All writes to the snapshot are disallowed to minimize filesystem
corruption - this condition is cleared when the snapshot is deactivated
and activated.

The user can extend the overflowed snapshot, deactivate and activate it
again, run fsck (if journaling filesystem is not used) mount it and
recover the data.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

block: add a bi_error field to struct bio

2015-07-29T14:55:15+00:00

Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not beeing persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: NeilBrown 
Signed-off-by: Jens Axboe

block: remove management of bi_remaining when restoring original bi_end_io

2015-05-22T14:58:55+00:00

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
 1) saving a bio's original bi_end_io
 2) wiring up an intermediate bi_end_io
 3) restoring the original bi_end_io from intermediate bi_end_io
 4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called.  For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN).  As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0.  When bio_inc_remaining() occurred during step
3 it brought it to a value of 2.  When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining.  For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Mike Snitzer 
Signed-off-by: Jens Axboe

bio: skip atomic inc/dec of ->bi_remaining for non-chains

2015-05-05T19:32:47+00:00

Struct bio has an atomic ref count for chained bio's, and we use this
to know when to end IO on the bio. However, most bio's are not chained,
so we don't need to always introduce this atomic operation as part of
ending IO.

Add a helper to elevate the bi_remaining count, and flag the bio as
now actually needing the decrement at end_io time. Rename the field
to __bi_remaining to catch any current users of this doing the
incrementing manually.

For high IOPS workloads, this reduces the overhead of bio_endio()
substantially.

Tested-by: Robert Elliott 
Acked-by: Kent Overstreet 
Reviewed-by: Jan Kara 
Signed-off-by: Jens Axboe