linux-stable.git/drivers/md/bcache, branch v3.18.78

bcache: fix bch_hprint crash and improve output

2017-09-27T08:57:21+00:00

commit 9276717b9e297a62d1151a43d1cd286213f68eb7 upstream.

Most importantly, solve a crash where %llu was used to format signed
numbers.  This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.

For simplicity, always use the units mechanism rather than having
different output paths.

Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100.  (Remainders of >= 1000 would print as .10).

Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown.  Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).
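
A minimal userspace sketch of the formatting rules described above, for
illustration only. The name hprint_sketch() and the explicit buffer length
are assumptions and not the kernel function; the point is that the magnitude
is formatted as unsigned with a separate sign, and the fractional digit comes
from remainder * 10 / 1024, so it is always a single digit.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for bch_hprint(); not the kernel source. */
static int hprint_sketch(char *buf, size_t len, int64_t v)
{
	static const char units[] = "?kMGTPEZY";
	/* Take the magnitude as unsigned so the unsigned format never sees a negative value. */
	uint64_t q = v < 0 ? -(uint64_t)v : (uint64_t)v;
	unsigned int rem;
	int u = 0;

	/* Divide by 1024 at least once, then again while 4 or more digits remain. */
	do {
		u++;
		rem = (unsigned int)(q & 1023);	/* remainder of this division */
		q >>= 10;
	} while (q >= 1000);

	/* rem * 10 / 1024 is in 0..9, so exactly one digit follows the point. */
	return snprintf(buf, len, "%s%" PRIu64 ".%u%c",
			v < 0 ? "-" : "", q, rem * 10 / 1024, units[u]);
}
```

With these rules 1536 formats as "1.5k", -1536 as "-1.5k", and 1023 as
"0.9k"; dividing the remainder by 100 instead would have produced ".10" for
the last case.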

Signed-off-by: Michael Lyle 
Reported-by: Dmitry Yu Okunev 
Acked-by: Kent Overstreet 
Reviewed-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: fix for gc and write-back race

2017-09-27T08:57:21+00:00

commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.

gc and write-back can race (see my earlier email "bcache get stucked"):
gc thread                               write-back thread
|                                       |bch_writeback_thread()
|bch_gc_thread()                        |
|                                       |==>read_dirty()
|==>bch_btree_gc()                      |
|==>btree_root() //get btree root       |
|                //node write lock      |
|==>bch_btree_gc_root()                 |
|                                       |==>read_dirty_submit()
|                                       |==>write_dirty()
|                                       |==>continue_at(cl,
|                                       |               write_dirty_finish,
|                                       |               system_wq);
|                                       |==>write_dirty_finish()//execute
|                                       |               //in system_wq
|                                       |==>bch_btree_insert()
|                                       |==>bch_btree_map_leaf_nodes()
|                                       |==>__bch_btree_map_nodes()
|                                       |==>btree_root //try to get btree
|                                       |              //root node read
|                                       |              //lock
|                                       |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
|                            //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to execute
|                               //journal_write in system_wq
|                               //but work queue is executing
|                               //write_dirty_finish()
|==>closure_sync(); //wait for journal_write to finish
|                   //and wake up gc,
|-------------stuck here
|==>release root node write lock

This patch allocates a separate workqueue for the write-back thread to
avoid this race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui 
Acked-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: Correct return value for sysfs attach errors

2017-09-27T08:57:21+00:00

commit 77fa100f27475d08a569b9d51c17722130f089e7 upstream.

If bch_cached_dev_attach() encounters an error, it returns a negative
error code.  The variable 'v' that stores the result is unsigned, so
user space sees a very large value returned for bytes written, which
can cause incorrect user space behavior.  Use one signed variable
throughout the function to preserve the error return.
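
The effect can be reproduced in userspace with a small sketch. All names
below are hypothetical stand-ins, and the sketch assumes 'v' was a plain
unsigned int and that ssize_t is wider than unsigned int (as on 64-bit
Linux); neither detail is stated in the commit message.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical stand-in for an attach call that may fail. */
static int attach_sketch(int fail)
{
	return fail ? -EINVAL : 0;
}

/* Buggy shape: the error code passes through an unsigned variable. */
static ssize_t store_buggy_sketch(size_t size, int fail)
{
	unsigned int v = (unsigned int)size;

	if (fail)
		v = (unsigned int)attach_sketch(fail);	/* -EINVAL wraps to UINT_MAX - 21 */
	return v;	/* zero-extended: looks like a huge "bytes written" count */
}

/* Fixed shape: one signed variable carries both the size and the error. */
static ssize_t store_fixed_sketch(size_t size, int fail)
{
	ssize_t v = (ssize_t)size;
	int err = attach_sketch(fail);

	if (err)
		v = err;	/* stays negative all the way to the caller */
	return v;
}
```

Under those assumptions store_buggy_sketch(8, 1) returns a large positive
number, while store_fixed_sketch(8, 1) returns -EINVAL.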

Signed-off-by: Tony Asleson 
Acked-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: correct cache_dirty_target in __update_writeback_rate()

2017-09-27T08:57:21+00:00

commit a8394090a9129b40f9d90dcb7f4a49d60c727ca6 upstream.

__update_writeback_rate() uses a proportional-derivative (PD) controller
algorithm to control the writeback rate. A dirty target number is used in
this PD controller. A larger target number makes the writeback rate
smaller; conversely, a smaller target number makes the writeback rate
larger.

bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

The calculation at step 1) for cache_sectors is incorrect: it does not
account for dirty blocks occupied by flash only volumes.

A flash only volume can be regarded as a bcache device without a cached
device. All data sectors allocated for it are persistent on the cache
device and marked dirty; they are not touched by the bcache writeback and
garbage collection code. So data blocks of flash only volumes should be
ignored when calculating cache_sectors of the cache set.

The current code does not subtract dirty sectors of flash only volumes,
which results in a larger target number from the above 3 steps. As a
consequence the cache device's writeback rate is smaller than the correct
value, and writeback is slower on all cached devices.

This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().
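
The three steps, with the subtraction applied at step 1), can be sketched
as follows. The struct, field, and function names are hypothetical and only
mirror the quantities named in this message; the kernel uses its own
structures and division helpers.

```c
#include <stdint.h>

/* Hypothetical summary of the cache set quantities used by the steps above. */
struct cache_set_sketch {
	uint64_t nbuckets;		/* all buckets of the cache set */
	uint64_t bucket_size;		/* sectors per bucket */
	uint64_t flash_dirty_sectors;	/* dirty sectors of flash only volumes */
	uint64_t cached_dev_sectors;	/* sectors of all cached devices */
};

static uint64_t dirty_target_sketch(const struct cache_set_sketch *c,
				    uint64_t dev_sectors,
				    unsigned int writeback_percent)
{
	/* Step 1, corrected: flash only volumes do not count as cache space. */
	uint64_t cache_sectors = c->nbuckets * c->bucket_size -
				 c->flash_dirty_sectors;
	/* Step 2: share of the cache that is allowed to be dirty. */
	uint64_t cache_dirty_target = cache_sectors * writeback_percent / 100;

	/* Step 3: scale by this device's share of all cached devices. */
	return cache_dirty_target * dev_sectors / c->cached_dev_sectors;
}
```

For example, with 1000 buckets of 1024 sectors, 24000 flash only dirty
sectors, writeback_percent of 10, and a device holding half of the
2000000 cached sectors, the target is 50000 sectors; without the
subtraction it would be 51200.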

(Commit log composed by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: Fix leak of bdev reference

2017-09-27T08:57:21+00:00

commit 4b758df21ee7081ab41448d21d60367efaa625b3 upstream.

If blkdev_get_by_path() in register_bcache() fails, we try to look up the
block device using lookup_bdev() to detect which situation we are in and
report the error properly. However, we never drop the reference returned
to us from lookup_bdev(). Fix that.

Signed-off-by: Jan Kara 
Acked-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: initialize dirty stripes in flash_dev_run()

2017-09-27T08:57:21+00:00

commit 175206cf9ab63161dec74d9cd7f9992e062491f5 upstream.

bcache uses a proportional-derivative (PD) controller algorithm to control
the writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of a thin flash device should not be counted, because flash only
volumes never write back dirty data.

Currently the dirty stripe counter for a thin flash device is not
initialized when the thin flash device starts. This means the following
calculation in the PD controller will reference an undefined dirty stripes
number, and all cached devices attached to the same cache set as the thin
flash device may have an inaccurate writeback rate.

This patch calls bch_sectors_dirty_init() in flash_dev_run() to correctly
initialize the dirty stripe counter when the thin flash device starts to
run. This patch also makes the following parameter data type change,
 -void bch_sectors_dirty_init(struct cached_dev *dc);
 +void bch_sectors_dirty_init(struct bcache_device *);
so that the function can be called conveniently in flash_dev_run().

(Commit log is composed by Coly Li)

Signed-off-by: Tang Junhui 
Reviewed-by: Coly Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bcache: RESERVE_PRIO is too small by one when prio_buckets() is a power of two.

2016-09-01T02:05:44+00:00

[ Upstream commit acc9cf8c66c66b2cbbdb4a375537edee72be64df ]

This patch fixes a cachedev registration-time allocation deadlock.
This can deadlock on boot if your initrd auto-registers bcache devices:

Allocator thread:
[  720.727614] INFO: task bcache_allocato:3833 blocked for more than 120 seconds.
[  720.732361]  [] schedule+0x37/0x90
[  720.732963]  [] bch_bucket_alloc+0x188/0x360 [bcache]
[  720.733538]  [] ? prepare_to_wait_event+0xf0/0xf0
[  720.734137]  [] bch_prio_write+0x19d/0x340 [bcache]
[  720.734715]  [] bch_allocator_thread+0x3ff/0x470 [bcache]
[  720.735311]  [] ? __schedule+0x2dc/0x950
[  720.735884]  [] ? invalidate_buckets+0x980/0x980 [bcache]

Registration thread:
[  720.710403] INFO: task bash:3531 blocked for more than 120 seconds.
[  720.715226]  [] schedule+0x37/0x90
[  720.715805]  [] __bch_btree_map_nodes+0x12d/0x150 [bcache]
[  720.716409]  [] ? bch_btree_insert_check_key+0x1c0/0x1c0 [bcache]
[  720.717008]  [] bch_btree_insert+0xf4/0x170 [bcache]
[  720.717586]  [] ? prepare_to_wait_event+0xf0/0xf0
[  720.718191]  [] bch_journal_replay+0x14a/0x290 [bcache]
[  720.718766]  [] ? ttwu_do_activate.constprop.94+0x5d/0x70
[  720.719369]  [] ? try_to_wake_up+0x1d4/0x350
[  720.719968]  [] run_cache_set+0x580/0x8e0 [bcache]
[  720.720553]  [] register_bcache+0xe2e/0x13b0 [bcache]
[  720.721153]  [] kobj_attr_store+0xf/0x20
[  720.721730]  [] sysfs_kf_write+0x3d/0x50
[  720.722327]  [] kernfs_fop_write+0x12a/0x180
[  720.722904]  [] __vfs_write+0x37/0x110
[  720.723503]  [] ? __sb_start_write+0x58/0x110
[  720.724100]  [] ? security_file_permission+0x23/0xa0
[  720.724675]  [] vfs_write+0xa9/0x1b0
[  720.725275]  [] ? do_audit_syscall_entry+0x6c/0x70
[  720.725849]  [] SyS_write+0x55/0xd0
[  720.726451]  [] ? do_page_fault+0x30/0x80
[  720.727045]  [] system_call_fastpath+0x12/0x71

The fifo code in upstream bcache can't use the last element in the buffer,
which was the cause of the bug: if you asked for a power of two size,
it'd give you a fifo that could hold one less than what you asked for
rather than allocating a buffer twice as big.
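
The capacity effect can be illustrated with a generic ring buffer that
tells "full" from "empty" by leaving one slot unused. This is a sketch with
hypothetical names, tracking indices only; it is not the bcache fifo code.

```c
#include <stddef.h>

/* Hypothetical index-only ring; slots is rounded up to a power of two. */
struct ring_sketch {
	size_t front, back;
	size_t mask;		/* slots - 1 */
};

static size_t roundup_pow2_sketch(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static void ring_init_sketch(struct ring_sketch *r, size_t slots)
{
	r->front = r->back = 0;
	r->mask = roundup_pow2_sketch(slots) - 1;
}

static size_t ring_used_sketch(const struct ring_sketch *r)
{
	return (r->back - r->front) & r->mask;
}

/* front == back means empty, so the ring is full one element early. */
static int ring_push_sketch(struct ring_sketch *r)
{
	if (ring_used_sketch(r) == r->mask)
		return 0;
	r->back = (r->back + 1) & r->mask;
	return 1;
}

/* Count how many pushes succeed on a fresh ring with the given slot count. */
static size_t ring_capacity_sketch(size_t slots)
{
	struct ring_sketch r;
	size_t n = 0;

	ring_init_sketch(&r, slots);
	while (ring_push_sketch(&r))
		n++;
	return n;
}
```

Here ring_capacity_sketch(8) is 7, one short of the request, while a
non power of two request such as 7 still yields 7. Asking for 8 + 1 slots
rounds up to 16 and yields 15, which covers the request at the cost of a
buffer twice as big.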

Signed-off-by: Kent Overstreet 
Tested-by: Eric Wheeler 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

bcache: register_bcache(): call blkdev_put() when cache_alloc() fails

2016-09-01T02:05:44+00:00

[ Upstream commit d9dc1702b297ec4a6bb9c0326a70641b322ba886 ]

register_cache() is supposed to return an error string on error so that
register_bcache() will call blkdev_put() and clean up other user counters,
but it does not set 'char *err' when cache_alloc() fails (e.g., due to
memory pressure) and thus register_bcache() performs no cleanup.

register_bcache() <----------\  <- no jump to err_close, no blkdev_put()
   |                         |
   +->register_cache()       |  <- fails to set char *err
         |                   |
         +->cache_alloc() ---/  <- returns error

This patch sets `char *err` for this failure case so that register_cache()
will cause register_bcache() to correctly jump to err_close and do
cleanup.  This was tested under OOM conditions that triggered the bug.
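
The control flow can be modelled in userspace as below. Every name ending
in _sketch is hypothetical, the counter merely stands in for the block
device reference, and the real functions have different signatures; the
sketch only shows why an unset error string skips the cleanup path.

```c
#include <stddef.h>

/* Models outstanding block device references held by the caller. */
static int open_refs_sketch;

/* Hypothetical allocation step that fails under memory pressure. */
static int cache_alloc_sketch(int oom)
{
	return oom ? -1 : 0;
}

/* Returns NULL on success or an error string on failure. */
static const char *register_cache_sketch(int oom)
{
	const char *err = NULL;

	if (cache_alloc_sketch(oom))
		err = "cache_alloc(): out of memory";	/* the assignment the fix adds */
	return err;
}

static int register_bcache_sketch(int oom)
{
	const char *err;

	open_refs_sketch++;		/* stands in for opening the device */
	err = register_cache_sketch(oom);
	if (err)
		goto err_close;
	return 0;

err_close:
	open_refs_sketch--;		/* stands in for blkdev_put() */
	return -1;
}
```

If register_cache_sketch() left err as NULL on allocation failure, the
goto would never be taken and the reference would stay held.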

Signed-off-by: Eric Wheeler 
Cc: Kent Overstreet 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

bcache: fix cache_set_flush() NULL pointer dereference on OOM

2016-04-18T12:49:25+00:00

[ Upstream commit f8b11260a445169989d01df75d35af0f56178f95 ]

When bch_cache_set_alloc() fails to kzalloc the cache_set, the
asynchronous closure handling tries to dereference a cache_set that
hadn't yet been allocated inside of cache_set_flush(), which is called
by __cache_set_unregister() during cleanup.  This appears to happen only
during an OOM condition on bcache_register.

Signed-off-by: Eric Wheeler 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

bcache: cleaned up error handling around register_cache()

2016-04-18T12:49:25+00:00

[ Upstream commit 9b299728ed777428b3908ac72ace5f8f84b97789 ]

Fix null pointer dereference by changing register_cache() to return an int
instead of being void.  This allows it to return -ENOMEM or -ENODEV and
enables upper layers to handle the OOM case without NULL pointer issues.

See this thread:
  http://thread.gmane.org/gmane.linux.kernel.bcache.devel/3521

Fixes this error:
  gargamel:/sys/block/md5/bcache# echo /dev/sdh2 > /sys/fs/bcache/register

  bcache: register_cache() error opening sdh2: cannot allocate memory
  BUG: unable to handle kernel NULL pointer dereference at 00000000000009b8
  IP: [] cache_set_flush+0x102/0x15c [bcache]
  PGD 120dff067 PUD 1119a3067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in: veth ip6table_filter ip6_tables
  (...)
  CPU: 4 PID: 3371 Comm: kworker/4:3 Not tainted 4.4.2-amd64-i915-volpreempt-20160213bc1 #3
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  Workqueue: events cache_set_flush [bcache]
  task: ffff88020d5dc280 ti: ffff88020b6f8000 task.ti: ffff88020b6f8000
  RIP: 0010:[]  [] cache_set_flush+0x102/0x15c [bcache]

Signed-off-by: Eric Wheeler 
Tested-by: Marc MERLIN 
Cc: 
Signed-off-by: Sasha Levin