linux.git/drivers/md/dm-writecache.c, branch v5.6-rc2

dm writecache: improve performance of large linear writes on SSDs

2020-01-16T18:34:17+00:00

When dm-writecache is used with an SSD as the cache device, it submits a
separate bio for each written block. The I/Os are then merged by the disk
scheduler, but this merging degrades performance.

Improve dm-writecache performance by submitting larger bios - this is
possible as long as there is consecutive free space on the cache
device.
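
The idea of grouping consecutive cache blocks can be sketched in user
space as follows. This is only an illustration, not the driver code: the
names struct run and coalesce_runs are hypothetical, and the real driver
builds bios rather than filling an array.

```c
#include <stddef.h>

/* Hypothetical type: one run of consecutive cache blocks that could be
 * submitted as a single larger request. */
struct run {
    unsigned long start;  /* first cache block of the run */
    unsigned long len;    /* number of consecutive blocks */
};

/* Walks 'blocks' in order and groups neighbours that are consecutive on
 * the cache device. Fills 'out' (capacity 'cap') and returns the number
 * of runs written. */
size_t coalesce_runs(const unsigned long *blocks, size_t n,
                     struct run *out, size_t cap)
{
    size_t nruns = 0;

    for (size_t i = 0; i < n; i++) {
        if (nruns > 0 &&
            out[nruns - 1].start + out[nruns - 1].len == blocks[i]) {
            out[nruns - 1].len++;      /* extends the current run */
        } else {
            if (nruns == cap)
                break;                 /* output array is full; stop early */
            out[nruns].start = blocks[i];
            out[nruns].len = 1;        /* starts a new run */
            nruns++;
        }
    }
    return nruns;
}
```

With blocks 10, 11, 12, 20, 21, 30 this yields three runs instead of six
single-block submissions.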

Benchmark (arm64 with 64k page size, using /dev/ram0 as a cache device):

fio --bs=512k --iodepth=32 --size=400M --direct=1 \
    --filename=/dev/mapper/cache --rw=randwrite --numjobs=1 --name=test

block	old	new
size	MiB/s	MiB/s
---------------------
512	181	700
1k	347	1256
2k	644	2020
4k	1183	2759
8k	1852	3333
16k	2469	3509
32k	2974	3670
64k	3404	3810

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: fix incorrect flush sequence when doing SSD mode commit

2020-01-15T01:22:48+00:00

When committing state, the function writecache_flush does the following:
1. write metadata (writecache_commit_flushed)
2. flush disk cache (writecache_commit_flushed)
3. wait for data writes to complete (writecache_wait_for_ios)
4. increase superblock seq_count
5. write the superblock
6. flush disk cache

It may happen that at step 3, when we wait for some write to finish, the
disk may report the write as finished, but the write only hit the disk
cache and it is not yet stored in persistent storage. At step 5 we write
the superblock - it may happen that the superblock is written before the
write that we waited for in step 3. If the machine crashes, it may result
in incorrect data being returned after reboot.

In order to fix the bug, we must swap steps 2 and 3 in the above sequence,
so that we first wait for writes to complete and then flush the disk
cache.
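
The ordering problem can be illustrated with a toy model of a disk that
has a volatile write cache. This is a sketch under simplifying
assumptions, not the driver code: all toy_* names are hypothetical, and
the superblock is assumed to reach the media immediately, which is the
worst case described above.

```c
#include <stdbool.h>

/* Toy disk state for a single outstanding data write. */
struct toy_disk {
    bool data_in_flight;  /* data write submitted, not yet completed */
    bool data_in_cache;   /* completed, but only in the volatile cache */
    bool data_durable;    /* data is on persistent media */
    bool sb_durable;      /* superblock with the new seq_count is durable */
};

/* Waiting only guarantees completion, i.e. the data reached the cache. */
void toy_wait_for_ios(struct toy_disk *d)
{
    if (d->data_in_flight) {
        d->data_in_flight = false;
        d->data_in_cache = true;
    }
}

/* A cache flush persists whatever is in the cache at that moment. */
void toy_flush_cache(struct toy_disk *d)
{
    if (d->data_in_cache) {
        d->data_in_cache = false;
        d->data_durable = true;
    }
}

/* Worst case: the superblock hits the media ahead of cached data. */
void toy_write_superblock(struct toy_disk *d)
{
    d->sb_durable = true;
}

/* Returns true if a crash right after the superblock write would still
 * find the data that the superblock refers to. */
bool commit_is_crash_safe(bool wait_before_flush)
{
    struct toy_disk d = { .data_in_flight = true };

    if (wait_before_flush) {       /* fixed order: wait, then flush */
        toy_wait_for_ios(&d);
        toy_flush_cache(&d);
    } else {                       /* old order: flush, then wait */
        toy_flush_cache(&d);
        toy_wait_for_ios(&d);
    }
    toy_write_superblock(&d);
    return !d.sb_durable || d.data_durable;
}
```

In this model only the wait-then-flush order leaves the data durable
before the superblock is written.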

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: handle REQ_FUA

2019-11-05T19:21:40+00:00

Call writecache_flush() on REQ_FUA in writecache_map().
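
A rough user-space sketch of the idea, with hypothetical names and a
made-up flag value (TOY_REQ_FUA is not the kernel's REQ_FUA bit, and
toy_map_write is not writecache_map()):

```c
#define TOY_REQ_FUA (1u << 0)  /* hypothetical flag bit for illustration */

/* Toy cache state: only counts how many flushes were issued. */
struct toy_cache {
    unsigned flushes;
};

/* Stand-in for committing cached state to stable storage. */
void toy_flush(struct toy_cache *wc)
{
    wc->flushes++;
}

/* Handles one write. Storing the data in the cache is omitted; the point
 * is that a FUA write triggers a flush before it is completed. */
void toy_map_write(struct toy_cache *wc, unsigned flags)
{
    if (flags & TOY_REQ_FUA)
        toy_flush(wc);
}
```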

Cc: stable@vger.kernel.org # 4.18+
Signed-off-by: Maged Mokhtar 
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: fix uninitialized variable warning

2019-11-05T19:11:44+00:00

This fixes coverity warning CID 1454301.

Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: skip writecache_wait for pmem mode

2019-09-05T17:22:05+00:00

The counters in the array bio_in_progress[2] are only incremented and
decremented in SSD mode. In pmem mode they are not involved at all.
So skip writecache_wait_for_ios in writecache_flush for pmem.

Suggested-by: Doris Yu 
Signed-off-by: Huaisheng Ye 
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: optimize performance by sorting the blocks for writeback_all

2019-08-26T14:59:00+00:00

During writeback, the blocks that have been placed on wbl.list are only
partially ordered: only runs of contiguous blocks are in order.

When writeback_all is set, in most cases (and also by default) there are
many blocks in pmem that need to be written back at the same time.
For this case, we can improve performance by having all blocks in wbl.list
sorted. Instead of taking blocks from the tail of wc->lru,
writecache_writeback takes them starting from the first rb_node of the
rb_tree.

The benefit is that writecache_writeback pays no cost for sorting the
blocks, because all blocks are already in ascending order in the rb_tree.
A writecache_flush runs when writeback_all begins to work, which
eliminates duplicate (committed/uncommitted) blocks in the cache.
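
Why no separate sort is needed can be shown with a small user-space
sketch. It uses a plain unbalanced binary search tree in place of the
kernel rb_tree, and the toy_* names are hypothetical: an in-order walk of
a tree keyed by sector visits the entries in ascending order regardless
of insertion order.

```c
#include <stddef.h>

/* Toy tree node keyed by the original sector of a cached block. */
struct toy_node {
    unsigned long sector;
    struct toy_node *left, *right;
};

/* Inserts 'n' into the search tree rooted at *root (no balancing). */
void toy_insert(struct toy_node **root, struct toy_node *n)
{
    n->left = n->right = NULL;
    while (*root)
        root = n->sector < (*root)->sector ? &(*root)->left
                                           : &(*root)->right;
    *root = n;
}

/* In-order walk: appends sectors to 'out' starting at index 'n' and
 * returns the new count. The caller must provide enough space. */
size_t toy_inorder(const struct toy_node *t, unsigned long *out, size_t n)
{
    if (!t)
        return n;
    n = toy_inorder(t->left, out, n);
    out[n++] = t->sector;
    return toy_inorder(t->right, out, n);
}
```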

Testing platform: Thinksystem SR630 with persistent memory.
The cache is a pmem device of 1006MB. The origin device is an HDD, of
which 2GB is used.

Testing steps:
 1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4  4096 0'
 2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
 3) time dmsetup message /dev/mapper/mycache 0 flush

The results are shown below.
With the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197
 # time dmsetup message /dev/mapper/mycache 0 flush
real	0m44.020s
user	0m0.002s
sys	0m0.003s

Without the patch:
 # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
 -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
   iops        : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211
 # time dmsetup message /dev/mapper/mycache 0 flush
real	1m39.221s
user	0m0.001s
sys	0m0.003s

I also verified data integrity with this patch by creating an EXT4
filesystem on mycache, mounting it, and checking the md5 sums of the files
on it. The test result is positive: with this patch, writeback_all takes
less than half the time.

Signed-off-by: Huaisheng Ye 
Signed-off-by: Mike Snitzer

dm writecache: add unlikely for getting two block with same LBA

2019-08-26T14:54:41+00:00

In function writecache_writeback, entries g and f have the same original
sector only when entry f has been committed but entry g has not yet been.

The probability of this happening within the (at most) 256 blocks that
follow entry e is very low.
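
For illustration, a branch hint of this kind can be written in user space
with the GCC/Clang builtin __builtin_expect, which the kernel's unlikely()
is built on. The names toy_unlikely, toy_entry and count_same_sector are
hypothetical and only show the pattern, not the driver's loop.

```c
/* Hint to the compiler that the condition is rarely true.
 * Requires GCC or Clang. */
#define toy_unlikely(x) __builtin_expect(!!(x), 0)

struct toy_entry {
    unsigned long original_sector;
};

/* Counts adjacent entries that share the same original sector. The
 * comparison is marked as the rare case, so the common path (different
 * sectors) is laid out as the fall-through. */
unsigned count_same_sector(const struct toy_entry *e, unsigned n)
{
    unsigned dup = 0;

    for (unsigned i = 1; i < n; i++) {
        if (toy_unlikely(e[i].original_sector == e[i - 1].original_sector))
            dup++;
    }
    return dup;
}
```

The hint does not change the result, only the code layout the compiler
prefers.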

Signed-off-by: Huaisheng Ye 
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: remove unused member pointer in writeback_struct

2019-08-26T14:54:15+00:00

The pointer member "page" of the structure writeback_struct has never
been used. Remove it.

Signed-off-by: Huaisheng Ye 
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: avoid unnecessary lookups in writecache_find_entry()

2019-04-26T15:48:03+00:00

This is a small optimization in writecache_find_entry().

If we go past the condition "if (unlikely(!node))", we can be certain that
there is no entry in the tree that has the block equal to the "block"
variable.

Consequently, we can return the next entry directly; we don't need to go
to the second part of the function that finds the entry with lowest or
highest seq number that matches the "block" variable.
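
The early return can be sketched with a simplified user-space lookup.
This is not writecache_find_entry(): it uses a plain binary search tree,
ignores seq numbers, and the names toy_ent, toy_find and TOY_GET_NEXT are
hypothetical.

```c
#include <stddef.h>

#define TOY_GET_NEXT 1  /* hypothetical flag: return the next entry on a miss */

struct toy_ent {
    unsigned long block;
    struct toy_ent *left, *right;
};

/* Returns the entry whose block equals 'key'. On a miss, returns the
 * entry with the smallest block greater than 'key' if TOY_GET_NEXT is
 * set, otherwise NULL. */
struct toy_ent *toy_find(struct toy_ent *root, unsigned long key, int flags)
{
    struct toy_ent *next = NULL;

    while (root) {
        if (key == root->block)
            return root;           /* exact match */
        if (key < root->block) {
            next = root;           /* best "next" candidate so far */
            root = root->left;
        } else {
            root = root->right;
        }
    }
    /* Fell off the tree: no entry equals 'key', so the answer is final
     * here and no further matching on 'key' is needed. */
    return (flags & TOY_GET_NEXT) ? next : NULL;
}
```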

Also, add some whitespace and cleanup needless braces.

Suggested-by: Huaisheng Ye 
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer

dm writecache: remove unused member page_offset in writeback_struct

2019-04-26T15:32:50+00:00

The structure member page_offset in writeback_struct has never been
used. Remove it.

Signed-off-by: Huaisheng Ye 
Acked-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer