linux-stable.git/drivers/block/zram, branch linux-3.17.y

zram: avoid kunmap_atomic() of a NULL pointer

2014-11-21T17:23:08+00:00

commit c406515239376fc93a30d5d03192182160cbd3fb upstream.

zram could call kunmap_atomic() on a NULL pointer in a rare situation: a
zram page becomes a fully zeroed page after a partial write I/O.  The
current code doesn't handle this case and performs kunmap_atomic() on a
NULL pointer, which panics the kernel.

This patch fixes this issue.
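
The guard described above can be sketched in userspace roughly as
follows.  This is only an illustration, not the upstream diff: the
sketch_* names are hypothetical stand-ins for kmap_atomic() and
kunmap_atomic(), and the control flow is an assumption based on the
description (a partial write works from a separate buffer, so the source
page is never mapped and the mapping pointer stays NULL).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-ins for kmap_atomic()/kunmap_atomic(). */
int sketch_unmap_calls;

static void *sketch_map(unsigned char *page)
{
	return page;
}

static void sketch_unmap(void *addr)
{
	/* Like kunmap_atomic(), this must never be handed NULL. */
	assert(addr != NULL);
	sketch_unmap_calls++;
}

static bool sketch_page_zero_filled(const unsigned char *mem)
{
	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (mem[i])
			return false;
	return true;
}

/*
 * Returns true when the data is all zeroes and needs no storage.
 * For a partial write the data lives in partial_buf and the source
 * page is not mapped, so user_mem stays NULL.
 */
bool sketch_write_is_zero_page(unsigned char *page, unsigned char *partial_buf)
{
	void *user_mem = NULL;
	unsigned char *uncmem;
	bool zero;

	if (partial_buf) {
		uncmem = partial_buf;
	} else {
		user_mem = sketch_map(page);
		uncmem = user_mem;
	}

	zero = sketch_page_zero_filled(uncmem);

	/* The guard: only unmap what was actually mapped. */
	if (user_mem)
		sketch_unmap(user_mem);

	return zero;
}
```

Without the NULL check, the partial-write path would reach
sketch_unmap(NULL) and trip the assertion, which mirrors the panic
described above.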

Signed-off-by: Weijie Yang 
Cc: Sergey Senozhatsky 
Cc: Dan Streetman 
Cc: Nitin Gupta 
Cc: Weijie Yang 
Acked-by: Jerome Marchand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

zram: fix incorrect stat with failed_reads

2014-08-29T23:28:16+00:00

Since we allocate a temporary buffer in zram_bvec_read to handle partial
page operations in commit 924bd88d703e ("Staging: zram: allow partial
page operations"), our ->failed_reads value may be incorrect as we do
not increase its value when failing to allocate the temporary buffer.

Let's fix this issue and correct the annotation of failed_reads.
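
One way to make sure every failure is counted is to funnel all error
paths through a single exit.  The sketch below is a userspace
illustration under that assumption, not the upstream change: the
sketch_* names and the injectable allocator are hypothetical and exist
only so an allocation failure can be simulated.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096

struct sketch_stats {
	uint64_t failed_reads;	/* reads that returned an error */
};

/* Allocator is passed in so that a failure can be simulated. */
typedef void *(*sketch_alloc_fn)(size_t size);

void *sketch_failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

int sketch_bvec_read(struct sketch_stats *stats, int is_partial_io,
		     sketch_alloc_fn alloc)
{
	void *uncmem = NULL;
	int ret = 0;

	assert(stats != NULL && alloc != NULL);

	if (is_partial_io) {
		/* Temporary buffer for a partial page read. */
		uncmem = alloc(SKETCH_PAGE_SIZE);
		if (!uncmem) {
			ret = -ENOMEM;
			goto out;
		}
	}

	/* Decompression and the copy to the caller would happen here. */

out:
	/* Single exit: every failure, including -ENOMEM, is counted. */
	if (ret)
		stats->failed_reads++;
	free(uncmem);
	return ret;
}
```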

Signed-off-by: Chao Yu 
Acked-by: Minchan Kim 
Cc: Nitin Gupta 
Acked-by: Jerome Marchand 
Acked-by: Sergey Senozhatsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: replace global tb_lock with fine grain lock

2014-08-07T01:01:23+00:00

Currently, we use a rwlock tb_lock to protect concurrent access to the
whole zram meta table.  However, according to the actual access model,
there is only a small chance that upper-layer users access the same
table[index] concurrently, so the current lock granularity is too
coarse.

The idea of the optimization is to change the lock granularity from the
whole meta table to a single table entry (table -> table[index]), so
that we still protect concurrent access to the same table[index] while
allowing maximum concurrency.

With this in mind, several kinds of locks which could be used as a
per-entry lock were tested and compared:

Test environment:
x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

iozone test:
iozone -t 4 -R -r 16K -s 200M -I +Z
(1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)

      Test       base      CAS    spinlock    rwlock   bit_spinlock
-------------------------------------------------------------------
 Initial write  1381094   1425435   1422860   1423075   1421521
       Rewrite  1529479   1641199   1668762   1672855   1654910
          Read  8468009  11324979  11305569  11117273  10997202
       Re-read  8467476  11260914  11248059  11145336  10906486
  Reverse Read  6821393   8106334   8282174   8279195   8109186
   Stride read  7191093   8994306   9153982   8961224   9004434
   Random read  7156353   8957932   9167098   8980465   8940476
Mixed workload  4172747   5680814   5927825   5489578   5972253
  Random write  1483044   1605588   1594329   1600453   1596010
        Pwrite  1276644   1303108   1311612   1314228   1300960
         Pread  4324337   4632869   4618386   4457870   4500166

To increase the chance of accessing the same table[index] concurrently,
set a small zram disksize (10MB) and let the threads run with a large
loop count.

fio test:
fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
--scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
--filename=/dev/zram0 --name=seq-write --rw=write --stonewall
--name=seq-read --rw=read --stonewall --name=seq-readwrite
--rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
(10MB zram raw block device, take the average of 10 tests, KB/s)

    Test     base     CAS    spinlock    rwlock  bit_spinlock
-------------------------------------------------------------
seq-write   933789   999357   1003298    995961   1001958
 seq-read  5634130  6577930   6380861   6243912   6230006
   seq-rw  1405687  1638117   1640256   1633903   1634459
  rand-rw  1386119  1614664   1617211   1609267   1612471

All the optimization methods show higher performance than the base;
however, it is hard to say which method is the most appropriate.

On the other hand, zram is mostly used on small embedded systems, so we
don't want to increase the memory footprint.

This patch picks the bit_spinlock method and packs the object size and
page_flag into an unsigned long table.value, so as not to increase
memory overhead on either 32-bit or 64-bit systems.
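
The packing idea can be sketched in userspace with C11 atomics.  This is
an illustration only: the sketch_* names, the 24-bit size field, and the
flag positions are assumptions made for the example, and the spin loop
stands in for the kernel's bit_spin_lock()/bit_spin_unlock().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed layout: low bits hold the size, higher bits hold flags. */
#define SKETCH_FLAG_SHIFT 24
#define SKETCH_SIZE_MASK  ((1UL << SKETCH_FLAG_SHIFT) - 1)

enum sketch_flags {
	SKETCH_ZERO = SKETCH_FLAG_SHIFT + 1,	/* page is all zeroes */
	SKETCH_ACCESS,				/* per-entry lock bit */
};

struct sketch_table_entry {
	unsigned long handle;
	atomic_ulong value;	/* size | flags | lock bit in one word */
};

void sketch_entry_lock(struct sketch_table_entry *e)
{
	const unsigned long bit = 1UL << SKETCH_ACCESS;

	/* Spin until this caller is the one that flipped the bit 0 -> 1. */
	while (atomic_fetch_or_explicit(&e->value, bit,
					memory_order_acquire) & bit)
		;
}

void sketch_entry_unlock(struct sketch_table_entry *e)
{
	atomic_fetch_and_explicit(&e->value, ~(1UL << SKETCH_ACCESS),
				  memory_order_release);
}

/* Caller must hold the entry lock. */
void sketch_set_obj_size(struct sketch_table_entry *e, size_t size)
{
	unsigned long v = atomic_load(&e->value);

	assert(size <= SKETCH_SIZE_MASK);
	atomic_store(&e->value, (v & ~SKETCH_SIZE_MASK) | size);
}

size_t sketch_get_obj_size(struct sketch_table_entry *e)
{
	return atomic_load(&e->value) & SKETCH_SIZE_MASK;
}
```

Because the lock is a single bit inside a word the entry already needs
for the size and flags, no separate lock field is added per entry.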

Finally, even though different kinds of locks perform differently, we
can ignore the difference: if zram is used for swap, the swap subsystem
prevents concurrent access to the same swap slot; if zram is used as a
block device with a filesystem set up on it, the filesystem and the
page cache also mostly prevent concurrent access to the same block.

Acked-by: Sergey Senozhatsky 
Reviewed-by: Davidlohr Bueso 
Signed-off-by: Weijie Yang 
Signed-off-by: Minchan Kim 
Cc: Jerome Marchand 
Cc: Nitin Gupta 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: use size_t instead of u16

2014-08-07T01:01:23+00:00

Some architectures (e.g., hexagon and PowerPC) could use a PAGE_SHIFT of
16 or more.  In these cases u16 is not large enough to represent a
compressed page's size, so use size_t.
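
The overflow is easy to demonstrate in plain C.  The sketch below
assumes a PAGE_SHIFT of 16 and uses hypothetical sketch_* helpers; it
only shows the truncation that occurs when a stored size reaches
PAGE_SIZE, for example for a page stored uncompressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 16
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* A full page's size is one more than the largest u16 value. */
static_assert(SKETCH_PAGE_SIZE > UINT16_MAX, "64K does not fit in 16 bits");

/* What a u16 size field would record for a given length. */
uint16_t sketch_size_as_u16(size_t clen)
{
	return (uint16_t)clen;
}

/* What a size_t size field records. */
size_t sketch_size_as_size_t(size_t clen)
{
	return clen;
}
```

With a 64K page, sketch_size_as_u16(SKETCH_PAGE_SIZE) wraps to 0, while
the size_t variant keeps the full value.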

Signed-off-by: Minchan Kim 
Reported-by: Weijie Yang 
Acked-by: Sergey Senozhatsky 
Cc: Jerome Marchand 
Cc: Nitin Gupta 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: remove unused SECTOR_SIZE define

2014-08-07T01:01:22+00:00

Drop SECTOR_SIZE define, because it's not used.

Signed-off-by: Sergey Senozhatsky 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Weijie Yang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: rename struct `table' to `zram_table_entry'

2014-08-07T01:01:22+00:00

Andrew Morton has recently noted that `struct table' actually represents
a table entry and, thus, should be renamed.  Rename it to
`zram_table_entry'.

Signed-off-by: Sergey Senozhatsky 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Weijie Yang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: avoid lockdep splat by revalidate_disk

2014-07-23T22:10:54+00:00

Sasha reported a lockdep warning [1] introduced by [2].

It can be fixed by doing disk revalidation outside of init_lock.  This
is safe because the disk capacity change is protected by init_lock, so
revalidate_disk always sees the up-to-date value and there is no race.

[1] https://lkml.org/lkml/2014/7/3/735
[2] zram: revalidate disk after capacity change

Fixes 2e32baea46ce ("zram: revalidate disk after capacity change").

Signed-off-by: Minchan Kim 
Reported-by: Sasha Levin 
Cc: "Alexander E. Patrakov" 
Cc: Nitin Gupta 
Cc: Jerome Marchand 
Cc: Sergey Senozhatsky 
CC: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: revalidate disk after capacity change

2014-07-03T16:21:53+00:00

Alexander reported that mkswap on /dev/zram0 fails if another process
has the block device file open.

The steps are as follows:

0. Reset the unused zram device.
1. Use a program that opens /dev/zram0 with O_RDWR and sleeps
   until killed.
2. While that program sleeps, echo the correct value to
   /sys/block/zram0/disksize.
3. Verify (e.g. in /proc/partitions) that the disk size is applied
   correctly. It is.
4. While that program still sleeps, attempt to mkswap /dev/zram0.
   This fails: mkswap: error: swap area needs to be at least 40 KiB

When I investigated, the size that mkswap obtained via
ioctl(fd, BLKGETSIZE64, xxx) was zero, although zram0 had the right size
after step 2.

The reason is that zram didn't revalidate the disk after changing the
capacity, so the size in the blockdev's inode is not up to date until
every open file on the device is closed.

This patch should fix the bug.

Signed-off-by: Minchan Kim 
Reported-by: Alexander E. Patrakov 
Tested-by: Alexander E. Patrakov 
Reviewed-by: Sergey Senozhatsky 
Cc: Nitin Gupta 
Acked-by: Jerome Marchand 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: correct offset usage in zram_bio_discard

2014-06-04T23:54:13+00:00

We want to skip any physical block (PAGE_SIZE) that is only partially
covered by the discard bio, so we check the remaining size and subtract
it if we need to go to the next physical block.

The current offset usage in zram_bio_discard is incorrect and can break
the filesystem on top of zram.  Consider the following scenario:

On some architectures or configs, PAGE_SIZE is 64K, for example.  A
filesystem is set up on the zram disk without PAGE_SIZE alignment, and a
discard bio arrives with offset = 4K and size = 72K.  Normally, it should
not discard any physical block, as it only partially covers two physical
blocks.  However, with the current offset usage, it discards the second
physical block and frees its memory, which breaks the filesystem.

This patch corrects the offset usage in zram_bio_discard.
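
The arithmetic for the scenario above can be checked with a small
userspace sketch.  sketch_discard_pages() is a hypothetical helper, not
the kernel function; it only counts how many whole pages a discard
covers once the partially covered head page is skipped.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns how many whole pages a discard of n bytes covers, starting at
 * byte `offset` inside page `index`.  On return, *first is the index of
 * the first fully covered page (meaningful only when the result is > 0).
 */
size_t sketch_discard_pages(size_t page_size, size_t index, size_t offset,
			    size_t n, size_t *first)
{
	size_t pages = 0;

	assert(page_size > 0 && offset < page_size);

	if (offset) {
		/* Bytes left in the first, partially covered page. */
		size_t head = page_size - offset;

		if (n <= head)
			return 0;
		n -= head;
		index++;
	}

	*first = index;
	while (n >= page_size) {
		pages++;
		n -= page_size;
	}
	return pages;
}
```

With page_size = 64K, offset = 4K and n = 72K, the head consumes 60K and
only 12K remains, so no page is discarded.  Subtracting the offset
itself (4K) rather than the remainder of the first page (60K) would
leave 68K and wrongly count the second page as fully covered.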

Signed-off-by: Weijie Yang 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Acked-by: Joonsoo Kim 
Cc: Sergey Senozhatsky 
Cc: Bob Liu 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

zram: support REQ_DISCARD

2014-04-07T23:36:02+00:00

zram is a RAM-based block device and can be used as the backend of a
filesystem.  When a filesystem deletes a file, it normally doesn't do
anything to the data blocks of that file; it just updates the file's
metadata.  This behavior is no problem on a disk-based block device, but
it is a problem on a RAM-based block device, since we can't free the
memory used for the data blocks.  To overcome this disadvantage, there
is the REQ_DISCARD functionality.  If the block device supports
REQ_DISCARD and the filesystem is mounted with the discard option, the
filesystem sends REQ_DISCARD to the block device whenever some data
blocks are discarded.  All we have to do is handle this request.

This patch sets QUEUE_FLAG_DISCARD and handles the REQ_DISCARD request.
With it, we can free memory used by zram when it is no longer used.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Joonsoo Kim 
Cc: Minchan Kim 
Cc: Nitin Gupta 
Cc: Sergey Senozhatsky 
Cc: Jerome Marchand 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds