linux-stable.git/block, branch linux-4.1.y

blk-mq: fix kernel oops in blk_mq_tag_idle()

2018-05-23T01:36:29+00:00

[ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ]

HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
then we need to check it before calling blk_mq_tag_idle(), otherwise
the following kernel oops can be triggered, so fix it by checking if
the hw queue is unmapped since it doesn't make sense to idle the tags
any more after hw queues are unmapped.

[  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
[  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
[  440.788359] RIP: 0010:[]  [] __blk_mq_tag_idle+0x24/0x40
[  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
[  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
[  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
[  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
[  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
[  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
[  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
[  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
[  440.876169] Call Trace:
[  440.879818]  [] blk_mq_exit_hctx+0xd8/0xe0
[  440.887051]  [] blk_mq_free_queue+0xf0/0x160
[  440.894465]  [] blk_cleanup_queue+0xd9/0x150
[  440.901881]  [] nvme_ns_remove+0x5b/0xb0 [nvme_core]
[  440.910068]  [] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
[  440.919026]  [] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
[  440.928079]  [] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
[  440.937126]  [] process_one_work+0x17a/0x440
[  440.944517]  [] worker_thread+0x278/0x3c0
[  440.951607]  [] ? manage_workers.isra.24+0x2a0/0x2a0
[  440.959760]  [] kthread+0xcf/0xe0
[  440.966055]  [] ? insert_kthread_work+0x40/0x40
[  440.973715]  [] ret_from_fork+0x58/0x90
[  440.980586]  [] ? insert_kthread_work+0x40/0x40
[  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5  ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
[  441.011620] RIP  [] __blk_mq_tag_idle+0x24/0x40
[  441.019301]  RSP 
[  441.024052] CR2: 0000000000000008

Reported-by: Zhang Yi 
Tested-by: Zhang Yi 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

bio-integrity: Do not allocate integrity context for bio w/o data

2018-05-23T01:36:27+00:00

[ Upstream commit 3116a23bb30272d74ea81baf5d0ee23f602dd15b ]

If bio has no data, such as ones from blkdev_issue_flush(),
then we have nothing to protect.

This patch prevent bugon like follows:

kfree_debugcheck: out of range ptr ac1fa1d106742a5ah
kernel BUG at mm/slab.c:2773!
invalid opcode: 0000 [#1] SMP
Modules linked in: bcache
CPU: 0 PID: 4428 Comm: xfs_io Tainted: G        W       4.11.0-rc4-ext4-00041-g2ef0043-dirty #43
Hardware name: Virtuozzo KVM, BIOS seabios-1.7.5-11.vz7.4 04/01/2014
task: ffff880137786440 task.stack: ffffc90000ba8000
RIP: 0010:kfree_debugcheck+0x25/0x2a
RSP: 0018:ffffc90000babde0 EFLAGS: 00010082
RAX: 0000000000000034 RBX: ac1fa1d106742a5a RCX: 0000000000000007
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013f3ccb40
RBP: ffffc90000babde8 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000fcb76420 R11: 00000000725172ed R12: 0000000000000282
R13: ffffffff8150e766 R14: ffff88013a145e00 R15: 0000000000000001
FS:  00007fb09384bf40(0000) GS:ffff88013f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd0172f9e40 CR3: 0000000137fa9000 CR4: 00000000000006f0
Call Trace:
 kfree+0xc8/0x1b3
 bio_integrity_free+0xc3/0x16b
 bio_free+0x25/0x66
 bio_put+0x14/0x26
 blkdev_issue_flush+0x7a/0x85
 blkdev_fsync+0x35/0x42
 vfs_fsync_range+0x8e/0x9f
 vfs_fsync+0x1c/0x1e
 do_fsync+0x31/0x4a
 SyS_fsync+0x10/0x14
 entry_SYSCALL_64_fastpath+0x1f/0xc2

Reviewed-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Martin K. Petersen 
Signed-off-by: Dmitry Monakhov 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix an error code in add_partition()

2018-05-23T01:36:25+00:00

[ Upstream commit 7bd897cfce1eb373892d35d7f73201b0f9b221c4 ]

We don't set an error code on this path.  It means that we return NULL
instead of an error pointer and the caller does a NULL dereference.

Fixes: 6d1d8050b4bc ("block, partition: add partition_meta_info to hd_struct")
Signed-off-by: Dan Carpenter 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

partitions/msdos: Unable to mount UFS 44bsd partitions

2018-05-23T01:33:55+00:00

[ Upstream commit 5f15684bd5e5ef39d4337988864fec8012471dda ]

UFS partitions from newer versions of FreeBSD 10 and 11 use relative
addressing for their subpartitions. But older versions of FreeBSD still
use absolute addressing just like OpenBSD and NetBSD.

Instead of simply testing for a FreeBSD partition, the code needs to
also test if the starting offset of the C subpartition is zero.

https://bugzilla.kernel.org/show_bug.cgi?id=197733

Signed-off-by: Richard Narron 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-throttle: make sure expire time isn't too big

2018-05-23T01:33:44+00:00

[ Upstream commit 06cceedcca67a93ac7f7aa93bbd9980c7496d14e ]

cgroup could be throttled to a limit but when all cgroups cross high
limit, queue enters a higher state and so the group should be throttled
to a higher limit. It's possible the cgroup is sleeping because of
throttle and other cgroups don't dispatch IO any more. In this case,
nobody can trigger current downgrade/upgrade logic. To fix this issue,
we could either set up a timer to wakeup the cgroup if other cgroups are
idle or make sure this cgroup doesn't sleep too long. Setting up a timer
means we must change the timer very frequently. This patch chooses the
latter. Making cgroup sleep time not too big wouldn't change cgroup
bps/iops, but could make it wakeup more frequently, which isn't a big
issue because throtl_slice * 8 is already quite big.

Signed-off-by: Shaohua Li 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blkcg: fix double free of new_blkg in blkcg_init_queue

2018-03-21T03:49:54+00:00

[ Upstream commit 9b54d816e00425c3a517514e0d677bb3cec49258 ]

If blkg_create fails, new_blkg passed as an argument will
be freed by blkg_create, so there is no need to free it again.

Signed-off-by: Hou Tao 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

more bio_map_user_iov() leak fixes

2018-02-28T22:38:14+00:00

[ Upstream commit 2b04e8f6bbb196cab4b232af0f8d48ff2c7a8058 ]

we need to take care of failure exit as well - pages already
in bio should be dropped by analogue of bio_unmap_pages(),
since their refcounts had been bumped only once per reference
in bio.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Sasha Levin

block: allow WRITE_SAME commands with the SG_IO ioctl

2018-01-17T17:31:44+00:00

[ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ]

The WRITE_SAME commands are not present in the blk_default_cmd_filter
write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
[ sg_io() -> blk_fill_sghdr_rq() > blk_verify_command() -> -EPERM ]

The problem can be reproduced with the sg_write_same command

  # sg_write_same --num 1 --xferlen 512 /dev/sda
  #

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
  #

For comparison, the WRITE_VERIFY command does not observe this problem,
since it is in that list:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
  #

So, this patch adds the WRITE_SAME commands to the list, in order
for the SG_IO ioctl to finish successfully:

  # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
  #

That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
(qemu "-device scsi-block" [1], libvirt "" [2]),
which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
which are translated to write-same calls in the guest kernel, and then into
SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

  [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
  [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
  [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
  [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
  [...] blk_update_request: I/O error, dev sda, sector 17096824

Links:
[1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
[2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

Signed-off-by: Mauricio Faria de Oliveira 
Signed-off-by: Brahadambal Srinivasan 
Reported-by: Manjunatha H R 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

fix unbalanced page refcounting in bio_map_user_iov

2017-11-06T04:54:26+00:00

[ Upstream commit 95d78c28b5a85bacbc29b8dba7c04babb9b0d467 ]

bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Mayatskikh 
Signed-off-by: Al Viro 
Signed-off-by: Sasha Levin

partitions/efi: Fix integer overflow in GPT size calculation

2017-11-06T04:54:15+00:00

[ Upstream commit c5082b70adfe8e1ea1cf4a8eff92c9f260e364d2 ]

If a GUID Partition Table claims to have more than 2**25 entries, the
calculation of the partition table size in alloc_read_gpt_entries() will
overflow a 32-bit integer and not enough space will be allocated for the
table.

Nothing seems to get written out of bounds, but later efi_partition() will
read up to 32768 bytes from a 128 byte buffer, possibly OOPSing or exposing
information to /proc/partitions and uevents.

The problem exists on both 64-bit and 32-bit platforms.

Fix the overflow and also print a meaningful debug message if the table
size is too large.

Signed-off-by: Alden Tondettar 
Acked-by: Ard Biesheuvel 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin