linux-stable.git/block/genhd.c, branch linux-3.1.y

block: make gendisk hold a reference to its queue

2011-11-11T17:44:30+00:00

commit f992ae801a7dec34a4ed99a6598bbbbfb82af4fb upstream.

The following command sequence triggers an oops.

# mount /dev/sdb1 /mnt
# echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete
# umount /mnt

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:

 Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs
 RIP: 0010:[]  [] __lock_acquire+0x389/0x1d60
...
 Call Trace:
  [] lock_acquire+0x95/0x140
  [] _raw_spin_lock+0x3b/0x50
  [] bdi_lock_two+0x5c/0x70
  [] bdev_inode_switch_bdi+0x4c/0xf0
  [] __blkdev_put+0x11b/0x1d0
  [] __blkdev_put+0x160/0x1d0
  [] blkdev_put+0x5f/0x190
  [] kill_block_super+0x4d/0x80
  [] deactivate_locked_super+0x45/0x70
  [] deactivate_super+0x4a/0x70
  [] mntput_no_expire+0xed/0x130
  [] sys_umount+0x7e/0x3a0
  [] system_call_fastpath+0x16/0x1b

This is because bdev holds on to disk but disk doesn't pin the
associated queue.  If a SCSI device is removed while the device is
still open, the sdev puts the base reference to the queue on release.
When the bdev is finally released, the associated queue is already
gone along with the bdi, and bdev_inode_switch_bdi() ends up
dereferencing the already freed bdi.

Even if it were not for this bug, disk not holding onto the associated
queue is very unusual and error-prone.

Fix it by making add_disk() take an extra reference to its queue and
put it on disk_release() and ensuring that disk and its fops owner are
put in that order after all accesses to the disk and queue are
complete.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman
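
The ownership rule described in this commit can be illustrated with a
small userspace sketch. It is not the kernel code: every toy_* name
below is hypothetical, and the queue is reduced to a bare reference
count. It only shows why an extra reference taken when the disk is
added keeps the queue alive after the device drops its base reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a request queue: just a reference count. */
struct toy_queue {
	int refs;
	int *freed;	/* set to 1 when the queue is actually released */
};

/* Hypothetical stand-in for a gendisk. */
struct toy_disk {
	struct toy_queue *queue;
};

static struct toy_queue *toy_queue_alloc(int *freed)
{
	struct toy_queue *q = malloc(sizeof(*q));

	if (!q)
		return NULL;
	q->refs = 1;		/* base reference, owned by the device */
	q->freed = freed;
	*freed = 0;
	return q;
}

static void toy_queue_get(struct toy_queue *q)
{
	assert(q->refs > 0);
	q->refs++;
}

static void toy_queue_put(struct toy_queue *q)
{
	assert(q->refs > 0);
	if (--q->refs == 0) {
		*q->freed = 1;
		free(q);
	}
}

/* Models the idea behind add_disk(): the disk pins its queue. */
static void toy_disk_add(struct toy_disk *d, struct toy_queue *q)
{
	toy_queue_get(q);
	d->queue = q;
}

/* Models the idea behind disk_release(): the pin is dropped last. */
static void toy_disk_release(struct toy_disk *d)
{
	toy_queue_put(d->queue);
	d->queue = NULL;
}

/*
 * Replays the delete-then-umount ordering. Returns 1 if the queue
 * survived device removal and was freed only on disk release.
 */
static int toy_replay(void)
{
	int freed;
	int alive_after_delete;
	struct toy_disk disk;
	struct toy_queue *q = toy_queue_alloc(&freed);

	if (!q)
		return -1;
	toy_disk_add(&disk, q);
	toy_queue_put(q);		/* device deleted: base ref dropped */
	alive_after_delete = !freed;
	toy_disk_release(&disk);	/* last opener gone */
	return alive_after_delete && freed;
}
```

Calling toy_replay() walks through the same order of events as the
command sequence above; without the toy_queue_get() in toy_disk_add(),
the put on device deletion would free the queue while the disk still
points at it.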

block/genhd.c: remove useless cast in diskstats_show()

2011-08-02T10:43:50+00:00

Remove the (unsigned long long) cast in diskstats_show() and adjust the
seq_printf() format string to match 'unsigned long'.

diskstats_show() uses part_stat_read() to get the stats, which either
accesses the specified field in the struct disk_stats directly (non SMP)
or sums up the per CPU values in a variable of the same type as the field,
so in any case the result will have the same type and range as the
specified field, which for all disk_stats entries is unsigned long.

Also, for unsigned long ranges the output of %lu should be identical to
the one of %llu, so no change in the actual proc entry contents.

Signed-off-by: Herbert Poetzl 
Cc: Jens Axboe 
Signed-off-by: Andrew Morton 
Signed-off-by: Jens Axboe
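
The claim that %lu and %llu print the same text for any unsigned long
value can be checked in userspace. The helper below is a hypothetical
sketch (same_output() is not a kernel function); it formats a value
both ways and compares the resulting strings.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if "%lu" and the cast "%llu" format v identically. */
static int same_output(unsigned long v)
{
	char a[32], b[32];
	int na, nb;

	na = snprintf(a, sizeof(a), "%lu", v);
	nb = snprintf(b, sizeof(b), "%llu", (unsigned long long)v);

	/* Both buffers are large enough for any 64-bit decimal value. */
	assert(na > 0 && (size_t)na < sizeof(a));
	assert(nb > 0 && (size_t)nb < sizeof(b));

	return strcmp(a, b) == 0;
}
```

For example, same_output(0) and same_output(ULONG_MAX) both return 1,
because widening an unsigned value never changes its decimal
representation.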

Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-block

2011-07-25T17:33:36+00:00

* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
  block: strict rq_affinity
  backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
  block: fix patch import error in max_discard_sectors check
  block: reorder request_queue to remove 64 bit alignment padding
  CFQ: add think time check for group
  CFQ: add think time check for service tree
  CFQ: move think time check variables to a separate struct
  fixlet: Remove fs_excl from struct task.
  cfq: Remove special treatment for metadata rqs.
  block: document blk_plug list access
  block: avoid building too big plug list
  compat_ioctl: fix make headers_check regression
  block: eliminate potential for infinite loop in blkdev_issue_discard
  compat_ioctl: fix warning caused by qemu
  block: flush MEDIA_CHANGE from drivers on close(2)
  blk-throttle: Make total_nr_queued unsigned
  block: Add __attribute__((format(printf...) and fix fallout
  fs/partitions/check.c: make local symbols static
  block:remove some spare spaces in genhd.c
  block:fix the comment error in blkdev.h
  ...

block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()

2011-07-20T21:10:13+00:00

The RCU callback disk_free_ptbl_rcu_cb() just calls kfree(),
so use kfree_rcu() instead of call_rcu(disk_free_ptbl_rcu_cb).

Signed-off-by: Lai Jiangshan 
Signed-off-by: Paul E. McKenney 
Cc: Jens Axboe 
Reviewed-by: Josh Triplett

block: flush MEDIA_CHANGE from drivers on close(2)

2011-07-01T14:17:47+00:00

Currently, only open(2) is defined as the 'clearing' point.  It has
two roles - first, it's an acknowledgement from userland indicating
that the event has been received and kernel can clear pending states
and proceed to generate more events.  Secondly, it's passed on to
device drivers as a hint indicating that a synchronization point has
been reached and it might want to take a deeper look at the device.

The latter currently is only used by sr which uses two different
mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
to discover events, where the former is lighter weight and safe to be
used repeatedly but may not provide full coverage.  Among other
things, GET_EVENT can't detect media removal while TUR can.

This patch makes close(2) - blkdev_put() - indicate a clearing hint for
MEDIA_CHANGE to drivers.  disk_check_events() is renamed to
disk_flush_events() and updated to take @mask for events to flush,
which is OR'd into ev->clearing and will be passed to the driver on the
next ->check_events() invocation.

This change makes sr generate MEDIA_CHANGE when media is ejected from
userland - e.g. with eject(1).

Note: Given the current usage, it seems @clearing hint is needlessly
complex.  disk_clear_events() can simply clear all events and the hint
can be boolean @flush.

Signed-off-by: Tejun Heo 
Cc: Kay Sievers 
Signed-off-by: Jens Axboe
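
The accumulate-then-hand-over behaviour of the clearing mask can be
sketched in userspace as follows. All names are hypothetical (the
TOY_EVENT_* bits and toy_* functions are not the kernel's), and the
sketch assumes that the accumulated hint is reset once it has been
passed to the driver callback.

```c
#include <assert.h>

/* Hypothetical event bits, for illustration only. */
#define TOY_EVENT_MEDIA_CHANGE	(1u << 0)
#define TOY_EVENT_EJECT_REQUEST	(1u << 1)

struct toy_events {
	unsigned int clearing;	/* hints accumulated since the last check */
	unsigned int last_hint;	/* what the driver saw on its last check */
};

/* Models the flush step: only records the hint, does no checking. */
static void toy_flush_events(struct toy_events *ev, unsigned int mask)
{
	assert(ev);
	ev->clearing |= mask;
}

/* Hypothetical driver callback: remembers the hint it was given. */
static unsigned int toy_driver_check(struct toy_events *ev,
				     unsigned int clearing)
{
	ev->last_hint = clearing;
	return 0;	/* this sketch never reports events */
}

/* Models the next check: hand over accumulated hints, then reset. */
static unsigned int toy_check_events(struct toy_events *ev)
{
	unsigned int clearing;

	assert(ev);
	clearing = ev->clearing;
	ev->clearing = 0;
	return toy_driver_check(ev, clearing);
}
```

Two toy_flush_events() calls with different bits followed by one
toy_check_events() deliver both bits to the driver at once; a second
check with no flush in between delivers an empty hint.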

Merge branch 'for-linus' into for-3.1/core

2011-07-01T14:17:13+00:00

Conflicts:
	block/blk-throttle.c
	block/cfq-iosched.c

Signed-off-by: Jens Axboe

block:remove some spare spaces in genhd.c

2011-06-13T08:45:43+00:00

Remove the end-of-line spaces in genhd.c.

Signed-off-by: Wanlong Gao 
Signed-off-by: Jens Axboe

block: make disk_block_events() properly wait for work cancellation

2011-06-09T18:43:59+00:00

disk_block_events() should guarantee that the event work is not in
flight on return and once blocked it shouldn't issue further
cancellations.

Because there was no synchronization between the first blocker doing
cancel_delayed_work_sync() and the following blockers, the following
blockers could finish before cancellation was complete, which broke
both guarantees - event work could be in flight and cancellation could
happen after return.

This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

Fix it by adding an outer mutex which protects both block count
manipulation and work cancellation.

-v2: Use outer mutex instead of bit waitqueue per Linus.

Signed-off-by: Tejun Heo 
Tested-by: Sitsofe Wheeler 
Reported-by: Sitsofe Wheeler 
Reported-by: Borislav Petkov 
Reported-by: Meelis Roos 
Reported-by: Linus Torvalds 
Cc: Andrew Morton 
Cc: Jens Axboe 
Cc: Kay Sievers 
Signed-off-by: Jens Axboe
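
The shape of the fix, one outer mutex held across both the block count
update and the cancellation, can be sketched in userspace with a
pthread mutex. This is a simplified model under stated assumptions:
the toy_* names are hypothetical, toy_cancel_work_sync() merely stands
in for cancel_delayed_work_sync(), and the usage shown is
single-threaded, so it illustrates the locking structure rather than
reproducing the race.

```c
#include <assert.h>
#include <pthread.h>

struct toy_blocker {
	pthread_mutex_t block_mutex;	/* outer mutex: count and cancel */
	int block;			/* nesting depth of blockers */
	int work_in_flight;		/* 1 while the simulated work runs */
	int cancels;			/* number of cancellations issued */
};

#define TOY_BLOCKER_INIT { PTHREAD_MUTEX_INITIALIZER, 0, 0, 0 }

/* Stand-in for a synchronous cancel: the work is idle on return. */
static void toy_cancel_work_sync(struct toy_blocker *b)
{
	b->work_in_flight = 0;
	b->cancels++;
}

static void toy_block_events(struct toy_blocker *b)
{
	pthread_mutex_lock(&b->block_mutex);
	/* Only the first blocker cancels ... */
	if (b->block++ == 0)
		toy_cancel_work_sync(b);
	/*
	 * ... and later blockers cannot get here until it has finished,
	 * because they wait on block_mutex. So the work is never in
	 * flight when any blocker returns.
	 */
	assert(!b->work_in_flight);
	pthread_mutex_unlock(&b->block_mutex);
}

static void toy_unblock_events(struct toy_blocker *b)
{
	pthread_mutex_lock(&b->block_mutex);
	assert(b->block > 0);
	b->block--;
	pthread_mutex_unlock(&b->block_mutex);
}
```

With the count update and the cancel in the same critical section, a
second caller of toy_block_events() sees a nonzero count only after the
first caller's cancellation has completed, which is the pair of
guarantees the commit message lists.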

block: remove non-syncing __disk_block_events() and fold it into disk_block_events()

2011-06-09T18:43:55+00:00

After the previous update to disk_check_events(), nobody is using
non-syncing __disk_block_events().  Remove @sync and, as this makes
__disk_block_events() virtually identical to disk_block_events(),
remove the underscore prefixed version.

Signed-off-by: Tejun Heo 
Cc: Jens Axboe 
Signed-off-by: Jens Axboe

block: don't use non-syncing event blocking in disk_check_events()

2011-06-09T18:43:54+00:00

This patch is part of the fix for the WARN_ON_ONCE() triggered in
disk_clear_events(), reported in bug#34662.

  https://bugzilla.kernel.org/show_bug.cgi?id=34662

disk_clear_events() blocks events, schedules and flushes the event
work.  It expects the work to have started execution on schedule and
finished on return from flush.  WARN_ON_ONCE() triggers if the event
work hasn't executed as expected.  This problem happens because
__disk_block_events() fails to guarantee, in a race-free manner, that
the event work item is not in flight on return from the function.  The
problem is two-fold and this patch addresses one part of it.

When __disk_block_events() is called with @sync == %false, it bumps the
event block count, calls cancel_delayed_work() and returns.  This makes
it impossible to guarantee that event polling is not in flight on
return from syncing __disk_block_events() - if the first blocker was
non-syncing, polling could still be in progress and later syncing ones
would assume that the first blocker already canceled it.

Making __disk_block_events() cancel_sync regardless of block count
isn't feasible either as it may race with forced event checking in
disk_clear_events().

As disk_check_events() is the only user of non-syncing
__disk_block_events(), updating it to directly cancel and schedule
event work is the easiest way to solve the issue.

Note that there's another bug in __disk_block_events() and this patch
doesn't fix the issue completely.  A later patch will fix the other bug.

Signed-off-by: Tejun Heo 
Tested-by: Sitsofe Wheeler 
Reported-by: Sitsofe Wheeler 
Reported-by: Borislav Petkov 
Reported-by: Meelis Roos 
Reported-by: Linus Torvalds 
Cc: Andrew Morton 
Cc: Jens Axboe 
Cc: Kay Sievers 
Signed-off-by: Jens Axboe