<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/fs/jbd/commit.c, branch v2.6.30</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>jbd: fix race in buffer processing in commit code</title>
<updated>2009-06-09T23:59:03+00:00</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2009-06-09T23:26:26+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a61d90d75d0f9e86432c45b496b4b0fbf0fd03dc'/>
<id>a61d90d75d0f9e86432c45b496b4b0fbf0fd03dc</id>
<content type='text'>
In commit code, we scan buffers attached to a transaction.  During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
 But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head.  So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.

This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;stable@kernel.org&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In commit code, we scan buffers attached to a transaction.  During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
 But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head.  So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.

This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).

Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;stable@kernel.org&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records</title>
<updated>2009-04-14T14:10:47+00:00</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2009-04-14T14:10:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=38d726d153cfe5efe5fe22d28d36ab382dda3a5c'/>
<id>38d726d153cfe5efe5fe22d28d36ab382dda3a5c</id>
<content type='text'>
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The revoke records must be written using the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchornous writes.

Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC</title>
<updated>2009-04-06T15:04:53+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>jens.axboe@oracle.com</email>
</author>
<published>2009-04-06T12:48:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=6c4bac6b3351fd278dc3537dae42f88f733ff12e'/>
<id>6c4bac6b3351fd278dc3537dae42f88f733ff12e</id>
<content type='text'>
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe &lt;jens.axboe@oracle.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe &lt;jens.axboe@oracle.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ext3: Use WRITE_SYNC for commits which are caused by fsync()</title>
<updated>2009-03-28T02:14:27+00:00</updated>
<author>
<name>Theodore Ts'o</name>
<email>tytso@mit.edu</email>
</author>
<published>2009-03-28T02:14:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=512a004382f2c60d5c4f855476ba965adc00250c'/>
<id>512a004382f2c60d5c4f855476ba965adc00250c</id>
<content type='text'>
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o" &lt;tytso@mit.edu&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: improve fsync batching</title>
<updated>2009-01-08T16:31:00+00:00</updated>
<author>
<name>Josef Bacik</name>
<email>jbacik@redhat.com</email>
</author>
<published>2009-01-08T02:07:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f420d4dc4272fd223986762df2ad06056ddebada'/>
<id>f420d4dc4272fd223986762df2ad06056ddebada</id>
<content type='text'>
There is a flaw with the way jbd handles fsync batching.  If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit.  The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going.  Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction.  If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk.  We acheive
sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout.  I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average.  I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.

I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array.  I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk.  I ran the following command

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
results

type	threads		with patch	without patch
sata	2		24.6		26.3
sata	4		49.2		48.1
sata	8		70.1		67.0
sata	16		104.0		94.1
sata	32		153.6		142.7

xserve	2		246.4		222.0
xserve	4		480.0		440.8
xserve	8		829.5		730.8
xserve	16		1172.7		1026.9
xserve	32		1816.3		1650.5

ramdisk	2		2538.3		1745.6
ramdisk	4		2942.3		661.9
ramdisk	8		2882.5		999.8
ramdisk	16		2738.7		1801.9
ramdisk	32		2541.9		2394.0

Signed-off-by: Josef Bacik &lt;jbacik@redhat.com&gt;
Cc: Andreas Dilger &lt;adilger@sun.com&gt;
Cc: Arjan van de Ven &lt;arjan@infradead.org&gt;
Cc: Ric Wheeler &lt;rwheeler@redhat.com&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is a flaw with the way jbd handles fsync batching.  If we fsync() a
file and we were not the last person to run fsync() on this fs then we
automatically sleep for 1 jiffie in order to wait for new writers to join
into the transaction before forcing the commit.  The problem with this is
that with really fast storage (ie a Clariion) the time it takes to commit
a transaction to disk is way faster than 1 jiffie in most cases, so
sleeping means waiting longer with nothing to do than if we just committed
the transaction and kept going.  Ric Wheeler noticed this when using
fs_mark with more than 1 thread, the throughput would plummet as he added
more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and what time
we started the transaction.  If we run an fsync() and we have been running
for less time than it takes to commit the transaction to disk, we sleep
for the delta amount of time and then commit to disk.  We acheive
sub-jiffie sleeping using schedule_hrtimeout.  This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout.  I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average.  I also have a min() check in there to make sure we don't sleep
longer than a jiffie in case our storage is super slow, this was requested
by Andrew.

I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array.  I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, I tested
with a 4 way multipathed Apple Xserve RAID array and of course the
ramdisk.  I ran the following command

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32.  I mkfs'ed the fs each time.  Here are my
results

type	threads		with patch	without patch
sata	2		24.6		26.3
sata	4		49.2		48.1
sata	8		70.1		67.0
sata	16		104.0		94.1
sata	32		153.6		142.7

xserve	2		246.4		222.0
xserve	4		480.0		440.8
xserve	8		829.5		730.8
xserve	16		1172.7		1026.9
xserve	32		1816.3		1650.5

ramdisk	2		2538.3		1745.6
ramdisk	4		2942.3		661.9
ramdisk	8		2882.5		999.8
ramdisk	16		2738.7		1801.9
ramdisk	32		2541.9		2394.0

Signed-off-by: Josef Bacik &lt;jbacik@redhat.com&gt;
Cc: Andreas Dilger &lt;adilger@sun.com&gt;
Cc: Arjan van de Ven &lt;arjan@infradead.org&gt;
Cc: Ric Wheeler &lt;rwheeler@redhat.com&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ext3: add an option to control error handling on file data</title>
<updated>2008-10-20T15:52:37+00:00</updated>
<author>
<name>Hidehiro Kawai</name>
<email>hidehiro.kawai.ez@hitachi.com</email>
</author>
<published>2008-10-19T03:27:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0e4fb5e283870757024294bc4567a7c59d936f0b'/>
<id>0e4fb5e283870757024294bc4567a7c59d936f0b</id>
<content type='text'>
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently.  Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error.  It's scary for mission critical systems.  On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable.  So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.

If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error.  If you mount it with data_err=ignore, it doesn't abort, just
call printk().  data_err=ignore is the default.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Cc: Jan Kara &lt;jack@ucw.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently.  Because most of
applications and commands do buffered writes without fsync(), they don't
notice the IO error.  It's scary for mission critical systems.  On the
other hand, if the journal aborts whenever it gets an IO error in file
data blocks, the system will easily become inoperable.  So this patch
introduces a filesystem option to determine whether it aborts the journal
or just call printk() when it gets an IO error in file data.

If you mount a ext3 fs with data_err=abort option, it aborts on file data
write error.  If you mount it with data_err=ignore, it doesn't abort, just
call printk().  data_err=ignore is the default.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Cc: Jan Kara &lt;jack@ucw.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: don't dirty original metadata buffer on abort</title>
<updated>2008-10-20T15:52:37+00:00</updated>
<author>
<name>Hidehiro Kawai</name>
<email>hidehiro.kawai.ez@hitachi.com</email>
</author>
<published>2008-10-19T03:27:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=885e353c7427db7b60692789741b34e605b0b69b'/>
<id>885e353c7427db7b60692789741b34e605b0b69b</id>
<content type='text'>
Currently, original metadata buffers are dirtied when they are unfiled
whether the journal has aborted or not.  Eventually these buffers will be
written-back to the filesystem by pdflush.  This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts.  So if both journal abort and system crash happen at the same
time, the filesystem would become inconsistent state.  Additionally,
replaying journaled metadata can overwrite the latest metadata on the
filesystem partly.  Because, if the journal aborts, journaled metadata are
preserved and replayed during the next mount not to lose uncheckpointed
metadata.  This would also break the consistency of the filesystem.

This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers.  Thus, no metadata
buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Currently, original metadata buffers are dirtied when they are unfiled
whether the journal has aborted or not.  Eventually these buffers will be
written-back to the filesystem by pdflush.  This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts.  So if both journal abort and system crash happen at the same
time, the filesystem would become inconsistent state.  Additionally,
replaying journaled metadata can overwrite the latest metadata on the
filesystem partly.  Because, if the journal aborts, journaled metadata are
preserved and replayed during the next mount not to lose uncheckpointed
metadata.  This would also break the consistency of the filesystem.

This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers.  Thus, no metadata
buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>jbd: abort when failed to log metadata buffers</title>
<updated>2008-10-20T15:52:36+00:00</updated>
<author>
<name>Hidehiro Kawai</name>
<email>hidehiro.kawai.ez@hitachi.com</email>
</author>
<published>2008-10-19T03:27:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d1645e526a1e5842c9ac433d73419ba886676cf3'/>
<id>d1645e526a1e5842c9ac433d73419ba886676cf3</id>
<content type='text'>
If we failed to write metadata buffers to the journal space and succeeded
to write the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers, abort the
journal before writing the commit record.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
If we failed to write metadata buffers to the journal space and succeeded
to write the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers, abort the
journal before writing the commit record.

Signed-off-by: Hidehiro Kawai &lt;hidehiro.kawai.ez@hitachi.com&gt;
Acked-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: &lt;linux-ext4@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>fs: rename buffer trylock</title>
<updated>2008-08-05T04:56:09+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2008-08-02T10:02:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ca5de404ff036a29b25e9a83f6919c9f606c5841'/>
<id>ca5de404ff036a29b25e9a83f6919c9f606c5841</id>
<content type='text'>
Like the page lock change, this also requires name change, so convert the
raw test_and_set bitop to a trylock.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Like the page lock change, this also requires name change, so convert the
raw test_and_set bitop to a trylock.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: rename page trylock</title>
<updated>2008-08-05T04:31:34+00:00</updated>
<author>
<name>Nick Piggin</name>
<email>npiggin@suse.de</email>
</author>
<published>2008-08-02T10:01:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=529ae9aaa08378cfe2a4350bded76f32cc8ff0ce'/>
<id>529ae9aaa08378cfe2a4350bded76f32cc8ff0ce</id>
<content type='text'>
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock =&gt; trylock_page, SetPageLocked =&gt; set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Acked-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock =&gt; trylock_page, SetPageLocked =&gt; set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin &lt;npiggin@suse.de&gt;
Acked-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Acked-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Acked-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
