<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/x86/kernel/kvmclock.c, branch v3.14</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>pvclock: detect watchdog reset at pvclock read</title>
<updated>2013-11-06T07:48:43+00:00</updated>
<author>
<name>Marcelo Tosatti</name>
<email>mtosatti@redhat.com</email>
</author>
<published>2013-10-12T00:39:25+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=d63285e94af3ade4fa8b10b0d9a22bcf72baf2f9'/>
<id>d63285e94af3ade4fa8b10b0d9a22bcf72baf2f9</id>
<content type='text'>
Implement reset of kernel watchdogs at pvclock read time. This avoids
adding special code to every watchdog.

This is possible for watchdogs which measure time based on sched_clock() or
ktime_get() variants.

Suggested by Don Zickus.

Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Acked-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: Gleb Natapov &lt;gleb@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Implement reset of kernel watchdogs at pvclock read time. This avoids
adding special code to every watchdog.

This is possible for watchdogs which measure time based on sched_clock() or
ktime_get() variants.

Suggested by Don Zickus.

Acked-by: Don Zickus &lt;dzickus@redhat.com&gt;
Acked-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: Gleb Natapov &lt;gleb@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: delete __cpuinit usage from all x86 files</title>
<updated>2013-07-14T23:36:56+00:00</updated>
<author>
<name>Paul Gortmaker</name>
<email>paul.gortmaker@windriver.com</email>
</author>
<published>2013-06-18T22:23:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=148f9bb87745ed45f7a11b2cbd3bc0f017d5d257'/>
<id>148f9bb87745ed45f7a11b2cbd3bc0f017d5d257</id>
<content type='text'>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit  -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings.  In any case, they are temporary and harmless.

This removes all the arch/x86 uses of the __cpuinit macros from
all C files.  x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: x86@kernel.org
Acked-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
Signed-off-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

Note that some harmless section mismatch warnings may result, since
notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c)
are flagged as __cpuinit  -- so if we remove the __cpuinit from
arch specific callers, we will also get section mismatch warnings.
As an intermediate step, we intend to turn the linux/init.h cpuinit
content into no-ops as early as possible, since that will get rid
of these warnings.  In any case, they are temporary and harmless.

This removes all the arch/x86 uses of the __cpuinit macros from
all C files.  x86 only had the one __CPUINIT used in assembly files,
and it wasn't paired off with a .previous or a __FINIT, so we can
delete it directly w/o any corresponding additional change there.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: "H. Peter Anvin" &lt;hpa@zytor.com&gt;
Cc: x86@kernel.org
Acked-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Acked-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Acked-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
Signed-off-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'timers/posix-cpu-timers-for-tglx' of</title>
<updated>2013-07-04T21:11:22+00:00</updated>
<author>
<name>Thomas Gleixner</name>
<email>tglx@linutronix.de</email>
</author>
<published>2013-07-04T21:11:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2b0f89317e99735bbf32eaede81f707f98ab1b5e'/>
<id>2b0f89317e99735bbf32eaede81f707f98ab1b5e</id>
<content type='text'>
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core

Frederic sayed: "Most of these patches have been hanging around for
several month now, in -mmotm for a significant chunk. They already
missed a few releases."

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/core

Frederic sayed: "Most of these patches have been hanging around for
several month now, in -mmotm for a significant chunk. They already
missed a few releases."

Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: kvmclock: zero initialize pvclock shared memory area</title>
<updated>2013-06-19T10:25:28+00:00</updated>
<author>
<name>Igor Mammedov</name>
<email>imammedo@redhat.com</email>
</author>
<published>2013-06-10T16:31:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=07868fc6aaf57847b0f3a3d53086b7556eb83f4a'/>
<id>07868fc6aaf57847b0f3a3d53086b7556eb83f4a</id>
<content type='text'>
kernel might hung in pvclock_clocksource_read() due to
uninitialized memory might contain odd version value in
following cycle:

        do {
                version = __pvclock_read_cycles(src, &amp;ret, &amp;flags);
        } while ((src-&gt;version &amp; 1) || version != src-&gt;version);

if secondary kvmclock is accessed before it's registered with kvm.

Clear garbage in pvclock shared memory area right after it's
allocated to avoid this issue.

Ref: https://bugzilla.kernel.org/show_bug.cgi?id=59521
Signed-off-by: Igor Mammedov &lt;imammedo@redhat.com&gt;
[See BZ for analysis.  We may want a different fix for 3.11, but
 this is the safest for now - Paolo]
Cc: &lt;stable@vger.kernel.org&gt; # 3.8
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
kernel might hung in pvclock_clocksource_read() due to
uninitialized memory might contain odd version value in
following cycle:

        do {
                version = __pvclock_read_cycles(src, &amp;ret, &amp;flags);
        } while ((src-&gt;version &amp; 1) || version != src-&gt;version);

if secondary kvmclock is accessed before it's registered with kvm.

Clear garbage in pvclock shared memory area right after it's
allocated to avoid this issue.

Ref: https://bugzilla.kernel.org/show_bug.cgi?id=59521
Signed-off-by: Igor Mammedov &lt;imammedo@redhat.com&gt;
[See BZ for analysis.  We may want a different fix for 3.11, but
 this is the safest for now - Paolo]
Cc: &lt;stable@vger.kernel.org&gt; # 3.8
Signed-off-by: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: Increase precision of x86_platform.get/set_wallclock()</title>
<updated>2013-05-28T21:00:59+00:00</updated>
<author>
<name>David Vrabel</name>
<email>david.vrabel@citrix.com</email>
</author>
<published>2013-05-13T17:56:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3565184ed0c1ea46bea5b792da5f72a83c43e49b'/>
<id>3565184ed0c1ea46bea5b792da5f72a83c43e49b</id>
<content type='text'>
All the virtualized platforms (KVM, lguest and Xen) have persistent
wallclocks that have more than one second of precision.

read_persistent_wallclock() and update_persistent_wallclock() allow
for nanosecond precision but their implementation on x86 with
x86_platform.get/set_wallclock() only allows for one second precision.
This means guests may see a wallclock time that is off by up to 1
second.

Make set_wallclock() and get_wallclock() take a struct timespec
parameter (which allows for nanosecond precision) so KVM and Xen
guests may start with a more accurate wallclock time and a Xen dom0
can maintain a more accurate wallclock for guests.

Signed-off-by: David Vrabel &lt;david.vrabel@citrix.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
All the virtualized platforms (KVM, lguest and Xen) have persistent
wallclocks that have more than one second of precision.

read_persistent_wallclock() and update_persistent_wallclock() allow
for nanosecond precision but their implementation on x86 with
x86_platform.get/set_wallclock() only allows for one second precision.
This means guests may see a wallclock time that is off by up to 1
second.

Make set_wallclock() and get_wallclock() take a struct timespec
parameter (which allows for nanosecond precision) so KVM and Xen
guests may start with a more accurate wallclock time and a Xen dom0
can maintain a more accurate wallclock for guests.

Signed-off-by: David Vrabel &lt;david.vrabel@citrix.com&gt;
Signed-off-by: John Stultz &lt;john.stultz@linaro.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'master' into queue</title>
<updated>2013-03-04T23:10:32+00:00</updated>
<author>
<name>Marcelo Tosatti</name>
<email>mtosatti@redhat.com</email>
</author>
<published>2013-03-04T23:10:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ee2c25efdd46d7ed5605d6fe877bdf4b47a4ab2e'/>
<id>ee2c25efdd46d7ed5605d6fe877bdf4b47a4ab2e</id>
<content type='text'>
* master: (15791 commits)
  Linux 3.9-rc1
  btrfs/raid56: Add missing #include &lt;linux/vmalloc.h&gt;
  fix compat_sys_rt_sigprocmask()
  SUNRPC: One line comment fix
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  metag: Provide dma_get_sgtable()
  metag: prom.h: remove declaration of metag_dt_memblock_reserve()
  metag: copy devicetree to non-init memory
  metag: cleanup metag_ksyms.c includes
  metag: move mm/init.c exports out of metag_ksyms.c
  metag: move usercopy.c exports out of metag_ksyms.c
  metag: move setup.c exports out of metag_ksyms.c
  metag: move kick.c exports out of metag_ksyms.c
  metag: move traps.c exports out of metag_ksyms.c
  metag: move irq enable out of irqflags.h on SMP
  ...

Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;

Conflicts:
	arch/x86/kernel/kvmclock.c
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* master: (15791 commits)
  Linux 3.9-rc1
  btrfs/raid56: Add missing #include &lt;linux/vmalloc.h&gt;
  fix compat_sys_rt_sigprocmask()
  SUNRPC: One line comment fix
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  metag: Provide dma_get_sgtable()
  metag: prom.h: remove declaration of metag_dt_memblock_reserve()
  metag: copy devicetree to non-init memory
  metag: cleanup metag_ksyms.c includes
  metag: move mm/init.c exports out of metag_ksyms.c
  metag: move usercopy.c exports out of metag_ksyms.c
  metag: move setup.c exports out of metag_ksyms.c
  metag: move kick.c exports out of metag_ksyms.c
  metag: move traps.c exports out of metag_ksyms.c
  metag: move irq enable out of irqflags.h on SMP
  ...

Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;

Conflicts:
	arch/x86/kernel/kvmclock.c
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: kvmclock: Do not setup kvmclock vsyscall in the absence of that clock</title>
<updated>2013-02-27T11:19:18+00:00</updated>
<author>
<name>Jan Kiszka</name>
<email>jan.kiszka@siemens.com</email>
</author>
<published>2013-02-23T16:05:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=fe1140cc369410a9c206fdb7aaabc644bd213dc2'/>
<id>fe1140cc369410a9c206fdb7aaabc644bd213dc2</id>
<content type='text'>
This fixes boot lockups with "no-kvmclock", when the host is not
exposing this particular feature (QEMU: -cpu ...,-kvmclock) or when
the kvmclock initialization failed for whatever reason.

Reviewed-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: Jan Kiszka &lt;jan.kiszka@siemens.com&gt;
Signed-off-by: Gleb Natapov &lt;gleb@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This fixes boot lockups with "no-kvmclock", when the host is not
exposing this particular feature (QEMU: -cpu ...,-kvmclock) or when
the kvmclock initialization failed for whatever reason.

Reviewed-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: Jan Kiszka &lt;jan.kiszka@siemens.com&gt;
Signed-off-by: Gleb Natapov &lt;gleb@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm</title>
<updated>2013-02-24T21:07:18+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-02-24T21:07:18+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=89f883372fa60f604d136924baf3e89ff1870e9e'/>
<id>89f883372fa60f604d136924baf3e89ff1870e9e</id>
<content type='text'>
Pull KVM updates from Marcelo Tosatti:
 "KVM updates for the 3.9 merge window, including x86 real mode
  emulation fixes, stronger memory slot interface restrictions, mmu_lock
  spinlock hold time reduction, improved handling of large page faults
  on shadow, initial APICv HW acceleration support, s390 channel IO
  based virtio, amongst others"

* tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  Revert "KVM: MMU: lazily drop large spte"
  x86: pvclock kvm: align allocation size to page size
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
  x86 emulator: fix parity calculation for AAD instruction
  KVM: PPC: BookE: Handle alignment interrupts
  booke: Added DBCR4 SPR number
  KVM: PPC: booke: Allow multiple exception types
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: Remove user_alloc from struct kvm_memory_slot
  KVM: VMX: disable apicv by default
  KVM: s390: Fix handling of iscs.
  KVM: MMU: cleanup __direct_map
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: lazily drop large spte
  KVM: VMX: cleanup vmx_set_cr0().
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: disable SMEP feature when guest is in non-paging mode
  KVM: Remove duplicate text in api.txt
  Revert "KVM: MMU: split kvm_mmu_free_page"
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull KVM updates from Marcelo Tosatti:
 "KVM updates for the 3.9 merge window, including x86 real mode
  emulation fixes, stronger memory slot interface restrictions, mmu_lock
  spinlock hold time reduction, improved handling of large page faults
  on shadow, initial APICv HW acceleration support, s390 channel IO
  based virtio, amongst others"

* tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  Revert "KVM: MMU: lazily drop large spte"
  x86: pvclock kvm: align allocation size to page size
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
  x86 emulator: fix parity calculation for AAD instruction
  KVM: PPC: BookE: Handle alignment interrupts
  booke: Added DBCR4 SPR number
  KVM: PPC: booke: Allow multiple exception types
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: Remove user_alloc from struct kvm_memory_slot
  KVM: VMX: disable apicv by default
  KVM: s390: Fix handling of iscs.
  KVM: MMU: cleanup __direct_map
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: lazily drop large spte
  KVM: VMX: cleanup vmx_set_cr0().
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: disable SMEP feature when guest is in non-paging mode
  KVM: Remove duplicate text in api.txt
  Revert "KVM: MMU: split kvm_mmu_free_page"
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>x86: pvclock kvm: align allocation size to page size</title>
<updated>2013-02-19T02:05:19+00:00</updated>
<author>
<name>Marcelo Tosatti</name>
<email>mtosatti@redhat.com</email>
</author>
<published>2013-02-19T01:58:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ed55705dd5008b408c48a8459b8b34b01f3de985'/>
<id>ed55705dd5008b408c48a8459b8b34b01f3de985</id>
<content type='text'>
To match whats mapped via vsyscalls to userspace.

Reported-by: Peter Hurley &lt;peter@hurleysoftware.com&gt;
Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
To match whats mapped via vsyscalls to userspace.

Reported-by: Peter Hurley &lt;peter@hurleysoftware.com&gt;
Signed-off-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>x86, kvm: Fix kvm's use of __pa() on percpu areas</title>
<updated>2013-01-26T00:34:55+00:00</updated>
<author>
<name>Dave Hansen</name>
<email>dave@linux.vnet.ibm.com</email>
</author>
<published>2013-01-22T21:24:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d'/>
<id>5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d</id>
<content type='text'>
In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug manifested first by me seeing a plain hang at boot
after this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults in
to the guest which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes a physical address of an per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
bascially bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.

Signed-off-by: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Link: http://lkml.kernel.org/r/20130122212435.4905663F@kernel.stglabs.ibm.com
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
In short, it is illegal to call __pa() on an address holding
a percpu variable.  This replaces those __pa() calls with
slow_virt_to_phys().  All of the cases in this patch are
in boot time (or CPU hotplug time at worst) code, so the
slow pagetable walking in slow_virt_to_phys() is not expected
to have a performance impact.

The times when this actually matters are pretty obscure
(certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because
the real hardware is getting harder and harder to find.

This bug manifested first by me seeing a plain hang at boot
after this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults in
to the guest which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes a physical address of an per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
bascially bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.  The behavior varied,
though, depending on what got corrupted by the bad write.

Signed-off-by: Dave Hansen &lt;dave@linux.vnet.ibm.com&gt;
Link: http://lkml.kernel.org/r/20130122212435.4905663F@kernel.stglabs.ibm.com
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Reviewed-by: Marcelo Tosatti &lt;mtosatti@redhat.com&gt;
Signed-off-by: H. Peter Anvin &lt;hpa@linux.intel.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
