linux.git/arch/x86/include/asm, branch v3.15

x86_64: expand kernel stack to 16K

2014-05-30T18:52:51+00:00

While I play inhouse patches with much memory pressure on qemu-kvm,
3.14 kernel was randomly crashed. The reason was kernel stack overflow.

When I investigated the problem, the callstack was a little bit deeper
by involve with reclaim functions but not direct reclaim path.

I tried to diet stack size of some functions related with alloc/reclaim
so did a hundred of byte but overflow was't disappeard so that I encounter
overflow by another deeper callstack on reclaim/allocator path.

Of course, we might sweep every sites we have found for reducing
stack usage but I'm not sure how long it saves the world(surely,
lots of developer start to add nice features which will use stack
agains) and if we consider another more complex feature in I/O layer
and/or reclaim path, it might be better to increase stack size(
meanwhile, stack usage on 64bit machine was doubled compared to 32bit
while it have sticked to 8K. Hmm, it's not a fair to me and arm64
already expaned to 16K. )

So, my stupid idea is just let's expand stack size and keep an eye
toward stack consumption on each kernel functions via stacktrace of ftrace.
For example, we can have a bar like that each funcion shouldn't exceed 200K
and emit the warning when some function consumes more in runtime.
Of course, it could make false positive but at least, it could make a
chance to think over it.

I guess this topic was discussed several time so there might be
strong reason not to increase kernel stack size on x86_64, for me not
knowing so Ccing x86_64 maintainers, other MM guys and virtio
maintainers.

Here's an example call trace using up the kernel stack:

         Depth    Size   Location    (51 entries)
         -----    ----   --------
   0)     7696      16   lookup_address
   1)     7680      16   _lookup_address_cpa.isra.3
   2)     7664      24   __change_page_attr_set_clr
   3)     7640     392   kernel_map_pages
   4)     7248     256   get_page_from_freelist
   5)     6992     352   __alloc_pages_nodemask
   6)     6640       8   alloc_pages_current
   7)     6632     168   new_slab
   8)     6464       8   __slab_alloc
   9)     6456      80   __kmalloc
  10)     6376     376   vring_add_indirect
  11)     6000     144   virtqueue_add_sgs
  12)     5856     288   __virtblk_add_req
  13)     5568      96   virtio_queue_rq
  14)     5472     128   __blk_mq_run_hw_queue
  15)     5344      16   blk_mq_run_hw_queue
  16)     5328      96   blk_mq_insert_requests
  17)     5232     112   blk_mq_flush_plug_list
  18)     5120     112   blk_flush_plug_list
  19)     5008      64   io_schedule_timeout
  20)     4944     128   mempool_alloc
  21)     4816      96   bio_alloc_bioset
  22)     4720      48   get_swap_bio
  23)     4672     160   __swap_writepage
  24)     4512      32   swap_writepage
  25)     4480     320   shrink_page_list
  26)     4160     208   shrink_inactive_list
  27)     3952     304   shrink_lruvec
  28)     3648      80   shrink_zone
  29)     3568     128   do_try_to_free_pages
  30)     3440     208   try_to_free_pages
  31)     3232     352   __alloc_pages_nodemask
  32)     2880       8   alloc_pages_current
  33)     2872     200   __page_cache_alloc
  34)     2672      80   find_or_create_page
  35)     2592      80   ext4_mb_load_buddy
  36)     2512     176   ext4_mb_regular_allocator
  37)     2336     128   ext4_mb_new_blocks
  38)     2208     256   ext4_ext_map_blocks
  39)     1952     160   ext4_map_blocks
  40)     1792     384   ext4_writepages
  41)     1408      16   do_writepages
  42)     1392      96   __writeback_single_inode
  43)     1296     176   writeback_sb_inodes
  44)     1120      80   __writeback_inodes_wb
  45)     1040     160   wb_writeback
  46)      880     208   bdi_writeback_workfn
  47)      672     144   process_one_work
  48)      528     112   worker_thread
  49)      416     240   kthread
  50)      176     176   ret_from_fork

[ Note: the problem is exacerbated by certain gcc versions that seem to
  generate much bigger stack frames due to apparently bad coalescing of
  temporaries and generating too many spills.  Rusty saw gcc-4.6.4 using
  35% more stack on the virtio path than 4.8.2 does, for example.

  Minchan not only uses such a bad gcc version (4.6.3 in his case), but
  some of the stack use is due to debugging (CONFIG_DEBUG_PAGEALLOC is
  what causes that kernel_map_pages() frame, for example). But we're
  clearly getting too close.

  The VM code also seems to have excessive stack frames partly for the
  same compiler reason, triggered by excessive inlining and lots of
  function arguments.

  We need to improve on our stack use, but in the meantime let's do this
  simple stack increase too.  Unlike most earlier reports, there is
  nothing simple that stands out as being really horribly wrong here,
  apart from the fact that the stack frames are just bigger than they
  should need to be.        - Linus ]

Signed-off-by: Minchan Kim 
Cc: Peter Anvin 
Cc: Dave Chinner 
Cc: Dave Jones 
Cc: Jens Axboe 
Cc: Andrew Morton 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Mel Gorman 
Cc: Rik van Riel 
Cc: Johannes Weiner 
Cc: Hugh Dickins 
Cc: Rusty Russell 
Cc: Michael S Tsirkin 
Cc: Dave Hansen 
Cc: Steven Rostedt 
Cc: PJ Waskiewicz 
Signed-off-by: Linus Torvalds

x86, mm, hugetlb: Add missing TLB page invalidation for hugetlb_cow()

2014-05-13T23:34:09+00:00

The invalidation is required in order to maintain proper semantics
under CoW conditions. In scenarios where a process clones several
threads, a thread operating on a core whose DTLB entry for a
particular hugepage has not been invalidated, will be reading from
the hugepage that belongs to the forked child process, even after
hugetlb_cow().

The thread will not see the updated page as long as the stale DTLB
entry remains cached, the thread attempts to write into the page,
the child process exits, or the thread gets migrated to a different
processor.

Signed-off-by: Anthony Iliopoulos 
Link: http://lkml.kernel.org/r/20140514092948.GA17391@server-36.huawei.corp
Suggested-by: Shay Goikhman 
Acked-by: Dave Hansen 
Signed-off-by: H. Peter Anvin 
Cc:  # v2.6.16+ (!)

x86/hpet: Make boot_hpet_disable extern

2014-05-08T06:15:34+00:00

HPET on some platform has accuracy problem. Making
"boot_hpet_disable" extern so that we can runtime disable
the HPET timer by using quirk to check the platform.

Signed-off-by: Feng Tang 
Cc: Clemens Ladisch 
Cc: John Stultz 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1398327498-13163-1-git-send-email-feng.tang@intel.com
Signed-off-by: Ingo Molnar

Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

2014-04-14T23:21:28+00:00

Pull KVM fixes from Marcelo Tosatti:
 - Fix for guest triggerable BUG_ON (CVE-2014-0155)
 - CR4.SMAP support
 - Spurious WARN_ON() fix

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: remove WARN_ON from get_kernel_ns()
  KVM: Rename variable smep to cr4_smep
  KVM: expose SMAP feature to guest
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: Add SMAP support when setting CR4
  KVM: Remove SMAP bit from CR4_RESERVED_BITS
  KVM: ioapic: try to recover if pending_eoi goes out of range
  KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)

KVM: Remove SMAP bit from CR4_RESERVED_BITS

2014-04-14T20:50:33+00:00

This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu 
Signed-off-by: Marcelo Tosatti

Merge git://git.infradead.org/users/eparis/audit

2014-04-12T19:38:53+00:00

Pull audit updates from Eric Paris.

* git://git.infradead.org/users/eparis/audit: (28 commits)
  AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
  audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
  audit: do not cast audit_rule_data pointers pointlesly
  AUDIT: Allow login in non-init namespaces
  audit: define audit_is_compat in kernel internal header
  kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
  sched: declare pid_alive as inline
  audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
  syscall_get_arch: remove useless function arguments
  audit: remove stray newline from audit_log_execve_info() audit_panic() call
  audit: remove stray newlines from audit_log_lost messages
  audit: include subject in login records
  audit: remove superfluous new- prefix in AUDIT_LOGIN messages
  audit: allow user processes to log from another PID namespace
  audit: anchor all pid references in the initial pid namespace
  audit: convert PPIDs to the inital PID namespace.
  pid: get pid_t ppid of task in init_pid_ns
  audit: rename the misleading audit_get_context() to audit_take_context()
  audit: Add generic compat syscall support
  audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
  ...

x86: use generic early_ioremap

2014-04-07T23:36:15+00:00

Move x86 over to the generic early ioremap implementation.

Signed-off-by: Mark Salter 
Acked-by: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Dave Young 
Cc: Will Deacon 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86/mm: sparse warning fix for early_memremap

2014-04-07T23:36:14+00:00

This patch series takes the common bits from the x86 early ioremap
implementation and creates a generic implementation which may be used by
other architectures.  The early ioremap interfaces are intended for
situations where boot code needs to make temporary virtual mappings
before the normal ioremap interfaces are available.  Typically, this
means before paging_init() has run.

This patch (of 6):

There's a lot of sparse warnings for code like below: void *a =
early_memremap(phys_addr, size);

early_memremap intend to map kernel memory with ioremap facility, the
return pointer should be a kernel ram pointer instead of iomem one.

For making the function clearer and supressing sparse warnings this patch
do below two things:
1. cast to (__force void *) for the return value of early_memremap
2. add early_memunmap function and pass (__force void __iomem *) to iounmap

From Boris:
  "Ingo told me yesterday, it makes sense too.  I'd guess we can try it.
   FWIW, all callers of early_memremap use the memory they get remapped
   as normal memory so we should be safe"

Signed-off-by: Dave Young 
Signed-off-by: Mark Salter 
Acked-by: H. Peter Anvin 
Cc: Borislav Petkov 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

percpu: add raw_cpu_ops

2014-04-07T23:36:13+00:00

The kernel has never been audited to ensure that this_cpu operations are
consistently used throughout the kernel.  The code generated in many
places can be improved through the use of this_cpu operations (which
uses a segment register for relocation of per cpu offsets instead of
performing address calculations).

The patch set also addresses various consistency issues in general with
the per cpu macros.

A. The semantics of __this_cpu_ptr() differs from this_cpu_ptr only
   because checks are skipped. This is typically shown through a raw_
   prefix. So this patch set changes the places where __this_cpu_ptr()
   is used to raw_cpu_ptr().

B. There has been the long term wish by some that __this_cpu operations
   would check for preemption. However, there are cases where preemption
   checks need to be skipped. This patch set adds raw_cpu operations that
   do not check for preemption and then adds preemption checks to the
   __this_cpu operations.

C. The use of __get_cpu_var is always a reference to a percpu variable
   that can also be handled via a this_cpu operation. This patch set
   replaces all uses of __get_cpu_var with this_cpu operations.

D. We can then use this_cpu RMW operations in various places replacing
   sequences of instructions by a single one.

E. The use of this_cpu operations throughout will allow other arches than
   x86 to implement optimized references and RMV operations to work with
   per cpu local data.

F. The use of this_cpu operations opens up the possibility to
   further optimize code that relies on synchronization through
   per cpu data.

The patch set works in a couple of stages:

I. Patch 1 adds the additional raw_cpu operations and raw_cpu_ptr().
    Also converts the existing __this_cpu_xx_# primitive in the x86
    code to raw_cpu_xx_#.

II. Patch 2-4 use the raw_cpu operations in places that would give
     us false positives once they are enabled.

III. Patch 5 adds preemption checks to __this_cpu operations to allow
    checking if preemption is properly disabled when these functions
    are used.

IV. Patches 6-20 are patches that simply replace uses of __get_cpu_var
   with this_cpu_ptr. They do not depend on any changes to the percpu
   code. No preemption tests are skipped if they are applied.

V. Patches 21-46 are conversion patches that use this_cpu operations
   in various kernel subsystems/drivers or arch code.

VI.  Patches 47/48 (not included in this series) remove no longer used
    functions (__this_cpu_ptr and __get_cpu_var).  These should only be
    applied after all the conversion patches have made it and after we
    have done additional passes through the kernel to ensure that none of
    the uses of these functions remain.

This patch (of 46):

The patches following this one will add preemption checks to __this_cpu
ops so we need to have an alternative way to use this_cpu operations
without preemption checks.

raw_cpu_ops will be the basis for all other ops since these will be the
operations that do not implement any checks.

Primitive operations are renamed by this patch from __this_cpu_xxx to
raw_cpu_xxxx.

Also change the uses of the x86 percpu primitives in preempt.h.
These depend directly on asm/percpu.h (header #include nesting issue).

Signed-off-by: Peter Zijlstra 
Signed-off-by: Christoph Lameter 
Acked-by: Ingo Molnar 
Cc: Tejun Heo 
Cc: "James E.J. Bottomley" 
Cc: "Paul E. McKenney" 
Cc: Alex Shi 
Cc: Arnd Bergmann 
Cc: Benjamin Herrenschmidt 
Cc: Bryan Wu 
Cc: Catalin Marinas 
Cc: Chris Metcalf 
Cc: Daniel Lezcano 
Cc: David Daney 
Cc: David Miller 
Cc: David S. Miller 
Cc: Dimitri Sivanich 
Cc: Dipankar Sarma 
Cc: Eric Dumazet 
Cc: Fenghua Yu 
Cc: Frederic Weisbecker 
Cc: Greg Kroah-Hartman 
Cc: H. Peter Anvin 
Cc: Haavard Skinnemoen 
Cc: Hans-Christian Egtvedt 
Cc: Hedi Berriche 
Cc: Heiko Carstens 
Cc: Helge Deller 
Cc: Ivan Kokshaysky 
Cc: James Hogan 
Cc: Jens Axboe 
Cc: John Stultz 
Cc: Martin Schwidefsky 
Cc: Masami Hiramatsu 
Cc: Matt Turner 
Cc: Mike Frysinger 
Cc: Mike Travis 
Cc: Neil Brown 
Cc: Nicolas Pitre 
Cc: Paul Mackerras 
Cc: Paul Mundt 
Cc: Rafael J. Wysocki 
Cc: Ralf Baechle 
Cc: Richard Henderson 
Cc: Robert Richter 
Cc: Russell King 
Cc: Russell King 
Cc: Rusty Russell 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: Tony Luck 
Cc: Will Deacon 
Cc: Wim Van Sebroeck 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

x86: always define BUG() and HAVE_ARCH_BUG, even with !CONFIG_BUG

2014-04-07T23:36:10+00:00

This ensures that BUG() always has a definition that causes a trap (via
an undefined instruction), and that the compiler still recognizes the
code following BUG() as unreachable, avoiding warnings that would
otherwise appear (such as on non-void functions that don't return a
value after BUG()).

In addition to saving a few bytes over the generic infinite-loop
implementation, this implementation traps rather than looping, which
potentially allows for better error-recovery behavior (such as by
rebooting).

Signed-off-by: Josh Triplett 
Reported-by: Arnd Bergmann 
Acked-by: Arnd Bergmann 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: "H. Peter Anvin" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds