linux-stable.git/include/linux/mmzone.h, branch v3.16.67

KAISER: Kernel Address Isolation

2018-01-09T00:35:13+00:00

This patch introduces our implementation of KAISER (Kernel Address Isolation to
have Side-channels Efficiently Removed), a kernel isolation technique to close
hardware side channels on kernel address information.

More information about the patch can be found on:

        https://github.com/IAIK/KAISER

From: Richard Fellner 
From: Daniel Gruss 
Subject: [RFC, PATCH] x86_64: KAISER - do not map kernel in user mode
Date: Thu, 4 May 2017 14:26:50 +0200
Link: http://marc.info/?l=linux-kernel&m=149390087310405&w=2
Kaiser-4.10-SHA1: c4b1831d44c6144d3762ccc72f0c4e71a0c713e5

Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Ingo Molnar 

After several recent works [1,2,3] KASLR on x86_64 was basically
considered dead by many researchers. We have been working on an
efficient but effective fix for this problem and found that not mapping
the kernel space when running in user mode is the solution to this
problem [4] (the corresponding paper [5] will be presented at ESSoS17).

With this RFC patch we allow anybody to configure their kernel with the
flag CONFIG_KAISER to add our defense mechanism.

If there are any questions we would love to answer them.
We also appreciate any comments!

Cheers,
Daniel (+ the KAISER team from Graz University of Technology)

[1] http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
[2] https://www.blackhat.com/docs/us-16/materials/us-16-Fogh-Using-Undocumented-CPU-Behaviour-To-See-Into-Kernel-Mode-And-Break-KASLR-In-The-Process.pdf
[3] https://www.blackhat.com/docs/us-16/materials/us-16-Jang-Breaking-Kernel-Address-Space-Layout-Randomization-KASLR-With-Intel-TSX.pdf
[4] https://github.com/IAIK/KAISER
[5] https://gruss.cc/files/kaiser.pdf

(cherry picked from Change-Id: I0eb000c33290af01fc4454ca0c701d00f1d30b1d)

Conflicts:
arch/x86/entry/entry_64.S (not in this tree)
arch/x86/kernel/entry_64.S (patched instead of that)
arch/x86/entry/entry_64_compat.S (not in this tree)
arch/x86/ia32/ia32entry.S (patched instead of that)
arch/x86/include/asm/hw_irq.h
arch/x86/include/asm/pgtable_types.h
arch/x86/include/asm/processor.h
arch/x86/kernel/irqinit.c
arch/x86/kernel/process.c
arch/x86/mm/Makefile
arch/x86/mm/pgtable.c
init/main.c

Signed-off-by: Hugh Dickins 
[bwh: Folded in the follow-up patches from Hugh:
 - kaiser: merged update
 - kaiser: do not set _PAGE_NX on pgd_none
 - kaiser: stack map PAGE_SIZE at THREAD_SIZE-PAGE_SIZE
 - kaiser: fix build and FIXME in alloc_ldt_struct()
 - kaiser: KAISER depends on SMP
 - kaiser: fix regs to do_nmi() ifndef CONFIG_KAISER
 - kaiser: fix perf crashes
 - kaiser: ENOMEM if kaiser_pagetable_walk() NULL
 - kaiser: tidied up asm/kaiser.h somewhat
 - kaiser: tidied up kaiser_add/remove_mapping slightly
 - kaiser: kaiser_remove_mapping() move along the pgd
 - kaiser: align addition to x86/mm/Makefile
 - kaiser: cleanups while trying for gold link
 - kaiser: name that 0x1000 KAISER_SHADOW_PGD_OFFSET
 - kaiser: delete KAISER_REAL_SWITCH option
 - kaiser: vmstat show NR_KAISERTABLE as nr_overhead
 - kaiser: enhanced by kernel and user PCIDs
 - kaiser: load_new_mm_cr3() let SWITCH_USER_CR3 flush user
 - kaiser: PCID 0 for kernel and 128 for user
 - kaiser: x86_cr3_pcid_noflush and x86_cr3_pcid_user
 - kaiser: paranoid_entry pass cr3 need to paranoid_exit
 - kaiser: _pgd_alloc() without __GFP_REPEAT to avoid stalls
 - kaiser: fix unlikely error in alloc_ldt_struct()
 - kaiser: drop is_atomic arg to kaiser_pagetable_walk()
 Backported to 3.16:
 - Add missing #include in arch/x86/mm/kaiser.c
 - Use variable PEBS buffer size since we have "perf/x86/intel: Use PAGE_SIZE
   for PEBS buffer size on Core2"
 - Renumber X86_FEATURE_INVPCID_SINGLE to avoid collision
 - Adjust context]
Signed-off-by: Ben Hutchings
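
As a rough illustration of two of the follow-ups listed above ("name
that 0x1000 KAISER_SHADOW_PGD_OFFSET" and "PCID 0 for kernel and 128
for user"), the relationship between the kernel and user CR3 values
might be sketched in plain C as below.  This is a userspace sketch
under stated assumptions, not code from the patch: the sketch_* helpers
and SKETCH_* macros are hypothetical names, and it assumes the kernel
page-table root is aligned so that bit 12 of its address is clear, with
the shadow (user) root in the page that follows.

```c
#include <stdint.h>

#define KAISER_SHADOW_PGD_OFFSET 0x1000ULL /* named in the list above */
#define SKETCH_PCID_KERNEL       0ULL      /* hypothetical macro name */
#define SKETCH_PCID_USER         128ULL    /* hypothetical macro name */

/* Derive the user-mode CR3 value from the kernel-mode one. */
static inline uint64_t sketch_user_cr3(uint64_t kernel_cr3)
{
	return kernel_cr3 | KAISER_SHADOW_PGD_OFFSET | SKETCH_PCID_USER;
}

/* Recover the kernel-mode CR3 value on entry from user mode. */
static inline uint64_t sketch_kernel_cr3(uint64_t user_cr3)
{
	uint64_t user_bits = KAISER_SHADOW_PGD_OFFSET | SKETCH_PCID_USER;

	return (user_cr3 & ~user_bits) | SKETCH_PCID_KERNEL;
}
```

With a kernel root at 0x12346000, sketch_user_cr3() yields 0x12347080:
the next page, tagged with PCID 128.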

mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function

2018-01-01T20:51:48+00:00

commit 1dd2bfc86818ddbc95f98e312e7704350223fd7d upstream.

pfn_to_section_nr() and section_nr_to_pfn() are defined as macros.
pfn_to_section_nr() has no issue even if it is defined as a macro.  But
section_nr_to_pfn() has an overflow issue if sec is defined as int.

section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT.  If sec is
defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
value.

__remove_section() calculates start_pfn using section_nr_to_pfn() and
scn_nr defined as int.  So if the hot-removed memory address is over 16TB,
an overflow occurs and section_nr_to_pfn() does not calculate the correct
pfn.

To make callers use proper arg, the patch changes the macros to inline
functions.
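
The difference between the macro and the inline function can be
reproduced in userspace with the minimal sketch below.  It assumes
PFN_SECTION_SHIFT is 15 (x86_64 with 128 MiB sections and 4 KiB pages),
and it uses uint32_t in place of int so that the wrap-around is well
defined in C.  The names SECTION_NR_TO_PFN_MACRO,
section_nr_to_pfn_inline and section_at_16tib are hypothetical.

```c
#include <stdint.h>

/* Assumed value: SECTION_SIZE_BITS 27 minus PAGE_SHIFT 12. */
#define PFN_SECTION_SHIFT 15

/* Macro form: the shift is evaluated in the type of the argument. */
#define SECTION_NR_TO_PFN_MACRO(sec) ((sec) << PFN_SECTION_SHIFT)

/* Inline form: the argument is widened to 64 bits before the shift. */
static inline uint64_t section_nr_to_pfn_inline(uint64_t sec)
{
	return sec << PFN_SECTION_SHIFT;
}

/* First section at the 16 TiB boundary: 2^44 / 2^27 = 2^17. */
static inline uint32_t section_at_16tib(void)
{
	return UINT32_C(1) << 17;
}
```

For the section at 16 TiB the macro computes 2^17 << 15 in 32 bits and
wraps to 0, while the inline function returns 2^32 as intended.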

Fixes: 815121d2b5cd ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.com
Signed-off-by: Yasuaki Ishimatsu 
Acked-by: Michal Hocko 
Cc: Xishi Qiu 
Cc: Reza Arbab 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

mm: page_alloc: use unsigned int for order in more places

2014-06-04T23:54:09+00:00

X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type is used for page order.
This converts a number of sites in mm/page_alloc.c to use unsigned int for
order where possible.

Signed-off-by: Mel Gorman 
Acked-by: Rik van Riel 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: reduce number of times page_to_pfn is called

2014-06-04T23:54:09+00:00

In the free path we calculate page_to_pfn multiple times. Reduce that.

Signed-off-by: Mel Gorman 
Acked-by: Rik van Riel 
Cc: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: use word-based accesses for get/set pageblock bitmaps

2014-06-04T23:54:09+00:00

The test_bit operations in get/set pageblock flags are expensive.  This
patch reads the bitmap on a word basis and uses shifts and masks to isolate
the bits of interest.  Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.
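
The word-based scheme can be sketched in userspace with C11 atomics, as
below.  This is an illustration of the technique only, not the kernel
code: sketch_get_bits() and sketch_set_bits() are hypothetical names,
atomic_compare_exchange_weak() stands in for the kernel's cmpxchg, and
the bit layout within the word is assumed.

```c
#include <stdatomic.h>

/* Read the bits selected by mask at bit offset bitidx from one word. */
static unsigned long sketch_get_bits(_Atomic unsigned long *word,
				     unsigned int bitidx,
				     unsigned long mask)
{
	return (atomic_load(word) >> bitidx) & mask;
}

/* Replace those bits, retrying if the word changed in the meantime. */
static void sketch_set_bits(_Atomic unsigned long *word,
			    unsigned int bitidx, unsigned long mask,
			    unsigned long flags)
{
	unsigned long old = atomic_load(word);
	unsigned long new;

	do {
		/* Build the updated word from a local copy. */
		new = (old & ~(mask << bitidx)) |
		      ((flags & mask) << bitidx);
		/* On failure, old is refreshed with the current word. */
	} while (!atomic_compare_exchange_weak(word, &old, new));
}
```

Because each reader takes the whole word in a single load and each
writer publishes the whole word in a single compare-and-exchange, a
field is never observed half-updated and an update to one field cannot
silently overwrite a concurrent update to another.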

In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.

In addition to the performance benefits, this patch closes races that are
possible between:

a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
   reads part of the bits before and another part of the bits after
   set_pageblock_migratetype() has updated them.

b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
   read-modify-update set bit operation in set_pageblock_skip() will cause
   lost updates to some bits changed in the set_pageblock_migratetype().

Joonsoo Kim first reported the case a) via code inspection.  Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock).  Furthermore during development of
unrelated compaction patches, it was observed that with frequent calls to
{start,undo}_isolate_page_range() the race occurs several thousands of
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g.  list_move().  Further debugging confirmed that
migratetype had invalid value of 6, causing out of bounds access to the
free_list array.

That confirmed that the race exists, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove.  Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower than MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.

After this patch, the race has no longer been observed in testing.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Reported-by: Joonsoo Kim 
Reported-and-tested-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Jan Kara 
Cc: Michal Hocko 
Cc: Hugh Dickins 
Cc: Dave Hansen 
Cc: Theodore Ts'o 
Cc: "Paul E. McKenney" 
Cc: Oleg Nesterov 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, compaction: add per-zone migration pfn cache for async compaction

2014-06-04T23:54:06+00:00

Each zone has a cached migration scanner pfn for memory compaction so that
subsequent calls to memory compaction can start where the previous call
left off.

Currently, the compaction migration scanner only updates the per-zone
cached pfn when pageblocks were not skipped for async compaction.  This
creates a dependency on calling sync compaction to keep subsequent
calls to async compaction from scanning an enormous amount of non-MOVABLE
pageblocks each time it is called.  On large machines, this could be
potentially very expensive.

This patch adds a per-zone cached migration scanner pfn only for async
compaction.  It is updated every time a pageblock has been scanned in its
entirety and when no pages from it were successfully isolated.  The cached
migration scanner pfn for sync compaction is updated only when called for
sync compaction.
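
The update rule described above might look roughly like the sketch
below.  The struct, its field names and the function name are
hypothetical, and the "only move forward" comparison is an assumption
made for the illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-zone cached scanner positions. */
struct sketch_zone {
	unsigned long cached_migrate_pfn_async;
	unsigned long cached_migrate_pfn_sync;
};

/*
 * Called after the pageblock ending at pfn was scanned in its entirety
 * and no pages from it were isolated.
 */
static void sketch_update_cached_pfn(struct sketch_zone *zone,
				     unsigned long pfn, bool sync)
{
	/* The async cache advances for both kinds of compaction. */
	if (pfn > zone->cached_migrate_pfn_async)
		zone->cached_migrate_pfn_async = pfn;

	/* The sync cache advances only when called for sync compaction. */
	if (sync && pfn > zone->cached_migrate_pfn_sync)
		zone->cached_migrate_pfn_sync = pfn;
}
```

An async run can then resume from its own cached position without
depending on a later sync run to move the scanner forward.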

Signed-off-by: David Rientjes 
Acked-by: Vlastimil Babka 
Reviewed-by: Naoya Horiguchi 
Cc: Greg Thelen 
Cc: Mel Gorman 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mem-hotplug: implement get/put_online_mems

2014-06-04T23:53:59+00:00

kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug.  To protect against cpu hotplug, these functions use
{get,put}_online_cpus.  However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.

What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex.  As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus.  That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.

[ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it by
  myself, because it used an rw semaphore for get/put_online_mems,
  making them deadlock prone.  ]

This patch (of 2):

{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently.  Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.

This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e.  executing code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.

lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
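
The semantics described above (many concurrent get/put sections, with
hotplug waiting until they have drained) can be sketched as a userspace
analogue using pthreads.  This is not the kernel implementation: all
sketch_* names are hypothetical, and the refcount-plus-condition-variable
design is an assumption chosen to keep the example short.

```c
#include <pthread.h>

/* Hypothetical userspace stand-in for the hotplug synchronisation. */
struct sketch_hotplug {
	pthread_mutex_t lock;
	pthread_cond_t no_readers;
	int refcount;		/* number of active get/put sections */
};

#define SKETCH_HOTPLUG_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Enter a section during which the online state must stay stable. */
static void sketch_get_online(struct sketch_hotplug *h)
{
	pthread_mutex_lock(&h->lock);
	h->refcount++;
	pthread_mutex_unlock(&h->lock);
}

/* Leave the section; wake a waiting hotplug operation if we were last. */
static void sketch_put_online(struct sketch_hotplug *h)
{
	pthread_mutex_lock(&h->lock);
	if (--h->refcount == 0)
		pthread_cond_signal(&h->no_readers);
	pthread_mutex_unlock(&h->lock);
}

/* Hotplug side: wait for sections to drain, then keep new ones out. */
static void sketch_hotplug_begin(struct sketch_hotplug *h)
{
	pthread_mutex_lock(&h->lock);
	while (h->refcount > 0)
		pthread_cond_wait(&h->no_readers, &h->lock);
	/* Returns with the lock held, so sketch_get_online() blocks. */
}

/* Hotplug side: the online state is stable again; let readers in. */
static void sketch_hotplug_done(struct sketch_hotplug *h)
{
	pthread_mutex_unlock(&h->lock);
}
```

Readers hold the mutex only long enough to adjust the count, so threads
that just need a stable view can overlap with each other, which a plain
mutex around the whole section would not allow.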

Signed-off-by: Vladimir Davydov 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: Tang Chen 
Cc: Zhang Yanfei 
Cc: Toshi Kani 
Cc: Xishi Qiu 
Cc: Jiang Liu 
Cc: Rafael J. Wysocki 
Cc: David Rientjes 
Cc: Wen Congyang 
Cc: Yasuaki Ishimatsu 
Cc: Lai Jiangshan 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: page_alloc: do not cache reclaim distances

2014-06-04T23:53:59+00:00

pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
by zone_reclaim due to its distance.  As it is expected that
zone_reclaim_mode will be rarely enabled it is unreasonable for all
machines to take a penalty.  Fortunately, the zone_reclaim_mode() path
is already slow and it is the path that takes the hit.
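
Checking the distance on demand, instead of caching the answer per
node, might be sketched as below.  The distance table, the node count
and all sketch_* names are hypothetical, and the threshold of 30 is an
assumed value for the illustration.

```c
#include <stdbool.h>

#define SKETCH_NR_NODES         4
#define SKETCH_RECLAIM_DISTANCE 30	/* assumed threshold */

/* Hypothetical NUMA distance table: two pairs of nearby nodes. */
static const int sketch_distance[SKETCH_NR_NODES][SKETCH_NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Computed in the reclaim path each time; nothing is cached. */
static bool sketch_allows_reclaim(int local_nid, int remote_nid)
{
	return sketch_distance[local_nid][remote_nid] <
	       SKETCH_RECLAIM_DISTANCE;
}
```

The lookup costs a little on every reclaim attempt, but machines that
never enable zone reclaim no longer pay for maintaining a cached mask.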

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Reviewed-by: Zhang Yanfei 
Acked-by: Michal Hocko 
Reviewed-by: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: keep page cache radix tree nodes in check

2014-04-03T23:21:01+00:00

Previously, page cache radix tree nodes were freed after reclaim emptied
out their page pointers.  But now reclaim stores shadow entries in their
place, which are only reclaimed when the inodes themselves are
reclaimed.  This is problematic for bigger files that are still in use
after they have a significant amount of their cache reclaimed, without
any of those pages actually refaulting.  The shadow entries will just
sit there and waste memory.  In the worst case, the shadow entries will
accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
rather than global because we expect the radix tree nodes themselves to
be allocated node-locally and we want to reduce cross-node references of
otherwise independent cache workloads.  A simple shrinker will then
reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.
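
Item 2 above can be illustrated with a small bit-packing sketch.  It
assumes 64 slots per node, so the regular entry count (0..64) fits in
the low 7 bits and the shadow count can live in the bits above.  The
SKETCH_* macros, struct sketch_node and the helper names are
hypothetical.

```c
#include <stdbool.h>

#define SKETCH_MAP_SHIFT   6				/* 64 slots */
#define SKETCH_COUNT_SHIFT (SKETCH_MAP_SHIFT + 1)	/* holds 0..64 */
#define SKETCH_COUNT_MASK  ((1U << SKETCH_COUNT_SHIFT) - 1)

/* Hypothetical node carrying both counters in one unsigned int. */
struct sketch_node {
	unsigned int count;
};

static unsigned int sketch_node_pages(const struct sketch_node *n)
{
	return n->count & SKETCH_COUNT_MASK;
}

static unsigned int sketch_node_shadows(const struct sketch_node *n)
{
	return n->count >> SKETCH_COUNT_SHIFT;
}

static void sketch_node_pages_inc(struct sketch_node *n)
{
	n->count++;
}

static void sketch_node_pages_dec(struct sketch_node *n)
{
	n->count--;
}

static void sketch_node_shadows_inc(struct sketch_node *n)
{
	n->count += 1U << SKETCH_COUNT_SHIFT;
}

/* A node holding only shadow entries is a candidate for the LRU. */
static bool sketch_node_only_shadows(const struct sketch_node *n)
{
	return sketch_node_shadows(n) && !sketch_node_pages(n);
}
```

Both counters share one word, so the check for "only shadow entries
left" is a pair of mask and shift operations on data already in hand.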

[akpm@linux-foundation.org: export the right function]
Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
Cc: Andrea Arcangeli 
Cc: Bob Liu 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Greg Thelen 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: KOSAKI Motohiro 
Cc: Luigi Semenzato 
Cc: Mel Gorman 
Cc: Metin Doslu 
Cc: Michel Lespinasse 
Cc: Ozgun Erdogan 
Cc: Peter Zijlstra 
Cc: Roman Gushchin 
Cc: Ryan Mallon 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: thrash detection-based file cache sizing

2014-04-03T23:21:01+00:00

The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have shown
to benefit from caching in the past.  We call the recently used [...] but
ultimately was not significantly better than a FIFO policy and still
thrashed cache based on eviction speed, rather than actual demand for
cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size.  By maintaining
a history of recently evicted file pages it can detect frequently used
pages with an arbitrarily small inactive list size, and subsequently
apply pressure on the active list based on actual demand for cache, not
just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot.  On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.
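
The snapshot-and-compare idea can be sketched as follows.  The struct
and function names are hypothetical, the counter here advances only on
eviction, and comparing the distance against the size of the active
list is an assumption made for the illustration rather than something
stated in the text above.

```c
#include <stdbool.h>

/* Hypothetical per-zone state for the sketch. */
struct sketch_zone_age {
	unsigned long inactive_age;	/* advances as pages are evicted */
	unsigned long nr_active;	/* pages on the active list */
};

/* On eviction: return the snapshot to store in the emptied slot. */
static unsigned long sketch_evict(struct sketch_zone_age *z)
{
	return z->inactive_age++;
}

/* On refault: decide from the stored snapshot whether to activate. */
static bool sketch_refault_activate(const struct sketch_zone_age *z,
				    unsigned long snapshot)
{
	/* Unsigned subtraction stays correct if the counter wraps. */
	unsigned long distance = z->inactive_age - snapshot;

	return distance <= z->nr_active;
}
```

A page that refaults soon after eviction has a small distance and is
activated; one that returns after much of the cache has turned over is
treated like a fresh page.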

This fixes the VM's blindness towards working set changes in excess of
the inactive list.  And it's the foundation to further improve the
protection ability and reduce the minimum inactive list size of 50%.

Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
Reviewed-by: Bob Liu 
Cc: Andrea Arcangeli 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Greg Thelen 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: KOSAKI Motohiro 
Cc: Luigi Semenzato 
Cc: Mel Gorman 
Cc: Metin Doslu 
Cc: Michel Lespinasse 
Cc: Ozgun Erdogan 
Cc: Peter Zijlstra 
Cc: Roman Gushchin 
Cc: Ryan Mallon 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds