linux.git/arch/x86/lib, branch v2.6.37

x86, mem: Optimize memmove for small size and unaligned cases

2010-09-25T01:57:11+00:00

The movs instruction combines data to accelerate moves, but there are
two cases where it performs poorly:

1. The movs instruction has a long startup latency, so for small
   sizes we copy the data with general mov instructions instead.
2. The movs instruction handles unaligned data poorly, even when the
   src offset is 0x10 and the dest offset is 0x0, so we avoid it and
   handle that case with general mov instructions as well.
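
As an illustration only (the patch itself is written in assembly and is
not reproduced here), the idea of moving a small block with general
register loads and stores can be sketched in C. The helper name
move_16_to_32 and the 16..32 byte range are assumptions chosen for the
example, not taken from the patch. All loads are issued before any
store, so the sketch stays correct when the buffers overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: move n bytes (16 <= n <= 32) between buffers that
 * may overlap, using four 8-byte loads followed by four 8-byte stores.
 * The fixed-size memcpy() calls are the portable C idiom for unaligned
 * 8-byte loads and stores; compilers typically turn each into one mov.
 */
static void move_16_to_32(void *dst, const void *src, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    uint64_t head_lo, head_hi, tail_lo, tail_hi;

    assert(n >= 16 && n <= 32);

    /* Load the first 16 and the last 16 bytes before storing anything. */
    memcpy(&head_lo, s, 8);
    memcpy(&head_hi, s + 8, 8);
    memcpy(&tail_lo, s + n - 16, 8);
    memcpy(&tail_hi, s + n - 8, 8);

    /* The head and tail windows together cover every byte in [0, n). */
    memcpy(d, &head_lo, 8);
    memcpy(d + 8, &head_hi, 8);
    memcpy(d + n - 16, &tail_lo, 8);
    memcpy(d + n - 8, &tail_hi, 8);
}
```

For sizes strictly between 16 and 32 the two windows overlap in the
middle, which is harmless: the overlapping bytes are written twice with
the same value.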

Signed-off-by: Ma Ling 
LKML-Reference: <1284664360-6138-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin

x86, mem: Optimize memcpy by avoiding memory false dependece

2010-08-23T21:56:41+00:00

All read operations after the allocation stage can run speculatively,
while all write operations run in program order. If the addresses
differ, a read may run before an older write; otherwise it must wait
until the write commits. However, the CPU does not check every address
bit, so a read can fail to recognize that two addresses differ even
when they are in different pages. For example, if rsi is 0xf004 and
rdi is 0xe008, the following sequence incurs a large performance
penalty:
1. movq (%rsi),	%rax
2. movq %rax,	(%rdi)
3. movq 8(%rsi), %rax
4. movq %rax,	8(%rdi)

If %rsi and %rdi really were in the same memory page, there would be a
TRUE read-after-write dependence, because instruction 2 writes 0x008
and instruction 3 reads 0x00c, so the two accesses partially overlap.
In fact they are in different pages and there is no conflict, but
without checking every address bit the CPU may assume they are in the
same page. Instruction 3 then has to wait for instruction 2 to write
its data from the write buffer into the cache before loading it from
the cache, which costs the read as much time as an mfence instruction.
We can avoid this by reordering the operations as follows:

1. movq 8(%rsi), %rax
2. movq %rax,	8(%rdi)
3. movq (%rsi),	%rax
4. movq %rax,	(%rdi)

Instruction 3 reads 0x004 and instruction 2 writes 0x010, so there is
no dependence. On Core2 this gives a 1.83x speedup over the original
instruction sequence. In this patch we first handle small sizes (less
than 20 bytes), then jump to different copy modes. In our
micro-benchmark we saw up to 2X improvement for small sizes from 1 to
127 bytes, and up to 1.5X improvement for 1024 bytes on Core i7. (We
used our own micro-benchmark and will do further testing according to
your requirements.)
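
The address arithmetic in the example can be checked with a small C
model. This is a sketch under an explicit assumption: it treats the CPU
as comparing only the offset within a 4 KiB page (the low 12 bits),
which matches the "same page" description above but is not a statement
of how any particular CPU is implemented. The name may_falsely_alias
is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for illustration: only the 4 KiB page offset is compared. */
#define PAGE_OFFSET_MASK 0xfffu

/*
 * Return true if a load and an older store of `size` bytes each would
 * look overlapping when only their page offsets are compared.
 * Accesses that wrap across a page boundary are ignored in this model.
 */
static bool may_falsely_alias(uint64_t load_addr, uint64_t store_addr,
                              unsigned size)
{
    uint64_t l = load_addr & PAGE_OFFSET_MASK;
    uint64_t s = store_addr & PAGE_OFFSET_MASK;

    assert(size > 0);
    /* Standard interval-overlap test on [l, l+size) and [s, s+size). */
    return l < s + size && s < l + size;
}
```

With rsi = 0xf004 and rdi = 0xe008, the original order pairs the store
to 0xe008 with the load from 0xf00c, whose page offsets 0x008 and 0x00c
overlap. The reordered sequence pairs the store to 0xe010 with the load
from 0xf004, whose offsets 0x010 and 0x004 do not.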

Signed-off-by: Ma Ling 
LKML-Reference: <1277753065-18610-1-git-send-email-ling.ma@intel.com>
Signed-off-by: H. Peter Anvin

x86, mem: Don't implement forward memmove() as memcpy()

2010-08-23T21:14:27+00:00

memmove() allows the source and destination to overlap, whereas
memcpy() has no such requirement.  Therefore, explicitly implement
memmove() in both the forward and backward directions, which gives us
the freedom to optimize memcpy().
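
A minimal C sketch of the two directions follows. It is a byte-at-a-time
illustration of the direction choice only, not the kernel's assembly
implementation, and the name memmove_sketch is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy forward when the destination starts at or below the source, and
 * backward otherwise. In both cases a source byte is read before the
 * copy can overwrite it, so overlapping buffers are handled correctly.
 */
static void *memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst && src));

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Forward: lowest address first. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Backward: highest address first. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

Because memmove() carries its own forward path, memcpy() no longer has
to preserve any ordering that an overlapping forward move would rely on.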

Signed-off-by: Ma Ling 
LKML-Reference: 
Signed-off-by: H. Peter Anvin

Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2010-08-13T17:35:48+00:00

* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, asm: Use a lower case name for the end macro in atomic64_386_32.S
  x86, asm: Refactor atomic64_386_32.S to support old binutils and be cleaner
  x86: Document __phys_reloc_hide() usage in __pa_symbol()
  x86, apic: Map the local apic when parsing the MP table.

x86, asm: Use a lower case name for the end macro in atomic64_386_32.S

2010-08-12T14:04:16+00:00

Use a lowercase name for the end macro, which somehow fixes a binutils 2.16
problem.

Signed-off-by: Luca Barbieri 
LKML-Reference: 
Signed-off-by: H. Peter Anvin

x86, asm: Refactor atomic64_386_32.S to support old binutils and be cleaner

2010-08-12T04:03:28+00:00

The old code didn't work on binutils 2.12 because setting a symbol to
a register apparently requires a fairly recent version.

This commit refactors the code to use the C preprocessor instead, and
in the process makes the whole code a bit easier to understand.

The object code produced is unchanged as expected.

This fixes kernel bugzilla 16506.

Reported-by: Dieter Stussy 
Signed-off-by: Luca Barbieri 
Signed-off-by: H. Peter Anvin 
Cc:  2.6.35
LKML-Reference:

Merge branch 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

2010-08-06T23:24:17+00:00

* 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, alternatives: BUG on encountering an invalid CPU feature number
  x86, alternatives: Fix one more open-coded 8-bit alternative number
  x86, alternatives: Use 16-bit numbers for cpufeature index

x86, asm: Merge cmpxchg_486_u64() and cmpxchg8b_emu()

2010-07-29T00:05:11+00:00

We have two functions doing exactly the same thing -- emulating
cmpxchg8b on 486 and older hardware -- but with different calling
conventions.  Drop the C version and use the assembly version, via
alternatives, for both the local and non-local versions of cmpxchg8b.
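
For readers unfamiliar with the operation being emulated, the
compare-and-exchange semantics on a 64-bit value can be sketched in C.
This shows only the logical behavior; the name cmpxchg8b_sketch is
hypothetical, and the sketch is not atomic as written. A real emulation
must also keep the compare and the store from being interrupted, and
how the kernel code does that is not described in this commit message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Logical behavior of a 64-bit compare-and-exchange: if *ptr equals
 * `expected`, store `desired`; in either case return the value that
 * was in *ptr beforehand. NOT atomic in this form.
 */
static uint64_t cmpxchg8b_sketch(uint64_t *ptr, uint64_t expected,
                                 uint64_t desired)
{
    uint64_t prev;

    assert(ptr != 0);
    prev = *ptr;
    if (prev == expected)
        *ptr = desired;
    return prev;
}
```

A caller detects success by comparing the return value with the
expected value it passed in.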

Signed-off-by: H. Peter Anvin 
LKML-Reference:

x86, asm: Move cmpxchg emulation code to arch/x86/lib

2010-07-28T23:53:49+00:00

Move cmpxchg emulation code from arch/x86/kernel/cpu (which is
otherwise CPU identification) to arch/x86/lib, where other emulation
code lives already.

Signed-off-by: H. Peter Anvin 
LKML-Reference:

x86, alternatives: Fix one more open-coded 8-bit alternative number

2010-07-13T21:56:16+00:00

Fix a missing case of an 8-bit alternative number, buried inside an
assembly macro.

Signed-off-by: H. Peter Anvin 
Reported-by: Yinghai Lu 
Cc: Suresh Siddha 
LKML-Reference: <4C3BDDA3.2060900@kernel.org>