<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/mips/include/asm/cpu-features.h, branch v4.19</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>MIPS: Hardcode cpu_has_* where known at compile time due to ISA</title>
<updated>2018-07-24T21:09:13+00:00</updated>
<author>
<name>Paul Burton</name>
<email>paul.burton@mips.com</email>
</author>
<published>2017-06-12T18:54:23+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=93e01942a6eb1ce730b42158d0d6c6531fe29ca0'/>
<id>93e01942a6eb1ce730b42158d0d6c6531fe29ca0</id>
<content type='text'>
Many architectural features have over time moved from being optional to
either be required or removed by newer architecture releases. This means
that in many cases we can know at compile time whether a feature will be
supported or not purely due to the knowledge we have about the ISA the
kernel build is targeting.

This patch introduces a bunch of utility macros for checking for
supported options, ASEs &amp; combinations of those with ISA revisions. It
then makes use of these in the default definitions of cpu_has_* macros.
The result is that many of the macros become compile-time constant,
allowing more optimisation opportunities for the compiler - particularly
with kernels built for later ISA revisions.

To demonstrate the effect of this patch, the following table shows the
size in bytes of the kernel binary as reported by scripts/bloat-o-meter
for v4.12-rc4 maltasmvp_defconfig kernels with &amp; without this patch. A
variant of maltasmvp_defconfig with CONFIG_CPU_MIPS32_R6 selected is
also shown, to demonstrate that MIPSr6 systems benefit more due to extra
features becoming required by that architecture revision. Builds of
pistachio_defconfig are also shown, as although this is a MIPSr2
platform it doesn't hardcode any features in a machine-specific
cpu-feature-overrides.h, which allows it to gain more from this patch
than the equivalent Malta r2 build.

     Config         | Before  | After   |  Change
    ----------------|---------|---------|---------
     maltasmvp      | 7248316 | 7247714 |    -602
     maltasmvp + r6 | 6955595 | 6950777 |   -4818
     pistachio      | 8650977 | 8363898 | -287079

Signed-off-by: Paul Burton &lt;paul.burton@mips.com&gt;
Patchwork: https://patchwork.linux-mips.org/patch/16360/
Cc: Joshua Kinard &lt;kumba@gentoo.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: linux-mips@linux-mips.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Many architectural features have over time moved from being optional to
either be required or removed by newer architecture releases. This means
that in many cases we can know at compile time whether a feature will be
supported or not purely due to the knowledge we have about the ISA the
kernel build is targeting.

This patch introduces a bunch of utility macros for checking for
supported options, ASEs &amp; combinations of those with ISA revisions. It
then makes use of these in the default definitions of cpu_has_* macros.
The result is that many of the macros become compile-time constant,
allowing more optimisation opportunities for the compiler - particularly
with kernels built for later ISA revisions.

To demonstrate the effect of this patch, the following table shows the
size in bytes of the kernel binary as reported by scripts/bloat-o-meter
for v4.12-rc4 maltasmvp_defconfig kernels with &amp; without this patch. A
variant of maltasmvp_defconfig with CONFIG_CPU_MIPS32_R6 selected is
also shown, to demonstrate that MIPSr6 systems benefit more due to extra
features becoming required by that architecture revision. Builds of
pistachio_defconfig are also shown, as although this is a MIPSr2
platform it doesn't hardcode any features in a machine-specific
cpu-feature-overrides.h, which allows it to gain more from this patch
than the equivalent Malta r2 build.

     Config         | Before  | After   |  Change
    ----------------|---------|---------|---------
     maltasmvp      | 7248316 | 7247714 |    -602
     maltasmvp + r6 | 6955595 | 6950777 |   -4818
     pistachio      | 8650977 | 8363898 | -287079

Signed-off-by: Paul Burton &lt;paul.burton@mips.com&gt;
Patchwork: https://patchwork.linux-mips.org/patch/16360/
Cc: Joshua Kinard &lt;kumba@gentoo.org&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: linux-mips@linux-mips.org
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: perf: More robustly probe for the presence of per-tc counters</title>
<updated>2018-05-15T14:16:16+00:00</updated>
<author>
<name>Matt Redfearn</name>
<email>matt.redfearn@mips.com</email>
</author>
<published>2018-04-20T10:23:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=800fb71281ca2ed5c8a7299e10ebc0de2f61cdda'/>
<id>800fb71281ca2ed5c8a7299e10ebc0de2f61cdda</id>
<content type='text'>
The presence of per TC performance counters is now detected by
cpu-probe.c and indicated by MIPS_CPU_MT_PER_TC_PERF_COUNTERS in
cpu_data. Switch detection of the feature to use this new flag rather
than blindly testing the implementation specific config7 register with a
magic number.

Signed-off-by: Matt Redfearn &lt;matt.redfearn@mips.com&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Florian Fainelli &lt;f.fainelli@gmail.com&gt;
Cc: Maciej W. Rozycki &lt;macro@mips.com&gt;
Cc: Paul Burton &lt;paul.burton@mips.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Robert Richter &lt;rric@kernel.org&gt;
Cc: linux-mips@linux-mips.org
Cc: oprofile-list@lists.sf.net
Patchwork: https://patchwork.linux-mips.org/patch/19142/
Signed-off-by: James Hogan &lt;jhogan@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The presence of per TC performance counters is now detected by
cpu-probe.c and indicated by MIPS_CPU_MT_PER_TC_PERF_COUNTERS in
cpu_data. Switch detection of the feature to use this new flag rather
than blindly testing the implementation specific config7 register with a
magic number.

Signed-off-by: Matt Redfearn &lt;matt.redfearn@mips.com&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Florian Fainelli &lt;f.fainelli@gmail.com&gt;
Cc: Maciej W. Rozycki &lt;macro@mips.com&gt;
Cc: Paul Burton &lt;paul.burton@mips.com&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Cc: Alexander Shishkin &lt;alexander.shishkin@linux.intel.com&gt;
Cc: Jiri Olsa &lt;jolsa@redhat.com&gt;
Cc: Namhyung Kim &lt;namhyung@kernel.org&gt;
Cc: Robert Richter &lt;rric@kernel.org&gt;
Cc: linux-mips@linux-mips.org
Cc: oprofile-list@lists.sf.net
Patchwork: https://patchwork.linux-mips.org/patch/19142/
Signed-off-by: James Hogan &lt;jhogan@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: cpu-features.h: Replace __mips_isa_rev with MIPS_ISA_REV</title>
<updated>2018-03-09T11:22:46+00:00</updated>
<author>
<name>Matt Redfearn</name>
<email>matt.redfearn@mips.com</email>
</author>
<published>2018-02-26T17:02:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=18ba210a29d08ea96025cb9d19c2eebf65846330'/>
<id>18ba210a29d08ea96025cb9d19c2eebf65846330</id>
<content type='text'>
Remove the need to check that __mips_isa_rev is defined by using the
newly added MIPS_ISA_REV.

Signed-off-by: Matt Redfearn &lt;matt.redfearn@mips.com&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paul Burton &lt;paul.burton@mips.com&gt;
Cc: "Maciej W. Rozycki" &lt;macro@mips.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/18675/
Signed-off-by: James Hogan &lt;jhogan@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Remove the need to check that __mips_isa_rev is defined by using the
newly added MIPS_ISA_REV.

Signed-off-by: Matt Redfearn &lt;matt.redfearn@mips.com&gt;
Cc: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paul Burton &lt;paul.burton@mips.com&gt;
Cc: "Maciej W. Rozycki" &lt;macro@mips.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/18675/
Signed-off-by: James Hogan &lt;jhogan@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: Introduce cpu_tcache_line_size</title>
<updated>2017-08-07T22:02:27+00:00</updated>
<author>
<name>Matt Redfearn</name>
<email>matt.redfearn@imgtec.com</email>
</author>
<published>2017-07-26T07:41:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=21da5332327b6d183bd93336ecf29c70bc609b7b'/>
<id>21da5332327b6d183bd93336ecf29c70bc609b7b</id>
<content type='text'>
There exist macros to return the cache line size of the L1 dcache and L2
scache but there is currently no macro for the L3 tcache. Add this macro
which will be used by the following patch "MIPS: PCI: Fix
smp_processor_id() in preemptible"

Signed-off-by: Matt Redfearn &lt;matt.redfearn@imgtec.com&gt;
Cc: Maciej W. Rozycki &lt;macro@imgtec.com&gt;
Cc: James Hogan &lt;james.hogan@imgtec.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/16871/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There exist macros to return the cache line size of the L1 dcache and L2
scache but there is currently no macro for the L3 tcache. Add this macro
which will be used by the following patch "MIPS: PCI: Fix
smp_processor_id() in preemptible"

Signed-off-by: Matt Redfearn &lt;matt.redfearn@imgtec.com&gt;
Cc: Maciej W. Rozycki &lt;macro@imgtec.com&gt;
Cc: James Hogan &lt;james.hogan@imgtec.com&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/16871/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: MIPS16e2: Identify ASE presence</title>
<updated>2017-07-05T12:06:44+00:00</updated>
<author>
<name>Maciej W. Rozycki</name>
<email>macro@imgtec.com</email>
</author>
<published>2017-05-23T12:37:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8d1630f13754f1435d3ea7078829121c52f38d15'/>
<id>8d1630f13754f1435d3ea7078829121c52f38d15</id>
<content type='text'>
Identify the presence of the MIPS16e2 ASE as per the architecture
specification[1], by checking for CP0 Config5.CA2 bit being 1[2].

References:

[1] "MIPS32 Architecture for Programmers: MIPS16e2 Application-Specific
    Extension Technical Reference Manual", Imagination Technologies
    Ltd., Document Number: MD01172, Revision 01.00, April 26, 2016,
    Section 1.2 "Software Detection of the ASE", p. 5

[2] "MIPS32 interAptiv Multiprocessing System Software User's Manual",
    Imagination Technologies Ltd., Document Number: MD00904, Revision
    02.01, June 15, 2016, Section 2.2.1.6 "Device Configuration 5 --
    Config5 (CP0 Register 16, Select 5)", pp. 71-72

Signed-off-by: Maciej W. Rozycki &lt;macro@imgtec.com&gt;
Reviewed-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16094/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Identify the presence of the MIPS16e2 ASE as per the architecture
specification[1], by checking for CP0 Config5.CA2 bit being 1[2].

References:

[1] "MIPS32 Architecture for Programmers: MIPS16e2 Application-Specific
    Extension Technical Reference Manual", Imagination Technologies
    Ltd., Document Number: MD01172, Revision 01.00, April 26, 2016,
    Section 1.2 "Software Detection of the ASE", p. 5

[2] "MIPS32 interAptiv Multiprocessing System Software User's Manual",
    Imagination Technologies Ltd., Document Number: MD00904, Revision
    02.01, June 15, 2016, Section 2.2.1.6 "Device Configuration 5 --
    Config5 (CP0 Register 16, Select 5)", pp. 71-72

Signed-off-by: Maciej W. Rozycki &lt;macro@imgtec.com&gt;
Reviewed-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16094/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: Add CPU shared FTLB feature detection</title>
<updated>2017-06-29T00:42:29+00:00</updated>
<author>
<name>Paul Burton</name>
<email>paul.burton@imgtec.com</email>
</author>
<published>2017-06-02T22:38:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=e7bc8557428f069eaa613b3676ea6931c0f7fe43'/>
<id>e7bc8557428f069eaa613b3676ea6931c0f7fe43</id>
<content type='text'>
Some systems share FTLB RAMs or entries between sibling CPUs (ie.
hardware threads, or VP(E)s, within a core). These properties require
kernel handling in various places. As a start this patch introduces
cpu_has_shared_ftlb_ram &amp; cpu_has_shared_ftlb_entries feature macros
which we set appropriately for I6400 &amp; I6500 CPUs. Further patches will
make use of these macros as appropriate.

Signed-off-by: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16202/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Some systems share FTLB RAMs or entries between sibling CPUs (ie.
hardware threads, or VP(E)s, within a core). These properties require
kernel handling in various places. As a start this patch introduces
cpu_has_shared_ftlb_ram &amp; cpu_has_shared_ftlb_entries feature macros
which we set appropriately for I6400 &amp; I6500 CPUs. Further patches will
make use of these macros as appropriate.

Signed-off-by: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16202/
Signed-off-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: Probe guest MVH</title>
<updated>2017-03-28T13:49:15+00:00</updated>
<author>
<name>James Hogan</name>
<email>james.hogan@imgtec.com</email>
</author>
<published>2017-03-14T10:15:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a929bdc52be658fd622c2d0f9fdf1d333aea34fd'/>
<id>a929bdc52be658fd622c2d0f9fdf1d333aea34fd</id>
<content type='text'>
Probe for availablility of M{T,F}HC0 instructions used with e.g. XPA in
the VZ guest context, and make it available via cpu_guest_has_mvh. This
will be helpful in properly emulating the MAAR registers in KVM for MIPS
VZ.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Probe for availablility of M{T,F}HC0 instructions used with e.g. XPA in
the VZ guest context, and make it available via cpu_guest_has_mvh. This
will be helpful in properly emulating the MAAR registers in KVM for MIPS
VZ.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: Probe guest CP0_UserLocal</title>
<updated>2017-03-28T13:49:11+00:00</updated>
<author>
<name>James Hogan</name>
<email>james.hogan@imgtec.com</email>
</author>
<published>2017-03-14T10:15:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a7c7ad6c3e0a2ae88625df4e41c1c7305aa26f04'/>
<id>a7c7ad6c3e0a2ae88625df4e41c1c7305aa26f04</id>
<content type='text'>
Probe for presence of guest CP0_UserLocal register and expose via
cpu_guest_has_userlocal. This register is optional pre-r6, so this will
allow KVM to only save/restore/expose the guest CP0_UserLocal register
if it exists.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Probe for presence of guest CP0_UserLocal register and expose via
cpu_guest_has_userlocal. This register is optional pre-r6, so this will
allow KVM to only save/restore/expose the guest CP0_UserLocal register
if it exists.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>MIPS: Add defs &amp; probing of UFR</title>
<updated>2017-03-28T13:48:53+00:00</updated>
<author>
<name>James Hogan</name>
<email>james.hogan@imgtec.com</email>
</author>
<published>2017-03-14T10:15:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4e87580e6d144e44820a0d23e756136b6218f4f0'/>
<id>4e87580e6d144e44820a0d23e756136b6218f4f0</id>
<content type='text'>
Add definitions and probing of the UFR bit in Config5. This bit allows
user mode control of the FR bit (floating point register mode). It is
present if the UFRP bit is set in the floating point implementation
register.

This is a capability KVM may want to expose to guest kernels, even
though Linux is unlikely to ever use it due to the implications for
multi-threaded programs.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Add definitions and probing of the UFR bit in Config5. This bit allows
user mode control of the FR bit (floating point register mode). It is
present if the UFRP bit is set in the floating point implementation
register.

This is a capability KVM may want to expose to guest kernels, even
though Linux is unlikely to ever use it due to the implications for
multi-threaded programs.

Signed-off-by: James Hogan &lt;james.hogan@imgtec.com&gt;
Acked-by: Ralf Baechle &lt;ralf@linux-mips.org&gt;
Cc: Paul Burton &lt;paul.burton@imgtec.com&gt;
Cc: Paolo Bonzini &lt;pbonzini@redhat.com&gt;
Cc: "Radim Krčmář" &lt;rkrcmar@redhat.com&gt;
Cc: linux-mips@linux-mips.org
Cc: kvm@vger.kernel.org
</pre>
</div>
</content>
</entry>
<entry>
<title>lib/GCD.c: use binary GCD algorithm instead of Euclidean</title>
<updated>2016-05-21T00:58:30+00:00</updated>
<author>
<name>Zhaoxiu Zeng</name>
<email>zhaoxiu.zeng@gmail.com</email>
</author>
<published>2016-05-21T00:03:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=fff7fb0b2d908dec779783d8eaf3d7725230f75e'/>
<id>fff7fb0b2d908dec779783d8eaf3d7725230f75e</id>
<content type='text'>
The binary GCD algorithm is based on the following facts:
	1. If a and b are all evens, then gcd(a,b) = 2 * gcd(a/2, b/2)
	2. If a is even and b is odd, then gcd(a,b) = gcd(a/2, b)
	3. If a and b are all odds, then gcd(a,b) = gcd((a-b)/2, b) = gcd((a+b)/2, b)

Even on x86 machines with reasonable division hardware, the binary
algorithm runs about 25% faster (80% the execution time) than the
division-based Euclidian algorithm.

On platforms like Alpha and ARMv6 where division is a function call to
emulation code, it's even more significant.

There are two variants of the code here, depending on whether a fast
__ffs (find least significant set bit) instruction is available.  This
allows the unpredictable branches in the bit-at-a-time shifting loop to
be eliminated.

If fast __ffs is not available, the "even/odd" GCD variant is used.

I use the following code to benchmark:

	#include &lt;stdio.h&gt;
	#include &lt;stdlib.h&gt;
	#include &lt;stdint.h&gt;
	#include &lt;string.h&gt;
	#include &lt;time.h&gt;
	#include &lt;unistd.h&gt;

	#define swap(a, b) \
		do { \
			a ^= b; \
			b ^= a; \
			a ^= b; \
		} while (0)

	unsigned long gcd0(unsigned long a, unsigned long b)
	{
		unsigned long r;

		if (a &lt; b) {
			swap(a, b);
		}

		if (b == 0)
			return a;

		while ((r = a % b) != 0) {
			a = b;
			b = r;
		}

		return b;
	}

	unsigned long gcd1(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		b &gt;&gt;= __builtin_ctzl(b);

		for (;;) {
			a &gt;&gt;= __builtin_ctzl(a);
			if (a == b)
				return a &lt;&lt; __builtin_ctzl(r);

			if (a &lt; b)
				swap(a, b);
			a -= b;
		}
	}

	unsigned long gcd2(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		r &amp;= -r;

		while (!(b &amp; r))
			b &gt;&gt;= 1;

		for (;;) {
			while (!(a &amp; r))
				a &gt;&gt;= 1;
			if (a == b)
				return a;

			if (a &lt; b)
				swap(a, b);
			a -= b;
			a &gt;&gt;= 1;
			if (a &amp; r)
				a += b;
			a &gt;&gt;= 1;
		}
	}

	unsigned long gcd3(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		b &gt;&gt;= __builtin_ctzl(b);
		if (b == 1)
			return r &amp; -r;

		for (;;) {
			a &gt;&gt;= __builtin_ctzl(a);
			if (a == 1)
				return r &amp; -r;
			if (a == b)
				return a &lt;&lt; __builtin_ctzl(r);

			if (a &lt; b)
				swap(a, b);
			a -= b;
		}
	}

	unsigned long gcd4(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		r &amp;= -r;

		while (!(b &amp; r))
			b &gt;&gt;= 1;
		if (b == r)
			return r;

		for (;;) {
			while (!(a &amp; r))
				a &gt;&gt;= 1;
			if (a == r)
				return r;
			if (a == b)
				return a;

			if (a &lt; b)
				swap(a, b);
			a -= b;
			a &gt;&gt;= 1;
			if (a &amp; r)
				a += b;
			a &gt;&gt;= 1;
		}
	}

	static unsigned long (*gcd_func[])(unsigned long a, unsigned long b) = {
		gcd0, gcd1, gcd2, gcd3, gcd4,
	};

	#define TEST_ENTRIES (sizeof(gcd_func) / sizeof(gcd_func[0]))

	#if defined(__x86_64__)

	#define rdtscll(val) do { \
		unsigned long __a,__d; \
		__asm__ __volatile__("rdtsc" : "=a" (__a), "=d" (__d)); \
		(val) = ((unsigned long long)__a) | (((unsigned long long)__d)&lt;&lt;32); \
	} while(0)

	static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long),
								unsigned long a, unsigned long b, unsigned long *res)
	{
		unsigned long long start, end;
		unsigned long long ret;
		unsigned long gcd_res;

		rdtscll(start);
		gcd_res = gcd(a, b);
		rdtscll(end);

		if (end &gt;= start)
			ret = end - start;
		else
			ret = ~0ULL - start + 1 + end;

		*res = gcd_res;
		return ret;
	}

	#else

	static inline struct timespec read_time(void)
	{
		struct timespec time;
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;time);
		return time;
	}

	static inline unsigned long long diff_time(struct timespec start, struct timespec end)
	{
		struct timespec temp;

		if ((end.tv_nsec - start.tv_nsec) &lt; 0) {
			temp.tv_sec = end.tv_sec - start.tv_sec - 1;
			temp.tv_nsec = 1000000000ULL + end.tv_nsec - start.tv_nsec;
		} else {
			temp.tv_sec = end.tv_sec - start.tv_sec;
			temp.tv_nsec = end.tv_nsec - start.tv_nsec;
		}

		return temp.tv_sec * 1000000000ULL + temp.tv_nsec;
	}

	static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long),
								unsigned long a, unsigned long b, unsigned long *res)
	{
		struct timespec start, end;
		unsigned long gcd_res;

		start = read_time();
		gcd_res = gcd(a, b);
		end = read_time();

		*res = gcd_res;
		return diff_time(start, end);
	}

	#endif

	static inline unsigned long get_rand()
	{
		if (sizeof(long) == 8)
			return (unsigned long)rand() &lt;&lt; 32 | rand();
		else
			return rand();
	}

	int main(int argc, char **argv)
	{
		unsigned int seed = time(0);
		int loops = 100;
		int repeats = 1000;
		unsigned long (*res)[TEST_ENTRIES];
		unsigned long long elapsed[TEST_ENTRIES];
		int i, j, k;

		for (;;) {
			int opt = getopt(argc, argv, "n:r:s:");
			/* End condition always first */
			if (opt == -1)
				break;

			switch (opt) {
			case 'n':
				loops = atoi(optarg);
				break;
			case 'r':
				repeats = atoi(optarg);
				break;
			case 's':
				seed = strtoul(optarg, NULL, 10);
				break;
			default:
				/* You won't actually get here. */
				break;
			}
		}

		res = malloc(sizeof(unsigned long) * TEST_ENTRIES * loops);
		memset(elapsed, 0, sizeof(elapsed));

		srand(seed);
		for (j = 0; j &lt; loops; j++) {
			unsigned long a = get_rand();
			/* Do we have args? */
			unsigned long b = argc &gt; optind ? strtoul(argv[optind], NULL, 10) : get_rand();
			unsigned long long min_elapsed[TEST_ENTRIES];
			for (k = 0; k &lt; repeats; k++) {
				for (i = 0; i &lt; TEST_ENTRIES; i++) {
					unsigned long long tmp = benchmark_gcd_func(gcd_func[i], a, b, &amp;res[j][i]);
					if (k == 0 || min_elapsed[i] &gt; tmp)
						min_elapsed[i] = tmp;
				}
			}
			for (i = 0; i &lt; TEST_ENTRIES; i++)
				elapsed[i] += min_elapsed[i];
		}

		for (i = 0; i &lt; TEST_ENTRIES; i++)
			printf("gcd%d: elapsed %llu\n", i, elapsed[i]);

		k = 0;
		srand(seed);
		for (j = 0; j &lt; loops; j++) {
			unsigned long a = get_rand();
			unsigned long b = argc &gt; optind ? strtoul(argv[optind], NULL, 10) : get_rand();
			for (i = 1; i &lt; TEST_ENTRIES; i++) {
				if (res[j][i] != res[j][0])
					break;
			}
			if (i &lt; TEST_ENTRIES) {
				if (k == 0) {
					k = 1;
					fprintf(stderr, "Error:\n");
				}
				fprintf(stderr, "gcd(%lu, %lu): ", a, b);
				for (i = 0; i &lt; TEST_ENTRIES; i++)
					fprintf(stderr, "%ld%s", res[j][i], i &lt; TEST_ENTRIES - 1 ? ", " : "\n");
			}
		}

		if (k == 0)
			fprintf(stderr, "PASS\n");

		free(res);

		return 0;
	}

Compiled with "-O2", on "VirtualBox 4.4.0-22-generic #38-Ubuntu x86_64" got:

  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 10174
  gcd1: elapsed 2120
  gcd2: elapsed 2902
  gcd3: elapsed 2039
  gcd4: elapsed 2812
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9309
  gcd1: elapsed 2280
  gcd2: elapsed 2822
  gcd3: elapsed 2217
  gcd4: elapsed 2710
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9589
  gcd1: elapsed 2098
  gcd2: elapsed 2815
  gcd3: elapsed 2030
  gcd4: elapsed 2718
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9914
  gcd1: elapsed 2309
  gcd2: elapsed 2779
  gcd3: elapsed 2228
  gcd4: elapsed 2709
  PASS

[akpm@linux-foundation.org: avoid #defining a CONFIG_ variable]
Signed-off-by: Zhaoxiu Zeng &lt;zhaoxiu.zeng@gmail.com&gt;
Signed-off-by: George Spelvin &lt;linux@horizon.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The binary GCD algorithm is based on the following facts:
	1. If a and b are all evens, then gcd(a,b) = 2 * gcd(a/2, b/2)
	2. If a is even and b is odd, then gcd(a,b) = gcd(a/2, b)
	3. If a and b are all odds, then gcd(a,b) = gcd((a-b)/2, b) = gcd((a+b)/2, b)

Even on x86 machines with reasonable division hardware, the binary
algorithm runs about 25% faster (80% the execution time) than the
division-based Euclidian algorithm.

On platforms like Alpha and ARMv6 where division is a function call to
emulation code, it's even more significant.

There are two variants of the code here, depending on whether a fast
__ffs (find least significant set bit) instruction is available.  This
allows the unpredictable branches in the bit-at-a-time shifting loop to
be eliminated.

If fast __ffs is not available, the "even/odd" GCD variant is used.

I use the following code to benchmark:

	#include &lt;stdio.h&gt;
	#include &lt;stdlib.h&gt;
	#include &lt;stdint.h&gt;
	#include &lt;string.h&gt;
	#include &lt;time.h&gt;
	#include &lt;unistd.h&gt;

	#define swap(a, b) \
		do { \
			a ^= b; \
			b ^= a; \
			a ^= b; \
		} while (0)

	unsigned long gcd0(unsigned long a, unsigned long b)
	{
		unsigned long r;

		if (a &lt; b) {
			swap(a, b);
		}

		if (b == 0)
			return a;

		while ((r = a % b) != 0) {
			a = b;
			b = r;
		}

		return b;
	}

	unsigned long gcd1(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		b &gt;&gt;= __builtin_ctzl(b);

		for (;;) {
			a &gt;&gt;= __builtin_ctzl(a);
			if (a == b)
				return a &lt;&lt; __builtin_ctzl(r);

			if (a &lt; b)
				swap(a, b);
			a -= b;
		}
	}

	unsigned long gcd2(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		r &amp;= -r;

		while (!(b &amp; r))
			b &gt;&gt;= 1;

		for (;;) {
			while (!(a &amp; r))
				a &gt;&gt;= 1;
			if (a == b)
				return a;

			if (a &lt; b)
				swap(a, b);
			a -= b;
			a &gt;&gt;= 1;
			if (a &amp; r)
				a += b;
			a &gt;&gt;= 1;
		}
	}

	unsigned long gcd3(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		b &gt;&gt;= __builtin_ctzl(b);
		if (b == 1)
			return r &amp; -r;

		for (;;) {
			a &gt;&gt;= __builtin_ctzl(a);
			if (a == 1)
				return r &amp; -r;
			if (a == b)
				return a &lt;&lt; __builtin_ctzl(r);

			if (a &lt; b)
				swap(a, b);
			a -= b;
		}
	}

	unsigned long gcd4(unsigned long a, unsigned long b)
	{
		unsigned long r = a | b;

		if (!a || !b)
			return r;

		r &amp;= -r;

		while (!(b &amp; r))
			b &gt;&gt;= 1;
		if (b == r)
			return r;

		for (;;) {
			while (!(a &amp; r))
				a &gt;&gt;= 1;
			if (a == r)
				return r;
			if (a == b)
				return a;

			if (a &lt; b)
				swap(a, b);
			a -= b;
			a &gt;&gt;= 1;
			if (a &amp; r)
				a += b;
			a &gt;&gt;= 1;
		}
	}

	static unsigned long (*gcd_func[])(unsigned long a, unsigned long b) = {
		gcd0, gcd1, gcd2, gcd3, gcd4,
	};

	#define TEST_ENTRIES (sizeof(gcd_func) / sizeof(gcd_func[0]))

	#if defined(__x86_64__)

	#define rdtscll(val) do { \
		unsigned long __a,__d; \
		__asm__ __volatile__("rdtsc" : "=a" (__a), "=d" (__d)); \
		(val) = ((unsigned long long)__a) | (((unsigned long long)__d)&lt;&lt;32); \
	} while(0)

	static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long),
								unsigned long a, unsigned long b, unsigned long *res)
	{
		unsigned long long start, end;
		unsigned long long ret;
		unsigned long gcd_res;

		rdtscll(start);
		gcd_res = gcd(a, b);
		rdtscll(end);

		if (end &gt;= start)
			ret = end - start;
		else
			ret = ~0ULL - start + 1 + end;

		*res = gcd_res;
		return ret;
	}

	#else

	static inline struct timespec read_time(void)
	{
		struct timespec time;
		clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &amp;time);
		return time;
	}

	static inline unsigned long long diff_time(struct timespec start, struct timespec end)
	{
		struct timespec temp;

		if ((end.tv_nsec - start.tv_nsec) &lt; 0) {
			temp.tv_sec = end.tv_sec - start.tv_sec - 1;
			temp.tv_nsec = 1000000000ULL + end.tv_nsec - start.tv_nsec;
		} else {
			temp.tv_sec = end.tv_sec - start.tv_sec;
			temp.tv_nsec = end.tv_nsec - start.tv_nsec;
		}

		return temp.tv_sec * 1000000000ULL + temp.tv_nsec;
	}

	static unsigned long long benchmark_gcd_func(unsigned long (*gcd)(unsigned long, unsigned long),
								unsigned long a, unsigned long b, unsigned long *res)
	{
		struct timespec start, end;
		unsigned long gcd_res;

		start = read_time();
		gcd_res = gcd(a, b);
		end = read_time();

		*res = gcd_res;
		return diff_time(start, end);
	}

	#endif

	static inline unsigned long get_rand()
	{
		if (sizeof(long) == 8)
			return (unsigned long)rand() &lt;&lt; 32 | rand();
		else
			return rand();
	}

	int main(int argc, char **argv)
	{
		unsigned int seed = time(0);
		int loops = 100;
		int repeats = 1000;
		unsigned long (*res)[TEST_ENTRIES];
		unsigned long long elapsed[TEST_ENTRIES];
		int i, j, k;

		for (;;) {
			int opt = getopt(argc, argv, "n:r:s:");
			/* End condition always first */
			if (opt == -1)
				break;

			switch (opt) {
			case 'n':
				loops = atoi(optarg);
				break;
			case 'r':
				repeats = atoi(optarg);
				break;
			case 's':
				seed = strtoul(optarg, NULL, 10);
				break;
			default:
				/* You won't actually get here. */
				break;
			}
		}

		res = malloc(sizeof(unsigned long) * TEST_ENTRIES * loops);
		memset(elapsed, 0, sizeof(elapsed));

		srand(seed);
		for (j = 0; j &lt; loops; j++) {
			unsigned long a = get_rand();
			/* Do we have args? */
			unsigned long b = argc &gt; optind ? strtoul(argv[optind], NULL, 10) : get_rand();
			unsigned long long min_elapsed[TEST_ENTRIES];
			for (k = 0; k &lt; repeats; k++) {
				for (i = 0; i &lt; TEST_ENTRIES; i++) {
					unsigned long long tmp = benchmark_gcd_func(gcd_func[i], a, b, &amp;res[j][i]);
					if (k == 0 || min_elapsed[i] &gt; tmp)
						min_elapsed[i] = tmp;
				}
			}
			for (i = 0; i &lt; TEST_ENTRIES; i++)
				elapsed[i] += min_elapsed[i];
		}

		for (i = 0; i &lt; TEST_ENTRIES; i++)
			printf("gcd%d: elapsed %llu\n", i, elapsed[i]);

		k = 0;
		srand(seed);
		for (j = 0; j &lt; loops; j++) {
			unsigned long a = get_rand();
			unsigned long b = argc &gt; optind ? strtoul(argv[optind], NULL, 10) : get_rand();
			for (i = 1; i &lt; TEST_ENTRIES; i++) {
				if (res[j][i] != res[j][0])
					break;
			}
			if (i &lt; TEST_ENTRIES) {
				if (k == 0) {
					k = 1;
					fprintf(stderr, "Error:\n");
				}
				fprintf(stderr, "gcd(%lu, %lu): ", a, b);
				for (i = 0; i &lt; TEST_ENTRIES; i++)
					fprintf(stderr, "%ld%s", res[j][i], i &lt; TEST_ENTRIES - 1 ? ", " : "\n");
			}
		}

		if (k == 0)
			fprintf(stderr, "PASS\n");

		free(res);

		return 0;
	}

Compiled with "-O2", on "VirtualBox 4.4.0-22-generic #38-Ubuntu x86_64" got:

  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 10174
  gcd1: elapsed 2120
  gcd2: elapsed 2902
  gcd3: elapsed 2039
  gcd4: elapsed 2812
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9309
  gcd1: elapsed 2280
  gcd2: elapsed 2822
  gcd3: elapsed 2217
  gcd4: elapsed 2710
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9589
  gcd1: elapsed 2098
  gcd2: elapsed 2815
  gcd3: elapsed 2030
  gcd4: elapsed 2718
  PASS
  zhaoxiuzeng@zhaoxiuzeng-VirtualBox:~/develop$ ./gcd -r 500000 -n 10
  gcd0: elapsed 9914
  gcd1: elapsed 2309
  gcd2: elapsed 2779
  gcd3: elapsed 2228
  gcd4: elapsed 2709
  PASS

[akpm@linux-foundation.org: avoid #defining a CONFIG_ variable]
Signed-off-by: Zhaoxiu Zeng &lt;zhaoxiu.zeng@gmail.com&gt;
Signed-off-by: George Spelvin &lt;linux@horizon.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
