bionic: squashed optimizations/fixes from Jim Huang
*Commit 1 of 9*
Use GCC's __attribute__((const)) to reduce code size
__attribute__((const)) tells GCC that a function's return value depends
only on its arguments and that the function has no side effects.
With the attribute in place, the compiler can call the function just
once, cache the return value, and reuse it for later calls with the
same arguments. This yields a code size reduction.
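A minimal sketch of the effect (square() is an illustrative function,
not one from the patch):

    #include <stdio.h>

    /* A 'const' function: its result depends only on its arguments and
     * it reads no global state, so GCC may fold repeated calls into one. */
    __attribute__((const)) static int square(int x)
    {
        return x * x;
    }

    int main(void)
    {
        /* GCC can compute square(7) once and reuse the cached value
         * for both arguments, shrinking the emitted code. */
        printf("%d %d\n", square(7), square(7));
        return 0;
    }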
Reference results from arm-eabi-size on the crespo device:
[before]
text data bss dec hex filename
267715 10132 45948 323795 4f0d3
[after]
text data bss dec hex filename
267387 10132 45948 323467 4ef8b
Change-Id: I1d80465c0f88158449702d4dc6398a130eb77195
*Commit 2 of 9*
res_send: Avoid spurious close()s and (rare) failure
When looping over the current list of sockets we are connected to, use
getpeername(), not getsockname(), to find out who the remote end is;
getsockname() reports the socket's local address, not the nameserver it
is connected to. This change avoids spurious close() calls and a rare
failure. This is ISC bug #18625, fixed in libbind 6.0.
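A minimal sketch of the distinction (peer_matches() is a hypothetical
helper, not the actual res_send code): getsockname() fills in our own
local address, while getpeername() fills in the remote end's.

    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Return 1 if the connected socket 'fd' is talking to nameserver
     * 'ns', 0 otherwise. */
    static int peer_matches(int fd, const struct sockaddr_in *ns)
    {
        struct sockaddr_in peer;
        socklen_t len = sizeof(peer);

        /* getpeername() reports the REMOTE address; getsockname()
         * would report our own local address and never match. */
        if (getpeername(fd, (struct sockaddr *)&peer, &len) < 0)
            return 0;
        return peer.sin_addr.s_addr == ns->sin_addr.s_addr &&
               peer.sin_port == ns->sin_port;
    }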
Change-Id: I5e85f9ff4b98c237978e4bf4bd85ba0a90d768e6
*Commit 3 of 9*
sha1: code cleanup and use modern C syntax
Apply the following changes:
- Remove out-of-date workaround (SPARC64_GCC_WORKAROUND)
- Use C99 prototypes and stdint types (a sketch follows this list)
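The signatures below are a plausible illustration of that second item,
not necessarily the exact bionic declarations:

    #include <stdint.h>

    /* Old style: K&R parameter declarations and platform-dependent types.
     *
     *   void SHA1Transform(state, buffer)
     *       u_int32_t state[5];
     *       const unsigned char buffer[64];
     *   { ... }
     */

    /* C99 style: a real prototype built on fixed-width stdint types. */
    void SHA1Transform(uint32_t state[5], const uint8_t buffer[64]);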
Change-Id: I630cf97f6824f72f4165e0fa9e5bfdad8edabe48
*Commit 4 of 9*
sha1: Use bswap* to optimize byte order
bionic libc already uses the ARMv6+ rev/rev16 instructions for endian
conversion, and this patch rewrites parts of the SHA1 implementation in
terms of the swap32 and swap64 routines, which improves performance.
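A sketch of the idea, using the GCC builtin as a stand-in for bionic's
swap32():

    #include <stdint.h>

    /* Portable byte-order flip: shifts, masks, and ORs. */
    static uint32_t swap32_generic(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000ff00) |
               ((x << 8) & 0x00ff0000) | (x << 24);
    }

    /* With the builtin, GCC emits a single 'rev' instruction on ARMv6+,
     * which is the effect the swap32()/swap64() routines rely on. */
    static uint32_t swap32_fast(uint32_t x)
    {
        return __builtin_bswap32(x);
    }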
Reference sha1bench results on a Nexus S:
[before]
Rounds: 100000, size: 6250K, time: 1.183s, speed: 5.16 MB/s
[after]
Rounds: 100000, size: 6250K, time: 1.025s, speed: 5.957 MB/s
Change-Id: Id04c0fa1467b3006b5a8736cbdd95855ed7c13e4
*Commit 5 of 9*
linker: optimize SysV ELF hash function
This change avoids one operation per iteration of the inner loop.
Inspired by glibc.
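A sketch of the optimization, assuming it follows the well-known glibc
variant of the SysV hash (the two forms below produce identical hashes):

    /* Classic SysV ELF hash: the branch and the 'h &= ~g' are extra
     * work in every trip through the inner loop. */
    static unsigned elf_hash_classic(const unsigned char *name)
    {
        unsigned h = 0, g;
        while (*name) {
            h = (h << 4) + *name++;
            g = h & 0xf0000000;
            if (g)
                h ^= g >> 24;
            h &= ~g;
        }
        return h;
    }

    /* glibc-inspired form: XORing with g clears the same top four bits
     * (h and g agree on them by construction), so the branch and the
     * AND-NOT disappear. */
    static unsigned elf_hash_fast(const unsigned char *name)
    {
        unsigned h = 0, g;
        while (*name) {
            h = (h << 4) + *name++;
            g = h & 0xf0000000;
            h ^= g;
            h ^= g >> 24;
        }
        return h;
    }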
Change-Id: I3f641c086654809574289fa6eba0ee1d32e79aa3
*Commit 6 of 9*
Add ARMv7 optimized strlen()
Merge the ARM-optimized strlen() routine from Linaro. Although it is
tuned for the ARM Cortex-A9, it is still noticeably faster than the
original routine on Cortex-A8 machines.
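The merged routine is hand-written ARM assembly; the following is only
a rough C sketch of the word-at-a-time scan it builds on, not the
Linaro code itself:

    #include <stddef.h>
    #include <stdint.h>

    static size_t strlen_words(const char *s)
    {
        const char *p = s;

        /* Walk byte by byte until p is word-aligned. */
        while ((uintptr_t)p % sizeof(uint32_t) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }

        /* (w - 0x01010101) & ~w & 0x80808080 is nonzero iff some byte
         * of w is zero: the classic "has a zero byte" bit trick. */
        const uint32_t *wp = (const uint32_t *)(const void *)p;
        while ((((*wp - 0x01010101u) & ~*wp) & 0x80808080u) == 0)
            wp++;

        /* A NUL byte lives in this word; locate it exactly. */
        p = (const char *)wp;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }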
Reference benchmark on Nexus S (ARM Cortex-A8; 1 GHz):
[before]
prc thr usecs/call samples errors cnt/samp size
strlen_1k 1 1 1.31712 97 0 1000 1024
[after]
prc thr usecs/call samples errors cnt/samp size
strlen_1k 1 1 1.05855 96 0 1000 1024
Change-Id: I809928804726620f399510af1cd1c852ed754403
*Commit 7 of 9*
fix ARMv7 optimized strlen() usage condition (author: nadlabak)
Change-Id: Ia2ab059b092f80c02d95ca95d3062954c0ad1023
*Commit 8 of 9*
memmove: Fix the abuse of memcpy() for overlapping regions
memcpy() has undefined behavior when the source and destination
regions overlap, so memmove() must not blindly forward such calls to it.
Original author: Chris Dearman <chris@mips.com>
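A short illustration of why the distinction matters: only memmove() is
defined for overlapping buffers, since memcpy() may read bytes it has
already overwritten.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[16] = "abcdef";

        /* Shift the string right by two inside the same buffer: source
         * (buf) and destination (buf + 2) overlap, so memmove() is
         * required; memcpy() here would be undefined behavior. */
        memmove(buf + 2, buf, strlen(buf) + 1);
        printf("%s\n", buf + 2);   /* prints "abcdef" */
        return 0;
    }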
Change-Id: Icc2acc860c932eaf1df488630146f4e07388a444
*Commit 9 of 9*
memcmp: prefetch optimizing for ARM Cortex-A8/A9
The original memcmp() was tuned for the ARM9, which is suboptimal for
ARM Cortex-A cores. This patch merges the prefetch optimizations from
ST-Ericsson and removes the NEON slowdowns.
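The bionic routine is ARM assembly using PLD; the C sketch below only
illustrates the idea via __builtin_prefetch(), and the 64-byte
lookahead is an assumed value, not bionic's tuned offset.

    #include <stddef.h>

    static int memcmp_prefetch(const void *a, const void *b, size_t n)
    {
        const unsigned char *p = a, *q = b;

        for (size_t i = 0; i < n; i++) {
            /* Once per 32-byte chunk, hint the hardware to pull the
             * data one cache line ahead of the comparison. */
            if ((i & 31) == 0) {
                __builtin_prefetch(p + i + 64, 0, 0);
                __builtin_prefetch(q + i + 64, 0, 0);
            }
            if (p[i] != q[i])
                return p[i] - q[i];
        }
        return 0;
    }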
Reference experiment results on a Nexus S (ARM Cortex-A8; 1 GHz) using
the strbench program:
http://pasky.or.cz//dev/glibc/strbench/
[before]
size, samples, TIMES[s] (user, system, total)
4 262144 2.510000 0.000000 2.510000
8 131072 1.570000 0.010000 1.590000
32 32768 1.310000 0.000000 1.320000
[after]
size, samples, TIMES[s] (user, system, total)
4 262144 2.280000 0.000000 2.290000
8 131072 1.210000 0.000000 1.220000
32 32768 1.040000 0.000000 1.050000
Change-Id: I961847da96d2025f7049773cd2ddaa08579e78d6
diff --git a/libc/string/memmove.c b/libc/string/memmove.c
index 072104b..7c1e9b2 100644
--- a/libc/string/memmove.c
+++ b/libc/string/memmove.c
@@ -32,10 +32,10 @@
{
const char *p = src;
char *q = dst;
- /* We can use the optimized memcpy if the destination is below the
- * source (i.e. q < p), or if it is completely over it (i.e. q >= p+n).
+ /* We can use the optimized memcpy if the destination is completely below the
+ * source (i.e. q+n <= p), or if it is completely over it (i.e. q >= p+n).
*/
- if (__builtin_expect((q < p) || ((size_t)(q - p) >= n), 1)) {
+ if (__builtin_expect((q + n < p) || (q >= p + n), 1)) {
return memcpy(dst, src, n);
} else {
bcopy(src, dst, n);