bionic: squashed optimizations/fixes from Jim Huang

*Commit 1 of 9*
Use GCC's __attribute__((const)) to reduce code size

__attribute__((const)) tells GCC that a function has no side effects
and that its return value depends only on its arguments, so every call
with the same arguments is guaranteed to return the same value.

This lets the compiler evaluate such a function just once and reuse the
cached return value instead of emitting repeated calls, which reduces
code size.
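
For illustration, a minimal sketch (hypothetical function, not from the
bionic sources) of what the attribute allows:

    /* A const function: no side effects, result depends only on x. */
    static int __attribute__((const)) double_it(int x)
    {
        return x * 2;
    }

    int sum_twice(int x)
    {
        /* GCC may evaluate double_it(x) once and reuse the cached
         * result for the second call, shrinking the generated code. */
        return double_it(x) + double_it(x);
    }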

Here are the reference results from arm-eabi-size for the crespo device:

[before]
   text    data     bss     dec     hex filename
 267715   10132   45948  323795   4f0d3

[after]
   text    data     bss     dec     hex filename
 267387   10132   45948  323467   4ef8b

Change-Id: I1d80465c0f88158449702d4dc6398a130eb77195

*Commit 2 of 9*
res_send: Avoid spurious close()s and (rare) failure

When looping over the current list of sockets we are connected to,
use getpeername(), not getsockname(), to find out who the remote
end is.  This avoids spurious close() calls and a rare failure.

This is ISC bug #18625, fixed in libbind 6.0.
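
The distinction matters because getsockname() fills in our own local
address, which never matches the nameserver we are checking for.  A
hedged sketch of the corrected check (names are illustrative, not the
actual res_send code):

    #include <string.h>
    #include <sys/socket.h>

    /* Return nonzero if sock is already connected to the nameserver
     * address ns.  getpeername() reports the REMOTE endpoint; using
     * getsockname() here would compare against our local address,
     * never match, and trigger a spurious close()/reconnect. */
    static int connected_to(int sock, const struct sockaddr *ns,
                            socklen_t nslen)
    {
        struct sockaddr_storage peer;
        socklen_t len = sizeof(peer);

        if (getpeername(sock, (struct sockaddr *)&peer, &len) < 0)
            return 0;
        return len == nslen && memcmp(&peer, ns, nslen) == 0;
    }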

Change-Id: I5e85f9ff4b98c237978e4bf4bd85ba0a90d768e6

*Commit 3 of 9*
sha1: code cleanup and use modern C syntax

Apply the following changes:
- Remove the out-of-date workaround (SPARC64_GCC_WORKAROUND)
- Use C99 prototypes and stdint types (see the sketch below)
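
An illustrative before/after of the second point (signatures modeled on
the common NetBSD-derived SHA1 code, not copied from this diff):

    #include <stdint.h>

    /* Before: K&R-style definition with loosely sized types.
     *
     *   void SHA1Transform(state, buffer)
     *       unsigned long state[5];
     *       unsigned char buffer[64];
     *   { ... }
     */

    /* After: C99 prototype with exact-width stdint types. */
    void SHA1Transform(uint32_t state[5], const uint8_t buffer[64]);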

Change-Id: I630cf97f6824f72f4165e0fa9e5bfdad8edabe48

*Commit 4 of 9*
sha1: Use bswap* to optimize byte order

bionic libc already uses the ARMv6+ rev/rev16 instructions for endian
conversion.  This patch rewrites parts of the SHA1 implementation in
terms of the swap32 and swap64 routines so that those instructions are
used, which yields a measurable performance improvement.
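
The gain comes from how SHA1 consumes its input: the message block is
defined as big-endian 32-bit words, so on little-endian ARM every word
load needs a byte swap.  A sketch of the idea (helper name is
illustrative; bionic's swap32 compiles down to the same rev
instruction):

    #include <stdint.h>
    #include <string.h>

    /* Load one big-endian word of the message block.  On ARMv6+,
     * __builtin_bswap32() becomes a single rev instruction instead of
     * the portable four-load/shift/OR byte-by-byte sequence. */
    static inline uint32_t load_be32(const uint8_t *p)
    {
        uint32_t w;
        memcpy(&w, p, sizeof(w));  /* handles unaligned input */
        return __builtin_bswap32(w);
    }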

Reference sha1bench results on a Nexus S:

[before]
Rounds: 100000, size: 6250K, time: 1.183s, speed: 5.16  MB/s

[after]
Rounds: 100000, size: 6250K, time: 1.025s, speed: 5.957 MB/s

Change-Id: Id04c0fa1467b3006b5a8736cbdd95855ed7c13e4

*Commit 5 of 9*
linker: optimize SysV ELF hash function

This change, inspired by glibc, avoids one operation in the inner loop
of the hash function.
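
For reference, a sketch of both forms (function names here are
illustrative):

    /* Classic SysV ELF hash: clears the top nibble of h with an
     * extra AND every iteration. */
    unsigned elf_hash_classic(const unsigned char *name)
    {
        unsigned h = 0, g;
        while (*name) {
            h = (h << 4) + *name++;
            g = h & 0xf0000000;
            if (g)
                h ^= g >> 24;
            h &= ~g;
        }
        return h;
    }

    /* glibc-inspired form: g holds exactly the set bits of the top
     * nibble of h, so h ^= g clears them just like h &= ~g did, and
     * the separate AND (and its branch) disappears. */
    unsigned elf_hash_optimized(const unsigned char *name)
    {
        unsigned h = 0, g;
        while (*name) {
            h = (h << 4) + *name++;
            g = h & 0xf0000000;
            h ^= g;
            h ^= g >> 24;
        }
        return h;
    }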

Change-Id: I3f641c086654809574289fa6eba0ee1d32e79aa3

*Commit 6 of 9*
Add ARMv7 optimized strlen()

Merge the ARM-optimized strlen() routine from Linaro.  Although it is
tuned for the ARM Cortex-A9, it is still noticeably faster than the
original routine on Cortex-A8 machines.
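
The routine itself is ARM assembly, but its core trick can be sketched
in portable C: scan a word at a time and detect a zero byte without
testing each byte individually (simplified illustration only, not the
Linaro code):

    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_sketch(const char *s)
    {
        const char *p = s;

        /* Step to a 4-byte boundary one byte at a time. */
        while ((uintptr_t)p & 3) {
            if (*p == '\0')
                return p - s;
            p++;
        }

        /* (w - 0x01010101) & ~w & 0x80808080 is nonzero exactly
         * when some byte of w is zero, so this tests four bytes
         * per iteration. */
        for (;;) {
            uint32_t w = *(const uint32_t *)p;
            if ((w - 0x01010101u) & ~w & 0x80808080u)
                break;
            p += 4;
        }

        /* The zero byte is within this word; find it. */
        while (*p)
            p++;
        return p - s;
    }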

Reference benchmark on Nexus S (ARM Cortex-A8; 1 GHz):

[before]
             prc thr   usecs/call      samples   errors cnt/samp     size
strlen_1k      1   1      1.31712           97        0     1000     1024

[after]
             prc thr   usecs/call      samples   errors cnt/samp     size
strlen_1k      1   1      1.05855           96        0     1000     1024

Change-Id: I809928804726620f399510af1cd1c852ed754403

*Commit 7 of 9*
Fix the usage condition of the ARMv7-optimized strlen() (author: nadlabak)

Change-Id: Ia2ab059b092f80c02d95ca95d3062954c0ad1023

*Commit 8 of 9*
memmove: Fix the misuse of memcpy() for overlapping regions

The behavior of memcpy() is undefined when the source and destination
regions overlap (for example, dst == src + 1: a naive forward copy
would replicate the first byte instead of shifting the region), so
memmove() may only delegate to it when the regions are fully disjoint.

Original author: Chris Dearman <chris@mips.com>

Change-Id: Icc2acc860c932eaf1df488630146f4e07388a444

*Commit 9 of 9*
memcmp: prefetch optimizing for ARM Cortex-A8/A9

The original memcmp() was tuned for the ARM9, which is suboptimal for
ARM Cortex-A cores.  This patch merges the prefetch optimizations from
ST-Ericsson and removes NEON code paths that caused slowdowns.
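
A sketch of the prefetch idea in C (the committed routine is hand-tuned
ARM assembly issuing pld directly; memcmp_sketch and the 64-byte
distance are illustrative):

    #include <stddef.h>

    int memcmp_sketch(const void *a, const void *b, size_t n)
    {
        const unsigned char *p = a, *q = b;

        while (n--) {
            /* Hint the data one cache line ahead into L1 before it
             * is needed; on ARM this compiles to pld.  A real
             * implementation issues the hint once per cache line,
             * not once per byte. */
            __builtin_prefetch(p + 64);
            __builtin_prefetch(q + 64);
            if (*p != *q)
                return *p - *q;
            p++, q++;
        }
        return 0;
    }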

Reference experiment results on a Nexus S (ARM Cortex-A8; 1 GHz) using
the strbench program:
    http://pasky.or.cz//dev/glibc/strbench/

[before]
size, samples, TIME[s] (user, system, total)
   4   262144         2.510000 0.000000 2.510000
   8   131072         1.570000 0.010000 1.590000
  32    32768         1.310000 0.000000 1.320000

[after]
size, samples, TIME[s] (user, system, total)
   4   262144         2.280000 0.000000 2.290000
   8   131072         1.210000 0.000000 1.220000
  32    32768         1.040000 0.000000 1.050000

Change-Id: I961847da96d2025f7049773cd2ddaa08579e78d6
diff --git a/libc/string/memmove.c b/libc/string/memmove.c
index 072104b..7c1e9b2 100644
--- a/libc/string/memmove.c
+++ b/libc/string/memmove.c
@@ -32,10 +32,10 @@
 {
   const char *p = src;
   char *q = dst;
-  /* We can use the optimized memcpy if the destination is below the
-   * source (i.e. q < p), or if it is completely over it (i.e. q >= p+n).
+  /* We can use the optimized memcpy if the destination is completely below the
+   * source (i.e. q+n <= p), or if it is completely over it (i.e. q >= p+n).
    */
-  if (__builtin_expect((q < p) || ((size_t)(q - p) >= n), 1)) {
+  if (__builtin_expect((q + n < p) || (q >= p + n), 1)) {
     return memcpy(dst, src, n);
   } else {
     bcopy(src, dst, n);