net: output path optimizations

1) Avoid dirtying neighbour's confirmed field.

  TCP workloads hits this cache line for each incoming ACK.
  Lets write n->confirmed only if there is a jiffie change.

2) Optimize neigh_hh_output() for the common Ethernet case, were
   hh_len is less than 16 bytes. Replace the memcpy() call
   by two inlined 64bit load/stores on x86_64.

Bench results using udpflood test, with -C option (MSG_CONFIRM flag
added to sendto(), to reproduce the n->confirmed dirtying on UDP)

24 threads doing 1.000.000 UDP sendto() on dummy device, 4 runs.

before : 2.247s, 2.235s, 2.247s, 2.318s
after  : 1.884s, 1.905s, 1.891s, 1.895s

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/dst.h b/include/net/dst.h
index baf5978..77f52f7 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -396,11 +396,15 @@
 static inline int dst_neigh_output(struct dst_entry *dst, struct neighbour *n,
 				   struct sk_buff *skb)
 {
-	struct hh_cache *hh;
+	const struct hh_cache *hh;
 
-	if (unlikely(dst->pending_confirm)) {
-		n->confirmed = jiffies;
+	if (dst->pending_confirm) {
+		unsigned long now = jiffies;
+
 		dst->pending_confirm = 0;
+		/* avoid dirtying neighbour */
+		if (n->confirmed != now)
+			n->confirmed = now;
 	}
 
 	hh = &n->hh;