refactor code to improve neon code

we want to make sure we don't transfer data from the
neon unit to the arm register file, as this can be quite
slow. instead we do all the calculation on the neon side
and write the result directly to main memory.

Change-Id: Ibb56664d3ab03098ae2798b75e2b6927ac900187
2 files changed