more optimizations...

calculate the offsets from the phase differently, this happens
to reduce the register pressure in the main loop, which in turns
allows the compiler to generate much better code (doesn't need
to spill a lot of stuff on the stack).

this gives another 15% performance increase

Change-Id: I2ce3479dd48b9e6941adb80e6d443d6e14d64d96
2 files changed