I once participated in an elliptic curve cryptography cracking project hosted by Robert Harley. On the Alpha-based computer I used for that effort, his code ran faster when compiled with -O2 than with -O3. My guess is that the more aggressive speed optimizations, such as loop unrolling and function inlining, made the code larger, so the duplicated code pushed useful instructions out of the instruction cache and they had to be fetched again from slower memory. You could try compiling with -O2 and see whether your code runs faster.
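One way to check this on your own machine is to build the program twice and time both binaries. The following is only a minimal sketch of such a comparison: the helper names `best_time` and `compare` are made up for illustration, and it measures plain wall-clock time, so it cannot tell you whether the instruction cache is actually the cause.

```python
import subprocess
import time


def best_time(cmd, repeats=5):
    """Run cmd `repeats` times and return the fastest wall-clock time in seconds.

    Taking the minimum reduces noise from other processes on the machine.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the program exits with an error.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(commands, repeats=5):
    """Return a dict mapping each label to the best time of its command."""
    return {label: best_time(cmd, repeats) for label, cmd in commands.items()}
```

Assuming you have built two hypothetical binaries beforehand, for example with `gcc -O2 -o prog_O2 prog.c` and `gcc -O3 -o prog_O3 prog.c`, you could call `compare({"-O2": ["./prog_O2"], "-O3": ["./prog_O3"]})` and see which label has the smaller time. Use a workload that runs for at least a few seconds, otherwise process startup cost will dominate the measurement.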