Full Version: ARM4 vs ARM5 speed comparison?
OESF Portables Forum > General Forums > General Discussion
lardman
I've had a bit of a google and there are lots of people saying their Pocket PCs aren't running as fast as they could (because WinCE isn't/wasn't optimised for ARM5), but the question is really: why is this?

I note that the ARM5 chip has a longer pipeline, which means (so I read) that instruction execution order must be optimised differently. Does anyone know how much of a difference this makes?

Does anyone know of a definitive comparison of the relative speeds, and also of the relative speeds of code produced by GCC 3.x targeting one arch or the other?

Last but not least, I thought I'd compile dhrystone as ARM5 and ARM4 and do a bit of a test to see whether I can get some figures. Are there better comparison/benchmarking tools than this one?

Cheers,



Si
lardman
I can't get dhrystone to compile for some reason. Any tips from anyone?

Here's Whetstone though:

http://sgp.zaurii.net/whetstone_arm4.out
http://sgp.zaurii.net/whetstone_arm5.out

I've not tried it yet (don't have a link cable with me at uni).


Si

P.S. You might need to chmod +x as I put these up on the web from a Windows box.
lardman
For your delectation:

Note that the alleged ARM5 code ran without troubles on my sl5500 (which has an ARM4 SA processor - unless I got lucky somehow ;-))

Presumably the benchmarks here don't use any of the ARM5's added instructions (anyone know anything about this?).

CODE
C750, OZ3.3.6pre1 64-0
----------------------

Microseconds for one run through Dhrystone
        -O0       -O1       -O2       -O3
ARM5    3.8       1.9       1.8       1.7
ARM4    3.8       1.8       1.8       1.8

Dhrystones per Second
        -O0         -O1         -O2         -O3
ARM5    265957.4    538503.0    550660.8    574052.8
ARM4    264760.4    543478.3    553403.4    566893.4

VAX MIPS rating
        -O0       -O1       -O2       -O3
ARM5    151.37    306.49    313.41    326.723
ARM4    150.689   309.322   314.971   322.649

Whetstone
ARM5    Loops 1000, iterations 1, duration 89 sec, C Converted Double Precision Whetstones 1.1 MIPS
ARM4    Loops 1000, iterations 1, duration 89 sec, C Converted Double Precision Whetstones 1.1 MIPS


SL5500, OZ3.3.6pre1 64-0
------------------------

Microseconds for one run through Dhrystone
        -O0       -O1       -O2       -O3
ARM5    6.8       3.3       3.3       3.2
ARM4    6.8       3.3       3.3       3.2

Dhrystones per Second
        -O0         -O1         -O2         -O3
ARM5    148016.6    300030.0    301750.2    313087.0
ARM4    147776.0    299490.9    304228.8    314070.4

VAX MIPS rating
        -O0       -O1       -O2       -O3
ARM5    84.244    170.763   171.742   178.194
ARM4    84.107    170.456   173.152   178.754

Whetstone
ARM5    Loops 1000, iterations 1, duration 151 sec, C Converted Double Precision Whetstones 662.3 KIPS
ARM4    Loops 1000, iterations 1, duration 151 sec, C Converted Double Precision Whetstones 662.3 KIPS


P4 1.4GHz, 256MB RAM, Mandrake 9.2
----------------------------------

Microseconds for one run through Dhrystone
        -O0       -O1       -O2       -O3
        0.7       0.6       0.6       0.5

Dhrystones per Second
        -O0          -O1          -O2          -O3
        1342281.9    1769911.5    1798561.2    2155172.4

VAX MIPS rating
        -O0        -O1        -O2        -O3
        763.962    1007.349   1023.655   1226.621

Whetstone
Loops: 1000, Iterations: 1, Duration: 1 sec.
C Converted Double Precision Whetstones: 100.0 MIPS
ScottYelich
thanks!
lardman
No problem, not sure what, if anything, it proves though.

Basically ARM5 and ARM4 are pretty much the same: ARM5 slightly faster across the board on an XScale machine and about even on an SA machine (random!?). The big difference is in the optimisations, with even -O1 making a huge difference and the higher -O levels adding little more.

From this I'm fairly happy to continue making ARM4 binaries to use on everything. However there are some points to be made:

1. Perhaps these benchmarks haven't shown the true differences between the two targets (either because the extra instructions obviously haven't been employed, or because the optimised pipelining wasn't needed, etc.)?
2. Perhaps the optimisation in GCC3.x isn't as good as it could be. Any ideas?


Si
DrWowe
QUOTE
2. Perhaps the optimisation in GCC3.x isn't as good as it could be. Any ideas?


That's likely. I know for a fact that it's true on x86, where Linux programs compiled with the Intel compiler are normally faster than gcc binaries. There is plenty of evidence to support this.
mjalkut
The main payoff optimization for arm5 (including xscale) is the instruction scheduling. Due to the pipelining, there are wait states incurred on each load. Generally, if the compiler is smart enough to move a non-dependent instruction into the slot between the load and its use, the moved instruction is free. So.....

mov r1, #5
ldr r2, [sp]
add r2, r2, r1

will execute faster as....

ldr r2, [sp]
mov r1, #5
add r2, r2, r1

because it makes use of the delay slot imposed by the ldr.
lardman
From my benchmarks it doesn't look like ARM5 is significantly faster than ARM4 on my C750.

Could this be down to the type of code in the benchmark, or is it to do with "if the compiler is smart enough" ;-) ?


Si
mjalkut
I think most of the arm compilers now default to turning on instruction scheduling if -O is on, even when compiling for pre-arm5 processors. It doesn't hurt on non-v5 architectures; it just takes longer to compile. So since you are testing both v4 & v5 on an xscale, you shouldn't see much difference. Aside from the scheduling there isn't much difference. There are some xscale multiplies but they rarely make a difference in common code. The big difference is if you go down to arm3 and lose the 2-byte loads/stores. Now if you run the arm4 code on a real arm4 you will see slower times. But ironically, if you take non-scheduled code and run it on both a v4 and a v5, the v4 machine may actually show better performance, since the v5 will be stalled by wait states.
lardman
QUOTE
There are some xscale multiplies but they rarely make a difference in common code.


In fact they are not used in the benchmark (otherwise the ARM5 code wouldn't be able to run on my ARM4 5500).

QUOTE
Now if you run the arm4 on a real arm4 you will see slower times.


Slower times with which code? The 5500 is ARM4 and the ARM5 and ARM4 codes are pretty similar running on it.

QUOTE
But ironically, if you take non-scheduled code and run it on both a v4 and a v5, the v4 machine may actually show better performance, since the v5 will be stalled by wait states.


Yes, I can understand this. I really just wanted to see whether it was worth compiling things as ARM5 (for the alleged speed gains) and then either breaking compatibility with the majority (who have 5000D and 5500 machines) or having to support two versions. After seeing my results I don't think there's much to be gained; however, I will try some tests on real-world applications (in case the benchmark I chose isn't representative and ARM5 code can in fact produce better performance than ARM4).


Si
Zazz
For any kind of number crunching on the Z, I would guess that the kind of floating point emulation used would make the biggest difference. AFAIK, in the present implementation every fp operation causes some sort of exception (which is slow) that the kernel traps in order to do the emulation. OTOH, gcc's configure has options like --with-softfloat-support etc., and that kind of fp emulation should be notably faster than the present one. However, it looks like this isn't supported for g77, which many of the number crunching libs still rely on. There also seems to be a file gcc-3.3.2-arm-softfloat.diff.gz around. I'm not sure what all the implications of using softfloat would be, e.g. for libc. Even if everything, core system and apps, needed to be recompiled, that would not hurt too much. Someone more knowledgeable about these matters please shed some light on it.
mjalkut
You're right, I'm thinking of pre-StrongARM arm4 (can't get that bit unstuck (ouch <banged head on table>)).

Only very slight differences in scheduling between strongarm and xscale. Just the preload instruction, which I doubt many compilers are capable of making much use of. The longer pipeline can actually slow down the xscale for poorly scheduled code.

The real difference is in FP, as Zazz points out, and is due to the use of ldrd/strd double loads/stores. The biggest complaint about this is that it forces the stack to 8-byte boundaries (which I think most compilers are doing for arm4 by default as well, so as to allow for eventual compatibility between the two). But stack-intensive programs will notice an increase in runtime space requirements (not speed) with the 8-byte alignment. For arm4 this can be turned off, even though the ATPCS says it shouldn't be. You would only want to do this for a seriously embedded application like a toaster oven where there is no compatibility requirement.

The arm5 floating point will definitely break on arm4 if it uses these doubleword loads/stores. But for well-optimized arm5, floating-point emulation and long long int will see big speed improvements over arm4, since each doubleword ldrd/strd saves at least a cycle over the ldm/stm used for arm4.

A smart xscale compiler could even use strd/ldrd for other cases where it finds two consecutive loads/stores. This is much less flexible than ldm/stm, but will save cycles. A real task would be a compiler that manages register allocation so as to prepare for ldrd/strd, detecting reads/writes of consecutive memory locations and pre-scheduling loads/stores so that they will be convertible to ldrd/strd by the peep-holer. But this is the kind of demand the xscale (and strongarm) put on compilers, compared with much simpler arm cores (arm7tdmi and the like).
Invision Power Board © 2001-2019 Invision Power Services, Inc.