OESF Portables Forum
General Forums => General Discussion => Topic started by: lardman on April 19, 2004, 08:16:05 am
-
I've had a bit of a google and there are lots of people saying their Pocket PCs aren't running as fast as they could (because WinCE isn't/wasn't optimised for ARM5), but the question is really: why is this?
I note that the ARM5 chips have a longer pipeline, which means (so I read) that instruction execution order must be scheduled differently. Does anyone know how much of a difference this makes?
Does anyone know of a definitive comparison of the relative speeds of the two architectures, and of code generated by GCC 3.x targeting one arch or the other?
Last but not least, I thought I'd compile Dhrystone as ARM5 and ARM4 and do a bit of a test to see whether I can get some figures. Are there better comparison/benchmarking tools than this one?
Cheers,
Si
-
I can't get Dhrystone to compile for some reason. Any tips from anyone?
Here's Whetstone though:
http://sgp.zaurii.net/whetstone_arm4.out (http://sgp.zaurii.net/whetstone_arm4.out)
http://sgp.zaurii.net/whetstone_arm5.out (http://sgp.zaurii.net/whetstone_arm5.out)
I've not tried it yet (I don't have a link cable with me at uni).
Si
P.S. You might need to chmod +x these, as I put them up on the web from a Windows box.
-
Right, I've got Dhrystone compiled. Just for fun I built multiple versions at the different -O optimisation levels:
http://sgp.zaurii.net/dhry21_arm4_o0 (http://sgp.zaurii.net/dhry21_arm4_o0)
http://sgp.zaurii.net/dhry21_arm4_o1 (http://sgp.zaurii.net/dhry21_arm4_o1)
http://sgp.zaurii.net/dhry21_arm4_o2 (http://sgp.zaurii.net/dhry21_arm4_o2)
http://sgp.zaurii.net/dhry21_arm4_o3 (http://sgp.zaurii.net/dhry21_arm4_o3)
http://sgp.zaurii.net/dhry21_arm5_o0 (http://sgp.zaurii.net/dhry21_arm5_o0)
http://sgp.zaurii.net/dhry21_arm5_o1 (http://sgp.zaurii.net/dhry21_arm5_o1)
http://sgp.zaurii.net/dhry21_arm5_o2 (http://sgp.zaurii.net/dhry21_arm5_o2)
http://sgp.zaurii.net/dhry21_arm5_o3 (http://sgp.zaurii.net/dhry21_arm5_o3)
Again, all as yet untested.
Si
-
For your delectation:
Note that the alleged ARM5 code ran without trouble on my SL5500 (which has an ARM4 StrongARM processor - unless I got lucky somehow ;-)).
Presumably the benchmarks here don't use any of ARM5's added instructions (does anyone know anything about this?).
C750, OZ3.3.6pre1 64-0
----------------------
Microseconds for one run through Dhrystone:

        o0        o1        o2        o3
ARM5    3.8       1.9       1.8       1.7
ARM4    3.8       1.8       1.8       1.8

Dhrystones per second:

        o0        o1        o2        o3
ARM5    265957.4  538503.0  550660.8  574052.8
ARM4    264760.4  543478.3  553403.4  566893.4

VAX MIPS rating:

        o0        o1        o2        o3
ARM5    151.37    306.49    313.41    326.723
ARM4    150.689   309.322   314.971   322.649

Whetstone:

ARM5: loops 1000, iterations 1, duration 89 sec, C Converted Double Precision Whetstones 1.1 MIPS
ARM4: loops 1000, iterations 1, duration 89 sec, C Converted Double Precision Whetstones 1.1 MIPS
SL5500, OZ3.3.6pre1 64-0
------------------------
Microseconds for one run through Dhrystone:

        o0        o1        o2        o3
ARM5    6.8       3.3       3.3       3.2
ARM4    6.8       3.3       3.3       3.2

Dhrystones per second:

        o0        o1        o2        o3
ARM5    148016.6  300030.0  301750.2  313087.0
ARM4    147776.0  299490.9  304228.8  314070.4

VAX MIPS rating:

        o0        o1        o2        o3
ARM5    84.244    170.763   171.742   178.194
ARM4    84.107    170.456   173.152   178.754

Whetstone:

ARM5: loops 1000, iterations 1, duration 151 sec, C Converted Double Precision Whetstones 662.3 KIPS
ARM4: loops 1000, iterations 1, duration 151 sec, C Converted Double Precision Whetstones 662.3 KIPS
P4 1.4GHz, 256MB RAM, Mandrake 9.2
----------------------------------
Microseconds for one run through Dhrystone:

        o0         o1         o2         o3
        0.7        0.6        0.6        0.5

Dhrystones per second:

        o0         o1         o2         o3
        1342281.9  1769911.5  1798561.2  2155172.4

VAX MIPS rating:

        o0         o1         o2         o3
        763.962    1007.349   1023.655   1226.621

Whetstone:

Loops: 1000, Iterations: 1, Duration: 1 sec.
C Converted Double Precision Whetstones: 100.0 MIPS
-
thanks!
-
No problem - not sure what, if anything, it proves though.
Basically ARM5 and ARM4 are pretty much the same: ARM5 is slightly faster across the board on an XScale machine and about even on a StrongARM machine (random!?). The big difference is in the optimisation levels, with even -O1 making a huge difference and each additional -O level adding little more.
From this I'm fairly happy to continue making ARM4 binaries to use on everything. However, there are some points to be made:
1. Perhaps these benchmarks haven't shown the true differences between the two targets (either because the extra instructions obviously weren't employed, or because the optimised pipelining wasn't needed, etc.)?
2. Perhaps the optimisation in GCC 3.x isn't as good as it could be. Any ideas?
Si
-
2. Perhaps the optimisation in GCC 3.x isn't as good as it could be. Any ideas?
That's likely. I know for a fact that it's true on x86, where Linux programs compiled with the Intel compiler are normally faster than GCC binaries. There is plenty of evidence to support this.
-
The main payoff optimisation for ARM5 (including XScale) is instruction scheduling. Because of the longer pipeline, wait states are induced after each load. Generally, if the compiler is smart enough to move a non-dependent instruction into the slot between the load and its use, the moved instruction is free. So.....
    mov r1, #5
    ldr r2, [sp]
    add r2, r2, r1

will execute faster as....

    ldr r2, [sp]
    mov r1, #5
    add r2, r2, r1

because the reordering makes use of the delay slot imposed by the ldr.
-
From my benchmarks it doesn't look like ARM5 is significantly faster than ARM4 on my C750.
Could this be down to the type of code in the benchmark? Or is it a case of "if the compiler is smart enough" ;-)?
Si
-
I think most of the ARM compilers now default to turning on instruction scheduling when -O is on, even when compiling for pre-ARM5 processors. It doesn't hurt on non-v5 architectures; it just takes longer to compile. So since you are testing both v4 and v5 binaries on an XScale, you shouldn't see much difference. Aside from the scheduling there isn't much difference: there are some XScale multiplies, but they rarely make a difference in common code. The big difference is if you go down to ARM3 and lose the 2-byte loads/stores. Now if you run the ARM4 code on a real ARM4 you will see slower times. But ironically, if you take non-scheduled code and run it on both a v4 and a v5, the v4 machine may actually show better performance, since the v5 will be stalled by wait states.
-
There are some XScale multiplies, but they rarely make a difference in common code.
In fact they are not used in the benchmark (otherwise the ARM5 code wouldn't be able to run on my ARM4 5500).
Now if you run the ARM4 code on a real ARM4 you will see slower times.
Slower times with which code? The 5500 is ARM4, and the ARM5 and ARM4 binaries perform pretty similarly on it.
But ironically, if you take non-scheduled code and run it on both a v4 and a v5, the v4 machine may actually show better performance, since the v5 will be stalled by wait states.
Yes, I can understand this. I really just wanted to see whether it was worth compiling things as ARM5 (for the alleged speed gains) and then either breaking compatibility with the majority (who have 5000D and 5500 machines) or having to support two versions. After seeing my results I don't think there's much to be gained; however, I will try some tests on real-world applications (in case the benchmark I chose isn't representative and ARM5 code can in fact produce better performance than ARM4).
Si
-
For any kind of number crunching on the Z, I would guess that the kind of floating-point emulation used makes the biggest difference. AFAIK, in the present implementation every FP operation causes some sort of exception (which is slow) that the kernel traps in order to do the emulation. OTOH, gcc's configure has options like --with-softfloat-support etc. This kind of FP emulation should be notably faster than the present one. However, it looks like this is not supported for g77, which many of the number-crunching libs still rely on. There also seems to be a file gcc-3.3.2-arm-softfloat.diff.gz around.
I'm not sure what all the implications of using softfloat would be, e.g. for libc. Even if everything, core system and apps, needed to be recompiled, that would not hurt too much. Someone more knowledgeable about these matters please shed some light on it.
-
You're right - I'm thinking of pre-StrongARM ARM4 (can't get that bit unstuck (ouch <banged head on table>)).
There are only very slight differences in scheduling between StrongARM and XScale - just the preload instruction, which I doubt many compilers are capable of making much use of. The longer pipeline can actually slow down the XScale on poorly scheduled code.
The real difference is in FP, as Zazz points out, and is due to the use of ldrd/strd doubleword loads/stores. The biggest complaint about these is that they force the stack to 8-byte alignment (which I think most compilers are doing for ARM4 by default as well, so as to allow for eventual compatibility between the two). Stack-intensive programs will notice an increase in runtime space requirements (not speed) with the 8-byte alignment. For ARM4 this can be turned off, even though the ATPCS says it shouldn't be; you would only want to do that for a seriously embedded application like a toaster oven, where there is no compatibility requirement.
ARM5 floating point will definitely break on ARM4 if it uses these doubleword loads/stores. But for well-optimised ARM5 code, floating-point emulation and long long int arithmetic will see big speed improvements over ARM4, since each doubleword ldrd/strd saves at least a cycle over the ldm/stm used for ARM4.
A smart XScale compiler could even use ldrd/strd in other cases where it finds two consecutive loads/stores. This is much less flexible than ldm/stm, but saves cycles. A real task would be a compiler that manages register allocation so as to prepare for ldrd/strd, detecting reads/writes of consecutive memory locations and pre-scheduling loads/stores so that the peephole optimiser can convert them to ldrd/strd. But this is the kind of demand that the XScale (and StrongARM) place on compilers, compared with the much simpler ARMs (ARM7TDMI and the like).