Some notes on MPlayer benchmarking. MPlayer has a
-benchmark option that measures the time spent decoding video, displaying video (including scaling and color conversion), and on audio.
One of the options that affects decoding performance is the IDCT implementation. It can be selected with
-lavdopts idct=# where # is a decimal number. The MPlayer man page contains the following information:
idct=<0-99>
IDCT algorithm
NOTE: To the best of our knowledge all these IDCTs do pass the IEEE1180 tests.
0 Automatically select a good one (default).
1 JPEG reference integer
2 simple
3 simplemmx
4 libmpeg2mmx (inaccurate, do not use for encoding with keyint >100)
5 ps2
6 mlib
7 arm
8 AltiVec
9 sh4
But the man page is a bit incomplete; more information can be found in libavcodec/avcodec.h:
#define FF_IDCT_AUTO 0
#define FF_IDCT_INT 1
#define FF_IDCT_SIMPLE 2
#define FF_IDCT_SIMPLEMMX 3
#define FF_IDCT_LIBMPEG2MMX 4
#define FF_IDCT_PS2 5
#define FF_IDCT_MLIB 6
#define FF_IDCT_ARM 7
#define FF_IDCT_ALTIVEC 8
#define FF_IDCT_SH4 9
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_H264 11
#define FF_IDCT_VP3 12
#define FF_IDCT_IPP 13
#define FF_IDCT_XVIDMMX 14
#define FF_IDCT_CAVS 15
#define FF_IDCT_SIMPLEARMV5TE 16
The following IDCT implementations are of interest on ARM:
#define FF_IDCT_ARM 7 (default idct that was used for ARM)
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_SIMPLEARMV5TE 16 (recently added in mplayer 1.0rc1)
In order to benchmark video decoding I used the
following video clip (10MB version, MD5=1d62b8819bf1433df0dc9b5257f4fc35). Direct link is here:
http://trailers.divx.com/Universal/Doom.divx
It does not matter which video is used; my only concern was that it be freely downloadable, so that results from different machines can be compared.
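Since comparing results between machines only makes sense if everybody decodes the same file, it may be worth checking the download against the MD5 quoted above. Here is a minimal sketch in Python (assuming Python is available on the machine holding the file; the function names are my own and not part of MPlayer):

```python
import hashlib

# MD5 quoted in this post for the 10MB version of the clip.
EXPECTED_MD5 = "1d62b8819bf1433df0dc9b5257f4fc35"

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_expected_clip(path):
    """True if the file matches the MD5 given in this post."""
    return file_md5(path) == EXPECTED_MD5
```

Calling is_expected_clip("Doom.divx") should return True for the file I used; the file name is simply whatever the download was saved as.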
My setup is MPlayer 1.0rc1, Nokia 770 (ARM926EJS 250MHz), gcc version 3.4.4 (release) (CodeSourcery ARM 2005q3-2), configured with CFLAGS="-O4 -mcpu=arm926ej-s -fomit-frame-pointer -ffast-math"
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=7 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=10 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 69.828s VO: 0.072s A: 0.000s Sys: 0.605s = 70.506s
BENCHMARKs: VC: 71.838s VO: 0.073s A: 0.000s Sys: 0.629s = 72.539s
BENCHMARKs: VC: 71.903s VO: 0.074s A: 0.000s Sys: 0.634s = 72.611s
BENCHMARKs: VC: 72.563s VO: 0.073s A: 0.000s Sys: 0.626s = 73.262s
BENCHMARKs: VC: 72.373s VO: 0.073s A: 0.000s Sys: 0.653s = 73.099s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
Here is also the result for FF_IDCT_SIMPLE (a plain C implementation with no assembly) for comparison:
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=2 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 71.117s VO: 0.072s A: 0.000s Sys: 0.622s = 71.811s
BENCHMARKs: VC: 72.435s VO: 0.072s A: 0.000s Sys: 0.598s = 73.105s
BENCHMARKs: VC: 72.576s VO: 0.073s A: 0.000s Sys: 0.663s = 73.312s
BENCHMARKs: VC: 73.364s VO: 0.074s A: 0.000s Sys: 0.660s = 74.098s
BENCHMARKs: VC: 73.304s VO: 0.073s A: 0.000s Sys: 0.637s = 74.014s
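Reading five lines per run by eye gets tedious, so here is a rough sketch of how the BENCHMARKs lines could be parsed and averaged. It assumes the line format is exactly the one shown in the logs above; the function names are mine, and the sample text is the idct=16 output copied from this post.

```python
import re

# Matches the summary line printed with -benchmark, e.g.
# "BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s"
LINE_RE = re.compile(
    r"BENCHMARKs:\s*VC:\s*([\d.]+)s\s*VO:\s*([\d.]+)s\s*A:\s*([\d.]+)s\s*"
    r"Sys:\s*([\d.]+)s\s*=\s*([\d.]+)s"
)

def parse_benchmarks(text):
    """Return one dict of timings (in seconds) per BENCHMARKs line in text."""
    keys = ("vc", "vo", "a", "sys", "total")
    return [dict(zip(keys, map(float, m.groups())))
            for m in LINE_RE.finditer(text)]

def mean_vc(text):
    """Average video decoding (VC) time over all parsed loop iterations."""
    runs = parse_benchmarks(text)
    return sum(r["vc"] for r in runs) / len(runs)

# Output of the idct=16 run, copied from the log above.
IDCT16_OUTPUT = """\
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
"""

if __name__ == "__main__":
    print("mean VC for idct=16: %.3fs" % mean_vc(IDCT16_OUTPUT))
```

The same two functions work on the output of the other runs if it is saved to a file and read in as text.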
So the fastest IDCT for the Nokia 770 is FF_IDCT_SIMPLEARMV5TE (number 16). It has some optimizations using ARMv5TE DSP instructions (single-cycle 16 x 16 bit multiplication), and it is now the default in MPlayer 1.0rc1 for any CPU that supports ARMv5TE instructions. This code is a first revision and can most likely be optimized further. The gain from switching IDCT implementations also varies between video files: I observed improvements of up to 10% (on high-bitrate but low-resolution movies). For this particular file the improvement is only about 6%.
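For anyone who wants to redo the arithmetic, here is a small sketch that computes how much decoding time idct=16 saves against each of the other implementations. The VC times are copied from the logs above; taking the mean of the five loop iterations as the figure to compare is my own choice, not something MPlayer reports. Against idct=7 this comes out at roughly 5.5% less decoding time, which is in line with the "about 6%" above.

```python
# VC (video decoding) times in seconds, copied from the benchmark logs above.
VC_TIMES = {
    2:  [71.117, 72.435, 72.576, 73.364, 73.304],   # FF_IDCT_SIMPLE
    7:  [67.369, 69.296, 69.346, 70.332, 70.067],   # FF_IDCT_ARM
    10: [69.828, 71.838, 71.903, 72.563, 72.373],   # FF_IDCT_SIMPLEARM
    16: [64.130, 65.372, 65.493, 66.321, 66.202],   # FF_IDCT_SIMPLEARMV5TE
}

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def saving_percent(baseline, candidate):
    """Percent of mean decoding time saved by candidate relative to baseline."""
    base = mean(VC_TIMES[baseline])
    return 100.0 * (base - mean(VC_TIMES[candidate])) / base

if __name__ == "__main__":
    for other in (7, 10, 2):
        print("idct=16 vs idct=%d: %.2f%% less decoding time"
              % (other, saving_percent(other, 16)))
```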
A strange thing in these benchmarks is that the results are somewhat inconsistent: decoding time tends to creep up from one loop iteration to the next.
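To put a number on that drift, one can compare the first and last loop iteration of each run. The sketch below does this with the VC times copied from the logs above; "drift" as defined here (last pass relative to first pass) is just my own ad hoc measure.

```python
# VC (video decoding) times in seconds per loop iteration, from the logs above.
VC_TIMES = {
    2:  [71.117, 72.435, 72.576, 73.364, 73.304],
    7:  [67.369, 69.296, 69.346, 70.332, 70.067],
    10: [69.828, 71.838, 71.903, 72.563, 72.373],
    16: [64.130, 65.372, 65.493, 66.321, 66.202],
}

def drift_percent(times):
    """How much slower the last iteration is than the first, in percent."""
    return 100.0 * (times[-1] - times[0]) / times[0]

def step_deltas(times):
    """Change in decoding time between consecutive iterations, in seconds."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

if __name__ == "__main__":
    for idct, times in sorted(VC_TIMES.items()):
        deltas = ", ".join("%+.3f" % d for d in step_deltas(times))
        print("idct=%-2d drift %.2f%%  steps: %s"
              % (idct, drift_percent(times), deltas))
```

In all four runs the last pass is between 3% and about 4% slower than the first, with most of the jump happening between the first and second pass.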
It would be very interesting to see some benchmark results from a Zaurus to find out which IDCT works best on it. MPlayer and ffmpeg don't have any iWMMXt-optimized IDCT right now (and one could provide some improvement, as iWMMXt should be able to do two 16 x 16 bit multiplications per cycle).
So more benchmarks are welcome, preferably using the same test file. Or you can suggest some other sample for testing. After running these benchmarks we can also see how big the performance difference between the Nokia 770 and Zaurus hardware is, which might be interesting to know.
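To make it easier to repeat exactly the same runs on other hardware, here is a rough sketch of a wrapper that builds the command line used in this post and keeps only the BENCHMARKs lines. It assumes mplayer is in PATH and that Python is available on the device (on a small device it may be simpler to just paste the shell commands above); the function names and defaults are mine.

```python
import subprocess

def build_command(idct, clip="Doom.divx", loops=5):
    """Return the mplayer argument list used for the runs in this post."""
    return [
        "mplayer", "-loop", str(loops), "-quiet", "-benchmark",
        "-nosound", "-vo", "null",
        "-lavdopts", "idct=%d" % idct,
        clip,
    ]

def run_benchmark(idct, clip="Doom.divx", loops=5):
    """Run mplayer and return its BENCHMARKs lines (stands in for `| grep`)."""
    result = subprocess.run(
        build_command(idct, clip, loops),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return [line for line in result.stdout.splitlines()
            if "BENCHMARKs" in line]
```

Looping over (2, 7, 10, 16) and printing run_benchmark(idct) for each value would reproduce the four logs above. On a non-ARM machine the ARM-specific numbers will of course not be useful.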