OESF Portables Forum
Topic started by: Serge on December 05, 2006, 05:43:15 pm
-
It would probably be a good idea to consolidate efforts and try to submit some of the useful ARM related patches upstream:
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-August/014460.html
http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/2006-September/046207.html
I can only test MPlayer on the Nokia 770, so I can't be sure whether optimizations specific to ARM9E (that's the core used in the Nokia 770) are also good for the Zaurus. So people who are able to compile MPlayer from source and test it on a Zaurus are welcome in this thread. One example is the new armv5te optimized idct in MPlayer 1.0rc1. Can anybody benchmark it on a Zaurus?
Also, this is not quite ARM architecture related, but the libmad based decoder in MPlayer seems to have trouble with variable bitrate audio (it loses sync with video). Some more details can be found at http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/2006-August/045017.html and in the follow-up messages. Any volunteers to investigate this problem?
All in all, the ffmpeg optimizations for ARM are not nearly as good as those for x86, so investing some time in them may provide some performance improvement.
-
I second that, a better player would be great.
I'm a noob with Linux, but if I can help in one way or another I would be pleased to.
See you
-
Hi!
Check atty sources, 99% of mplayer for the zaurus is optimized with iwmmx code.
Cheers,
Ludo.
-
Hi!
Check atty sources, 99% of mplayer for the zaurus is optimized with iwmmx code.
Cheers,
Ludo.
mpeg-video decoder isn't 99% of mplayer-atty, and those bits are in upstream mplayer as well.
-
mpeg-video decoder isn't 99% of mplayer-atty, and those bits are in upstream mplayer as well.
I'm not being sarcastic... it's overwhelming how much you know. I hope you don't start using that power for evil one day.
-
Check atty sources, 99% of mplayer for the zaurus is optimized with iwmmx code.
Well, that's very good. Can anybody verify that this iwmmx code works correctly and submit everything that is usable upstream? If it is already there, can you confirm that it is really in good shape?
I know that some of atty's code was committed to the upstream mplayer source tree (you can check the SVN changelog), but I doubt that anyone tested it. The check for iwmmx availability was only added to MPlayer configure script in 1.0rc1 release. So up until this last release, it was not usable without additional patches.
Speaking of iwmmx optimizations, the idct code still does not use iwmmx in MPlayer at all, and it is one of the most performance critical parts of the code. Only the last MPlayer release got an armv5te optimized idct, which was tuned according to the ARM9E instruction timings in http://www.arm.com/pdfs/DDI0222B_9EJS_r1p2.pdf. As far as I know, it was developed and tested on the Nokia 770, where it improved mpeg4 decoding performance by about 10%. Most likely this code is not very good for XScale, as XScale has a much more complicated pipeline with lots of interlocks if the code is not arranged the way it likes (see http://download.intel.com/design/intelxscale/27347302.pdf). I wonder whether some 'blended' idct code can be developed, or whether it is better to have separate implementations for ARM9E and XScale. Anyway, it needs to be benchmarked before making any decisions.
In addition, Zaurus builds of MPlayer seem to use some additional modules for hardware accelerated video output. I wonder if it is a good idea to contribute them upstream? MPlayer seems to have special video output code for some old 3dfx and matrox video cards, so I doubt that Zaurus specific video output code is any more exotic or any less worth supporting upstream.
-
The check for iwmmx availability was only added to MPlayer configure script in 1.0rc1 release. So up until this last release, it was not usable without additional patches.
We had patches for that in OE. I haven't tested it yet, though.
-
I'm very happy to learn that the ARM specific parts of mplayer are being actively developed so please keep us updated on its progress serge.
Antikx: I agree! I don't think any event in the world of OSS and computer hardware can escape the all-pervading attention of the supreme tech oracle that is koen. Seriously! I think that man must have an RSS reader, email client and web browser embedded in his head that he can monitor and post to even when asleep.
-
I'm very happy to learn that the ARM specific parts of mplayer are being actively developed so please keep us updated on its progress serge.
Well, 'actively developed' is a gross overestimation. I don't think anybody else is working on ARM optimizations for ffmpeg right now. And I have currently switched to developing hardware accelerated video output code for the Nokia 770: http://maemo.org/pipermail/maemo-developers/2006-December/006646.html
Anyway, further optimizations for the decoder are still needed. That is, if we want to at least make an attempt at getting proper playback support for unconverted video. Having to convert everything to 320x240 (or to 400x224 for 16:9) is not much fun. You are lucky to have a faster CPU in the Zaurus.
-
Just to keep you informed, the work on implementing an MPlayer video output driver with hardware YUV support for the Nokia 770 is more or less finished. At least it is in a usable state now.
But in order to get good performance at any video resolution, an optimized YV12->YUY2 scaler is still needed on the Nokia 770. By the way, how does Zaurus handle video scaling? Is it hardware accelerated or a software scaler is used? If it is software scaler, what YUV format is used for output?
Here is some mplayer console log output from the Nokia 770 (video is software scaled to 400x210 and then hardware pixel doubling is used to show it fullscreen as 800x420):
VO: [nokia770] 336x176 => 336x176 Planar YV12 [fs]
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: FAST_BILINEAR scaler, from yuv420p to yuyv422 using C
SwScaler: using FAST_BILINEAR C scaler for horizontal scaling
SwScaler: using 2-tap linear C scaler for vertical scaling (BGR)
SwScaler: 336x176 -> 400x210
What do you usually observe on your Zaurus?
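The 400x210 size in the log is consistent with an aspect-preserving fit into half of the screen. Here is a minimal sketch of that arithmetic, assuming the target box is half of the Nokia 770's 800x480 display (so that pixel doubling fills it) and that dimensions are rounded to the nearest even number. The rounding rule and the function name are assumptions made for this illustration, not something taken from the nokia770 driver.

```python
def fit_for_pixel_doubling(src_w, src_h, screen_w=800, screen_h=480):
    """Fit a video into half the screen size, keeping the aspect ratio.

    Hypothetical helper: rounding to even dimensions is an assumption
    made for this sketch, not something read from the driver source.
    """
    box_w, box_h = screen_w // 2, screen_h // 2
    scale = min(box_w / src_w, box_h / src_h)
    # Round each dimension to the nearest even number.
    out_w = 2 * round(src_w * scale / 2)
    out_h = 2 * round(src_h * scale / 2)
    return out_w, out_h


scaled = fit_for_pixel_doubling(336, 176)
doubled = (scaled[0] * 2, scaled[1] * 2)
print(scaled, doubled)
```

For the 336x176 clip this gives 400x210 before doubling and 800x420 after, matching the sizes mentioned above.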
-
By the way, how does Zaurus handle video scaling? Is it hardware accelerated or a software scaler is used? If it is software scaler, what YUV format is used for output?
That depends on the models, but basically:
* collie: no acceleration at all
* poodle: ditto
* c7x0: ati imageon w100 which can do limited scaling, YUV transform and idct (http://libw100.sf.net/)
* cxxxx: pxa270fb, which doesn't do scaling AFAIK, but can do YUV transforms and has a small amount of SRAM to do faster blitting when using QVGA.
The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.
The c7x0 models would benefit from people helping the libw100 project.
-
The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.
The c7x0 models would benefit from people helping the libw100 project.
'XorA' in #oe on irc.freenode.net is our resident mplayer guru and 'sirfred' the w100 guru.
-
Some information about mplayer benchmarking. MPlayer has a -benchmark option which measures the time spent on decoding video, displaying video (including scaling and color conversion) and audio.
One of the options that affects decoding performance is the idct implementation. It can be selected with -lavdopts idct=#, where # is a decimal number. The MPlayer man page contains the following information:
idct=<0-99>
IDCT algorithm
NOTE: To the best of our knowledge all these IDCTs do pass the IEEE1180 tests.
0 Automatically select a good one (default).
1 JPEG reference integer
2 simple
3 simplemmx
4 libmpeg2mmx (inaccurate, do not use for encoding with keyint >100)
5 ps2
6 mlib
7 arm
8 AltiVec
9 sh4
But the man page is a bit incomplete, and more information can be found in libavcodec/avcodec.h:
#define FF_IDCT_AUTO 0
#define FF_IDCT_INT 1
#define FF_IDCT_SIMPLE 2
#define FF_IDCT_SIMPLEMMX 3
#define FF_IDCT_LIBMPEG2MMX 4
#define FF_IDCT_PS2 5
#define FF_IDCT_MLIB 6
#define FF_IDCT_ARM 7
#define FF_IDCT_ALTIVEC 8
#define FF_IDCT_SH4 9
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_H264 11
#define FF_IDCT_VP3 12
#define FF_IDCT_IPP 13
#define FF_IDCT_XVIDMMX 14
#define FF_IDCT_CAVS 15
#define FF_IDCT_SIMPLEARMV5TE 16
The following idct implementations can be interesting on ARM:
#define FF_IDCT_ARM 7 (default idct that was used for ARM)
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_SIMPLEARMV5TE 16 (recently added in mplayer 1.0rc1)
In order to benchmark video decoding I used the following video clip: http://www.divx.com/movies/detail.php?movieID=57&cID=1 (10MB version, MD5=1d62b8819bf1433df0dc9b5257f4fc35). Direct link is here: http://trailers.divx.com/Universal/Doom.divx
It does not matter which video is used; my only concern was that it should be freely downloadable, so that results from different machines can be compared.
My setup is MPlayer 1.0rc1, Nokia 770 (ARM926EJS 250MHz), gcc version 3.4.4 (release) (CodeSourcery ARM 2005q3-2), configured with CFLAGS="-O4 -mcpu=arm926ej-s -fomit-frame-pointer -ffast-math"
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=7 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=10 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 69.828s VO: 0.072s A: 0.000s Sys: 0.605s = 70.506s
BENCHMARKs: VC: 71.838s VO: 0.073s A: 0.000s Sys: 0.629s = 72.539s
BENCHMARKs: VC: 71.903s VO: 0.074s A: 0.000s Sys: 0.634s = 72.611s
BENCHMARKs: VC: 72.563s VO: 0.073s A: 0.000s Sys: 0.626s = 73.262s
BENCHMARKs: VC: 72.373s VO: 0.073s A: 0.000s Sys: 0.653s = 73.099s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
Here is also the result for FF_IDCT_SIMPLE (just C implementation with no assembly) for comparison:
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=2 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 71.117s VO: 0.072s A: 0.000s Sys: 0.622s = 71.811s
BENCHMARKs: VC: 72.435s VO: 0.072s A: 0.000s Sys: 0.598s = 73.105s
BENCHMARKs: VC: 72.576s VO: 0.073s A: 0.000s Sys: 0.663s = 73.312s
BENCHMARKs: VC: 73.364s VO: 0.074s A: 0.000s Sys: 0.660s = 74.098s
BENCHMARKs: VC: 73.304s VO: 0.073s A: 0.000s Sys: 0.637s = 74.014s
So the fastest idct for the Nokia 770 is FF_IDCT_SIMPLEARMV5TE (number 16). It has some optimizations using armv5te dsp instructions (single cycle 16 x 16 bit multiplication). It is also now the default setting in mplayer 1.0rc1 for any cpu that supports armv5te instructions. This code is the first revision and most likely can be optimized even more. Also, the overall difference between idct implementations may vary for different video files: I observed performance improvements of up to 10% (on high bitrate but low resolution movies). For this particular file the improvement is only about 5-6% over the old default (idct=7).
A strange thing in these benchmarks is that the results are a bit inconsistent and decoding time slightly increases with each loop iteration.
It would be very interesting to see some benchmark results from the Zaurus to see which idct works best for it. MPlayer and ffmpeg don't have any iwmmxt optimized idct right now (and it could provide some improvement, as it should be able to do two 16 x 16 bit multiplications per cycle).
So more benchmarks are welcome, preferably using the same test file. Or you can suggest some other sample for testing. Also, after running these benchmarks we can see how big the performance difference between the Nokia 770 and Zaurus hardware is, which might also be interesting to know.
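To compare pasted results from different machines, the BENCHMARKs lines can be averaged per idct setting. The following is a small sketch of that, using the idct=7 and idct=16 runs from the Nokia 770 as sample input. The function name and the idea of picking the idct number out of the command line are choices made for this example only.

```python
import re
from collections import defaultdict
from statistics import mean

# Sample input: the idct=7 and idct=16 runs from the Nokia 770.
LOG = """\
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=7 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
"""

IDCT_RE = re.compile(r"idct=(\d+)")
VC_RE = re.compile(r"^BENCHMARKs: VC:\s*([\d.]+)s")


def mean_vc_by_idct(log_text):
    """Return {idct number: mean VC seconds} for a pasted benchmark log."""
    runs = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        vc = VC_RE.match(line)
        if vc and current is not None:
            # A result line belongs to the most recent idct setting seen.
            runs[current].append(float(vc.group(1)))
            continue
        idct = IDCT_RE.search(line)
        if idct:
            current = int(idct.group(1))
    return {k: mean(v) for k, v in runs.items()}


means = mean_vc_by_idct(LOG)
gain = 100.0 * (means[7] - means[16]) / means[7]
print(f"idct=7: {means[7]:.2f}s, idct=16: {means[16]:.2f}s, gain: {gain:.1f}%")
```

On this input the mean video decoding time drops from about 69.3s to about 65.5s, a gain of roughly 5.5%.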
-
Hi Serge!
I conducted a bunch of benchmark tests using a Zaurus C3000 running pdaXii13 build4 full, which includes Meanie's build of mplayer 1.0rc1 (he has named the binary mplayer3). I used the same Doom divx clip that you linked in all the tests, with the same command you used.
For these first four sets of benchmarks the Z was running at the standard 416MHz setting and the commands were run in an X11 terminal:
------------------------------
idct7:
BENCHMARKs: VC: 58.484s VO: 0.088s A: 0.000s Sys: 2.460s = 61.032s
BENCHMARKs: VC: 57.614s VO: 0.070s A: 0.000s Sys: 0.848s = 58.531s
BENCHMARKs: VC: 57.865s VO: 0.075s A: 0.000s Sys: 0.842s = 58.781s
BENCHMARKs: VC: 57.753s VO: 0.078s A: 0.000s Sys: 0.851s = 58.682s
BENCHMARKs: VC: 57.837s VO: 0.074s A: 0.000s Sys: 0.835s = 58.746s
idct10:
BENCHMARKs: VC: 59.045s VO: 0.072s A: 0.000s Sys: 2.366s = 61.483s
BENCHMARKs: VC: 59.071s VO: 0.070s A: 0.000s Sys: 0.989s = 60.130s
BENCHMARKs: VC: 59.188s VO: 0.071s A: 0.000s Sys: 0.859s = 60.118s
BENCHMARKs: VC: 59.163s VO: 0.071s A: 0.000s Sys: 0.855s = 60.089s
BENCHMARKs: VC: 59.157s VO: 0.070s A: 0.000s Sys: 0.838s = 60.065s
idct16:
BENCHMARKs: VC: 54.462s VO: 0.124s A: 0.000s Sys: 2.615s = 57.201s
BENCHMARKs: VC: 57.047s VO: 0.078s A: 0.000s Sys: 2.020s = 59.145s
BENCHMARKs: VC: 56.930s VO: 0.072s A: 0.000s Sys: 1.586s = 58.588s
BENCHMARKs: VC: 53.739s VO: 0.072s A: 0.000s Sys: 0.859s = 54.670s
BENCHMARKs: VC: 53.948s VO: 0.070s A: 0.000s Sys: 1.672s = 55.690s
idct2:
BENCHMARKs: VC: 59.714s VO: 0.070s A: 0.000s Sys: 2.524s = 62.308s
BENCHMARKs: VC: 61.109s VO: 0.074s A: 0.000s Sys: 1.822s = 63.005s
BENCHMARKs: VC: 60.556s VO: 0.071s A: 0.000s Sys: 0.879s = 61.506s
BENCHMARKs: VC: 60.216s VO: 0.070s A: 0.000s Sys: 0.847s = 61.133s
BENCHMARKs: VC: 60.157s VO: 0.070s A: 0.000s Sys: 0.898s = 61.125s
----------------------------
For the next four sets of benchmarks I overclocked to 624MHz, quit out of X11 and ran the command from the console for max performance:
idct7:
BENCHMARKs: VC: 37.560s VO: 0.072s A: 0.000s Sys: 2.349s = 39.981s
BENCHMARKs: VC: 38.063s VO: 0.049s A: 0.000s Sys: 0.561s = 38.673s
BENCHMARKs: VC: 38.066s VO: 0.050s A: 0.000s Sys: 0.563s = 38.679s
BENCHMARKs: VC: 38.078s VO: 0.050s A: 0.000s Sys: 0.560s = 38.688s
BENCHMARKs: VC: 38.081s VO: 0.050s A: 0.000s Sys: 0.559s = 38.690s
idct10:
BENCHMARKs: VC: 36.988s VO: 0.050s A: 0.000s Sys: 0.562s = 37.600s
BENCHMARKs: VC: 38.759s VO: 0.049s A: 0.000s Sys: 0.559s = 39.368s
BENCHMARKs: VC: 38.770s VO: 0.050s A: 0.000s Sys: 0.563s = 39.382s
BENCHMARKs: VC: 38.718s VO: 0.050s A: 0.000s Sys: 0.560s = 39.328s
BENCHMARKs: VC: 38.736s VO: 0.049s A: 0.000s Sys: 0.559s = 39.344s
idct16:
BENCHMARKs: VC: 33.716s VO: 0.050s A: 0.000s Sys: 0.567s = 34.333s
BENCHMARKs: VC: 35.310s VO: 0.049s A: 0.000s Sys: 0.559s = 35.919s
BENCHMARKs: VC: 35.401s VO: 0.050s A: 0.000s Sys: 0.563s = 36.014s
BENCHMARKs: VC: 35.281s VO: 0.050s A: 0.000s Sys: 0.560s = 35.891s
BENCHMARKs: VC: 35.354s VO: 0.049s A: 0.000s Sys: 0.559s = 35.962s
idct2:
BENCHMARKs: VC: 37.474s VO: 0.050s A: 0.000s Sys: 0.565s = 38.088s
BENCHMARKs: VC: 39.184s VO: 0.049s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.344s VO: 0.050s A: 0.000s Sys: 0.564s = 39.957s
BENCHMARKs: VC: 39.183s VO: 0.050s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.253s VO: 0.049s A: 0.000s Sys: 0.560s = 39.863s
--------------------
So, just as on the 770, it would seem idct16 is clearly the fastest.
-
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.
If that isn't the case, arm926 cores kick xscale ass
root@h2200:/data# sh doom-test.sh
idct is 2
BENCHMARKs: VC: 82.432s VO: 0.071s A: 0.000s Sys: 1.293s = 83.796s
BENCHMARKs: VC: 80.798s VO: 0.066s A: 0.000s Sys: 0.916s = 81.780s
BENCHMARKs: VC: 80.758s VO: 0.067s A: 0.000s Sys: 0.912s = 81.737s
BENCHMARKs: VC: 80.676s VO: 0.070s A: 0.000s Sys: 0.897s = 81.643s
BENCHMARKs: VC: 80.649s VO: 0.067s A: 0.000s Sys: 0.950s = 81.665s
idct is 7
BENCHMARKs: VC: 75.593s VO: 0.069s A: 0.000s Sys: 0.902s = 76.564s
BENCHMARKs: VC: 78.993s VO: 0.069s A: 0.000s Sys: 0.903s = 79.965s
BENCHMARKs: VC: 79.248s VO: 0.066s A: 0.000s Sys: 0.933s = 80.246s
BENCHMARKs: VC: 79.242s VO: 0.067s A: 0.000s Sys: 0.931s = 80.239s
BENCHMARKs: VC: 79.080s VO: 0.066s A: 0.000s Sys: 0.904s = 80.050s
idct is 10
BENCHMARKs: VC: 77.020s VO: 0.067s A: 0.000s Sys: 0.905s = 77.992s
BENCHMARKs: VC: 80.152s VO: 0.066s A: 0.000s Sys: 0.905s = 81.124s
BENCHMARKs: VC: 80.219s VO: 0.181s A: 0.000s Sys: 0.903s = 81.303s
BENCHMARKs: VC: 80.238s VO: 0.066s A: 0.000s Sys: 1.024s = 81.328s
BENCHMARKs: VC: 80.359s VO: 0.066s A: 0.000s Sys: 0.906s = 81.331s
idct is 16
BENCHMARKs: VC: 73.140s VO: 0.068s A: 0.000s Sys: 0.916s = 74.124s
BENCHMARKs: VC: 76.616s VO: 0.066s A: 0.000s Sys: 1.014s = 77.695s
BENCHMARKs: VC: 76.927s VO: 0.066s A: 0.000s Sys: 0.905s = 77.899s
BENCHMARKs: VC: 76.992s VO: 0.069s A: 0.000s Sys: 0.906s = 77.966s
BENCHMARKs: VC: 77.157s VO: 0.067s A: 0.000s Sys: 0.940s = 78.165s
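The doom-test.sh script itself was not posted, so the following is only a guess at an equivalent driver. It loops over the idct values and runs the same mplayer command line that was used earlier in the thread. The function names are invented for this sketch, and it assumes mplayer is in PATH and Doom.divx is in the current directory.

```python
import shutil
import subprocess

IDCT_VALUES = (2, 7, 10, 16)


def benchmark_command(idct, clip="Doom.divx", loops=5):
    """Build the mplayer command line used for the benchmarks in this thread."""
    return ["mplayer", "-loop", str(loops), "-quiet", "-benchmark",
            "-nosound", "-vo", "null", "-lavdopts", f"idct={idct}", clip]


def run_all(clip="Doom.divx"):
    """Run every idct variant and print only the BENCHMARKs lines."""
    for idct in IDCT_VALUES:
        print(f"idct is {idct}")
        result = subprocess.run(benchmark_command(idct, clip),
                                capture_output=True, text=True)
        for line in result.stdout.splitlines():
            if line.startswith("BENCHMARKs"):
                print(line)


if __name__ == "__main__":
    if shutil.which("mplayer") is None:
        print("mplayer not found in PATH, nothing to run")
    else:
        run_all()
```

On the device itself a plain shell loop around the same command is probably more practical; the point here is only to pin down the exact options so that results stay comparable.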
-
Thanks for running the benchmarks. They show that these armv5te optimizations for idct are useful for xscale too. I was just unsure whether it is possible to develop shared code that runs fine on both arm926 and xscale, or whether two different versions have to be implemented. I'll try to optimize this idct further, primarily for arm926, but will keep in mind that this code is also useful on xscale and take that into account. Anyway, an iwmmxt implementation of idct specifically optimized for xscale may be a better choice (idct takes quite a noticeable fraction of decoding time, so it would at least be useful for some machines like the zaurus C3000). If anybody skilled with arm assembly would like to try it, I could help with information (but I don't have any machine that can run iwmmxt code).
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.
That's interesting. If memory performance is really very important for mplayer, it should probably be possible to find the parts of the code with heavy memory use and optimize their memory access patterns for better cache and memory bus utilization. Some time ago I already did some tests trying to figure out how to make the best use of memory bandwidth on the Nokia 770: http://maemo.org/pipermail/maemo-developers/2006-December/006579.html
This information may turn out to be very useful for further optimizations.
If that isn't the case, arm926 cores kick xscale ass
Well, the arm926 core should be somewhat faster per clock. Here are some links to optimization docs for different arm flavours: http://www.internettablettalk.com/forums/showthread.php?t=2406
But I expected that 416MHz should still be a lot faster because of the higher cpu clock frequency. Maybe memory performance really is a limiting factor here, and it brings the performance of all these chips closer to each other.
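One rough way to put numbers on the per-clock comparison is to multiply the decoding time by the clock frequency. The sketch below does that for the idct=16 VC times posted above, using the clock speeds given in the posts (250MHz, 416MHz, 400MHz). Keep in mind that the three results come from different builds and compilers and that the cycle counts include memory stalls, so this is not a pure core-versus-core measurement.

```python
from statistics import mean

# (clock in MHz, VC seconds for idct=16), copied from the posts above.
RESULTS = {
    "Nokia 770 (ARM926EJ-S)": (250, [64.130, 65.372, 65.493, 66.321, 66.202]),
    "Zaurus C3000 (PXA270)": (416, [54.462, 57.047, 56.930, 53.739, 53.948]),
    "iPAQ h2200 (PXA255)": (400, [73.140, 76.616, 76.927, 76.992, 77.157]),
}


def gigacycles(mhz, vc_seconds):
    """CPU cycles (in units of 1e9) spent on video decoding per loop."""
    return mhz * 1e6 * mean(vc_seconds) / 1e9


baseline = gigacycles(*RESULTS["Nokia 770 (ARM926EJ-S)"])
for name, (mhz, times) in RESULTS.items():
    g = gigacycles(mhz, times)
    print(f"{name}: {g:.1f} Gcycles, {g / baseline:.2f}x the 770")
```

On these numbers the 770 needs about 16.4 billion cycles per pass, the C3000 about 23.0 and the h2200 about 30.5, so the XScale machines spend roughly 1.4x and 1.9x as many cycles on the same clip.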
Another possible explanation could be a nonoptimal set of optimization options or an older version of gcc used for zaurus builds of mplayer. It should be relatively easy to test mplayer with a different set of optimization options. You can take the upstream mplayer 1.0rc1 tarball and compile it using:
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make
It may have some problems with the video/audio output drivers if compiled without the zaurus specific patches, but that should not matter when testing decoding performance only.
-
The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.
That seems a bit too low. I wonder if mplayer was configured and compiled correctly. The point is that the motion compensation code in mplayer is currently much better optimized for iwmmxt (all that work was done by atty). You can just look into the mplayer sources.
Here is the code used for ARM without iwmmx (libavcodec/armv4l/dsputil_arm.c):
/* c->put_pixels_tab[0][0] = put_pixels16_arm; */ // NG!
c->put_pixels_tab[0][1] = put_pixels16_x2_arm; //OK!
c->put_pixels_tab[0][2] = put_pixels16_y2_arm; //OK!
/* c->put_pixels_tab[0][3] = put_pixels16_xy2_arm; /\* NG *\/ */
/* c->put_no_rnd_pixels_tab[0][0] = put_pixels16_arm; */
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_arm; // OK
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_arm; //NG */
c->put_pixels_tab[1][0] = put_pixels8_arm; //OK
c->put_pixels_tab[1][1] = put_pixels8_x2_arm; //OK
/* c->put_pixels_tab[1][2] = put_pixels8_y2_arm; //NG */
/* c->put_pixels_tab[1][3] = put_pixels8_xy2_arm; //NG */
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_arm;//OK
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_arm; //OK
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_arm;//NG */
Compare it with the following (libavcodec/armv4l/dsputil_iwmmxt.c):
c->put_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_pixels_tab[0][1] = put_pixels16_x2_iwmmxt;
c->put_pixels_tab[0][2] = put_pixels16_y2_iwmmxt;
c->put_pixels_tab[0][3] = put_pixels16_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_iwmmxt;
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_iwmmxt;
c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_iwmmxt;
c->put_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_pixels_tab[1][1] = put_pixels8_x2_iwmmxt;
c->put_pixels_tab[1][2] = put_pixels8_y2_iwmmxt;
c->put_pixels_tab[1][3] = put_pixels8_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_iwmmxt;
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_iwmmxt;
c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_iwmmxt;
c->avg_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_pixels_tab[0][1] = avg_pixels16_x2_iwmmxt;
c->avg_pixels_tab[0][2] = avg_pixels16_y2_iwmmxt;
c->avg_pixels_tab[0][3] = avg_pixels16_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_no_rnd_pixels_tab[0][1] = avg_no_rnd_pixels16_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][2] = avg_no_rnd_pixels16_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][3] = avg_no_rnd_pixels16_xy2_iwmmxt;
c->avg_pixels_tab[1][0] = avg_pixels8_iwmmxt;
c->avg_pixels_tab[1][1] = avg_pixels8_x2_iwmmxt;
c->avg_pixels_tab[1][2] = avg_pixels8_y2_iwmmxt;
c->avg_pixels_tab[1][3] = avg_pixels8_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][0] = avg_no_rnd_pixels8_iwmmxt;
c->avg_no_rnd_pixels_tab[1][1] = avg_no_rnd_pixels8_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][2] = avg_no_rnd_pixels8_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][3] = avg_no_rnd_pixels8_xy2_iwmmxt;
As you can see, machines that support iwmmxt have all the motion compensation related functions implemented in hand optimized assembly. It is strange that this only results in about a 2% improvement.
The c7x0 models would benefit from people helping the libw100 project.
I see, but I can't provide any help here as I don't have any hardware but the Nokia 770. More people interested in improving mplayer performance on different ARM devices are welcome here.
I can only do assembly optimizations for ffmpeg using the armv5te instruction set (including the fast single cycle multiply dsp instructions).
Concerning current progress, I have made some modifications to valgrind (the callgrind part) to make it simulate read-allocate cache behaviour (arm926 uses such a cache), and now have some information about the parts of the code that cause many cache misses and do lots of work with memory.
Things that may need optimization and could provide some improvement are:
- idct
- motion compensation (for non iwmmxt devices)
- the dct_unquantize_h263_intra function (according to callgrind statistics it accounts for almost 7% of the instructions executed for this Doom video fragment, and it contains lots of multiplications which can be accelerated using dsp instructions). One more sign that it needs optimizing is that the x86 code also contains an mmx version of this function.
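For readers who have not looked at the unquantizer: the core of H.263-style dequantization is one multiply and one add per nonzero AC coefficient, which is why fast multiply instructions matter here. Below is a simplified sketch of that loop, written from a general understanding of the scheme rather than copied from libavcodec. It ignores the DC coefficient, the advanced intra coding case and the scan-order limit, and the function name is made up for illustration.

```python
def h263_unquantize_ac(coeffs, qscale):
    """Dequantize AC coefficients H.263-style (simplified sketch).

    Each nonzero level is scaled by 2*qscale and pushed away from zero
    by an odd offset derived from qscale; zeros stay zero.
    """
    qmul = qscale << 1
    qadd = (qscale - 1) | 1
    out = []
    for level in coeffs:
        if level > 0:
            level = level * qmul + qadd
        elif level < 0:
            level = level * qmul - qadd
        out.append(level)
    return out


print(h263_unquantize_ac([3, -2, 0, 1], qscale=5))
```

With qscale=5 the multiplier is 10 and the offset is 5, so the sample input becomes [35, -25, 0, 15]. The per-coefficient sign test and multiply are the parts an assembly version would try to make cheap.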
Also, I can prepare some small test programs for synthetic benchmarking of all these parts of the code (idct, motion compensation, unquantize), so that it will be easier to see whether an optimization has any effect. It is hard to notice a substantial effect from any one of these optimizations when just monitoring full video decoding time, but they are cumulative and all together can provide quite a visible improvement. I have already done something like this when I tried to optimize the idct code (not a very successful attempt, because it focused on code that was not the real bottleneck: row processing in idct generally takes much less time than column processing):
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2006-September/045837.html
Would anyone like to try running these benchmarks, or take a more active part in optimizing mplayer/ffmpeg?
PS. By the way, is it possible to watch that Doom video clip without (many) frame drops on a non-overclocked Zaurus?
-
Hi Serge!
I'm willing to do some more benchmarking if it will assist mplayer ARM development
-
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math"
There is no "-O4". Maximum optimization is -O3. And be careful with it. Sometimes it is better to use -O2 or even -Os for performance... If you do more optimization, the binary grows larger... And -fomit-frame-pointer is enabled in -O, -O2, -O3, -Os
On the ARM version of GCC there is a small difference (according to man gcc) between -mcpu=iwmmxt and -mtune=iwmmxt. So for max. performance it is good to use both.
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Optimize-Options.html#Optimize-Options
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/ARM-Options.html#ARM-Options
-mtune=name
This option is very similar to the -mcpu= option, except that instead of specifying the actual target processor type, and hence restricting which instructions can be used, it specifies that GCC should tune the performance of the code as if the target were of the type specified in this option, but still choosing the instructions that it will generate based on the cpu specified by a -mcpu= option. For some ARM implementations better performance can be obtained by using this option.
-
civil: http://www.hpc.ru/board/viewtopic.php?t=99079&start=10
Please read my old reply, in Russian, to your same old question. I tried an online web translator, but the result is not very readable: http://www.online-translator.com/url/tran_url.asp?lang=en&url=http%3A%2F%2Fwww.hpc.ru%2Fboard%2Fviewtopic.php%3Ft%3D99079%26start%3D10&direction=re&template=General&cp1=NO&cp2=NO&autotranslate=on&psubmit2.x=0&psubmit2.y=0
Anyway, the summary is the following: suggestions for better compiler optimization options are very much welcome if they are confirmed by benchmark results. Unfortunately you did not provide any benchmarks even after you were asked for them. I would appreciate it if we keep the discussion constructive and friendly here and don't start discussing theoretical matters about how gcc is supposed to work. Thanks.
-
Yeah Civil, be civil
(Sorry, couldn't resist)
-
Serge
Those were just comments... I don't know English well enough to write correct sentences, so I write as I can...
Anyway, the summary is the following: suggestions for better compiler optimization options are very much welcome if they are confirmed by benchmark results.
I'll try to compile mplayer 1.0rc1 with different options:
1) -O2 -mtune=iwmmxt -mcpu=iwmmxt
2) -O3 -mtune=iwmmxt -mcpu=iwmmxt
3) -O3 -mtune=iwmmxt -mcpu=iwmmxt -fomit-frame-pointer
and maybe with others. It depends on how long it takes to compile mplayer on the Z. Then I'll post the benchmark results here, in this post, followed by the results I get using mplayer from cacko.
-
Made a patch for the 'dct_unquantize_h263_intra' function today:
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/050356.html
It should be useful for armv5te devices which do not have iwmmxt support (the Nokia 770, and probably XScale chips older than PXA27x). This dct_unquantize_h263_intra function takes about 7% of decoding time for the Doom.divx trailer, so optimizing it provides a visible performance improvement, at least for this particular video file.
It can probably be optimized even more, and a better final version of this patch will be available a bit later.
-
OK, committed the 'dct_unquantize_h263_intra' optimization to maemo mplayer svn. It would be interesting to see the results of running the 'test-unquantize' test program to benchmark how it behaves on XScale. Some details about the results from the Nokia 770 are here: http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/050363.html
Here are some step by step instructions:
1. Check out maemo mplayer svn: 'svn co https://garage.maemo.org/svn/mplayer/trunk maemo-mplayer'
2. Go to 'maemo-mplayer/libavcodec/tests'
3. Compile the test program using the supplied makefile (you will need to set the CC and CFLAGS variables according to the name of your compiler and preferred optimization settings). You can check 'build-tests-n770.sh' as an example of settings for compiling this test program for the Nokia 770 (using a cross compiler from gentoo crossdev).
4. Run the test program on your device and post the results here
This optimization may be useful for PXA255 or other XScale chips that do not have iwmmx support (do I understand that correctly?). The 'dct_unquantize_h263' function also has an iwmmxt optimized implementation in mplayer, which should be used on the latest xscale chips (and the SIMD instructions from iwmmxt should be much better for this kind of code). By the way, the absence of iwmmxt support could also explain the very poor results from the PXA255 box provided by koen. Can somebody investigate what's going on, as not everything is clear yet?
-
Well, some more optimizations for the h263 unquantizer. I think it is the final version and it is hardly possible to optimize it more (for armv5te).
Test from Nokia 770:
/media/mmc1 $ ./test-unquantize
no cpu clock frequency specified, trying to autodetect it...
... detected as 251.2MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.07063 usec per element, or 17.7 cycles (251.2MHz)
dct_unquantize_h263_special_helper_armv5te time=0.02692 usec per element, or 6.8 cycles (251.2MHz)
I wonder how it performs on XScale per clock as loads are now done as 64-bits at a time using LDRD instruction (see my previous post about the details how to run the test).
PS. Thanks to koen for running previous benchmark, it showed that assembly optimized code for dct_unquantize_h263 is also roughly 2x faster than gcc generated code on XScale. But it would be interesting to see some results with this final patch.
Edit: Result for 400MHz XScale cpu (from koen):
root@h2200:/data/site/mplayer/libavcodec/tests# ./test-unquantize 400; ./test-unquantize
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04329 usec per element, or 17.3 cycles (400.0MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01671 usec per element, or 6.7 cycles (400.0MHz)
no cpu clock frequency specified, trying to autodetect it...
... detected as 376.7MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04277 usec per element, or 16.1 cycles (376.7MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01655 usec per element, or 6.2 cycles (376.7MHz)
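The "cycles" column in these logs is just the per-element time multiplied by the clock frequency (1 MHz is one cycle per microsecond). A small sketch that redoes this arithmetic from the numbers posted above and also prints the C-to-assembly speedup; the function and variable names are made up for this example and are not part of test-unquantize.

```python
def cycles_per_element(usec_per_element, cpu_mhz):
    # microseconds * (cycles per microsecond) = cycles
    return usec_per_element * cpu_mhz


# (clock in MHz, C helper time, armv5te helper time), copied from the logs above
RESULTS = {
    "Nokia 770, autodetected": (251.2, 0.07063, 0.02692),
    "XScale, 400 given on command line": (400.0, 0.04329, 0.01671),
    "XScale, autodetected": (376.7, 0.04277, 0.01655),
}


def summarize(results):
    rows = []
    for name, (mhz, c_time, asm_time) in results.items():
        rows.append((
            name,
            round(cycles_per_element(c_time, mhz), 1),
            round(cycles_per_element(asm_time, mhz), 1),
            c_time / asm_time,   # speedup of assembly over C
        ))
    return rows


for name, c_cycles, asm_cycles, speedup in summarize(RESULTS):
    print(f"{name}: C {c_cycles} cycles, armv5te {asm_cycles} cycles, {speedup:.2f}x")
```

Comparing cycles rather than microseconds is what makes results from devices with different clocks directly comparable.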
-
Just for additional statistics, 'Doom benchmark' for Nokia N800 (keep in mind that MPlayer is not optimized for ARMv6 SIMD instructions at all right now, so these results have a good potential for improving):
mplayer -benchmark -lavdopts idct=16 -nosound -vo null -loop 5 -quiet Doom.divx
BENCHMARKs: VC: 47.556s VO: 0.069s A: 0.000s Sys: 0.634s = 48.259s
BENCHMARKs: VC: 48.413s VO: 0.071s A: 0.000s Sys: 0.618s = 49.101s
BENCHMARKs: VC: 48.561s VO: 0.073s A: 0.000s Sys: 0.593s = 49.228s
BENCHMARKs: VC: 48.731s VO: 0.072s A: 0.000s Sys: 0.624s = 49.427s
BENCHMARKs: VC: 49.398s VO: 0.072s A: 0.000s Sys: 0.633s = 50.102s
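Since many posts in this thread paste the same kind of 'BENCHMARKs:' lines, here is a rough helper for turning them into numbers. It is only a sketch: the regular expression is written against the lines as they appear in this thread and is an assumption about the format, not something taken from MPlayer documentation.

```python
import re
from statistics import mean

# Matches lines such as:
# BENCHMARKs: VC:  47.556s VO:   0.069s A:   0.000s Sys:   0.634s =   48.259s
LINE_RE = re.compile(
    r"BENCHMARKs:\s*VC:\s*([\d.]+)s\s+VO:\s*([\d.]+)s\s+"
    r"A:\s*([\d.]+)s\s+Sys:\s*([\d.]+)s\s+=\s*([\d.]+)s"
)


def parse_benchmarks(text):
    """Return one dict per BENCHMARKs line found in text (times in seconds)."""
    rows = []
    for match in LINE_RE.finditer(text):
        vc, vo, audio, sys_time, total = map(float, match.groups())
        rows.append({"vc": vc, "vo": vo, "a": audio, "sys": sys_time, "total": total})
    return rows


N800_LOG = """
BENCHMARKs: VC: 47.556s VO: 0.069s A: 0.000s Sys: 0.634s = 48.259s
BENCHMARKs: VC: 48.413s VO: 0.071s A: 0.000s Sys: 0.618s = 49.101s
BENCHMARKs: VC: 48.561s VO: 0.073s A: 0.000s Sys: 0.593s = 49.228s
BENCHMARKs: VC: 48.731s VO: 0.072s A: 0.000s Sys: 0.624s = 49.427s
BENCHMARKs: VC: 49.398s VO: 0.072s A: 0.000s Sys: 0.633s = 50.102s
"""

n800_rows = parse_benchmarks(N800_LOG)
print("runs:", len(n800_rows))
print("mean VC: %.3fs" % mean(r["vc"] for r in n800_rows))
```

VC (video codec time) is the figure worth comparing between builds here, because with '-vo null -nosound' the VO and A columns are close to zero.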
-
Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml (http://www.gentoo.org/proj/en/base/embedded/cross-development.xml)
As for the other news. The optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for scaler, another option is to try making some use of C55x DSP core). Maybe I'll also try to do some optimizations for motion compensation code. Anyway, there are still lots of things that can be optimized
-
There are several flavours of Zaurus OS which all have different hard/soft float requirements. The default Sharp ROM (and also Cacko ROM) use hardfloat. The pdaXrom distribution for Zaurus uses softvfp. OZ (OpenZaurus) uses yet another variant of softfloat...
The latest builds of mplayer rc1 were mainly built for pdaXrom.
-
Here is a new progress update report. I have implemented an initial version of a JIT accelerated scaler for planar YUV420 -> packed YUV422 color format conversion. It already provides a very nice performance improvement for Nokia 770 in a new mplayer build for maemo: mplayer_1.0rc1-maemo.8 (https://garage.maemo.org/frs/?group_id=54)
I will try to get this code integrated into upstream ffmpeg library so that other ARM devices (such as PXA270?) could make use of it and have all the performance problems with scaling solved. Here is a link with some more information, it also includes benchmark results (using the same Doom video clip): http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/051209.html (http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/051209.html)
-
Serge,
I'll build your comparison benchmarks for the PXA255 (and SA1110 if it's of interest) once I've got over some minor (I hope) OE build issues.
Si
-
Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you.
I just haven't had enough time for tests (exams...).
Default compiler options ( -O4 -pipe -ffast-math -fomit-frame-pointer ):
BENCHMARKs: VC: 52.561s VO: 0.065s A: 0.000s Sys: 0.793s = 53.419s
BENCHMARKs: VC: 56.284s VO: 0.066s A: 0.000s Sys: 0.795s = 57.145s
BENCHMARKs: VC: 56.476s VO: 0.065s A: 0.000s Sys: 0.797s = 57.339s
BENCHMARKs: VC: 56.319s VO: 0.065s A: 0.000s Sys: 0.796s = 57.180s
BENCHMARKs: VC: 56.434s VO: 0.065s A: 0.000s Sys: 0.799s = 57.290s
-O2 -pipe -march=iwmmxt -mcpu=iwmmxt -mtune=iwmmxt -msoft-float:
BENCHMARKs: VC: 53.703s VO: 0.066s A: 0.000s Sys: 0.915s = 54.685s
BENCHMARKs: VC: 56.455s VO: 0.066s A: 0.000s Sys: 0.803s = 57.324s
BENCHMARKs: VC: 56.513s VO: 0.066s A: 0.000s Sys: 0.799s = 57.377s
BENCHMARKs: VC: 56.458s VO: 0.065s A: 0.000s Sys: 0.798s = 57.322s
BENCHMARKs: VC: 56.456s VO: 0.065s A: 0.000s Sys: 0.800s = 57.321s
P.S. mplayer was compiled without iwmmxt support. The system is running at 416MHz (PXA270). Kernel 2.6.19.2, system compiled with eabi and with -march, -mtune and -mcpu=iwmmxt. GCC 4.1.1, Glibc 2.5 (Gentoo 2006.1). If anyone is interested (http://www.online-translator.com/url/tran_url.asp?lang=en&url=http%3A%2F%2Fwww.hpc.ru%2Fboard%2Fviewtopic.php%3Ft%3D112652%26postdays%3D0%26postorder%3Dasc%26start%3D0&direction=re&template=General&cp1=NO&cp2=NO&autotranslate=on&transliterate=on&psubmit2.x=75&psubmit2.y=14) (I don't know why Mesk doesn't want to post about his progress with gentoo for zaurus here...)
P.S.S. Tested with: mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx
P.S.S.S. Later I'll add benchmarks with other CFLAGS. It takes a lot of time to recompile mplayer on zaurus...
-
Thanks for running these tests. They show that the results for -O3 (-O4) are pretty much the same as for -O2. It would be interesting to compare them against -Os, as this option is most commonly used on embedded devices.
By the way, why was iwmmxt not used? It should provide quite a noticeable improvement, at least theoretically.
Thanks, I'm anticipating more test results. Compiler optimization options are unlikely to provide a big improvement, but every little bit helps.
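To put a number on "pretty much the same", the VC times from the two PXA270 runs posted above can be averaged and compared. This is just a sketch using the five values per build from that post; the variable names are invented for the example.

```python
from statistics import mean

# VC times in seconds, copied from the PXA270 benchmark post above
vc_default_o4 = [52.561, 56.284, 56.476, 56.319, 56.434]   # -O4 -pipe -ffast-math ...
vc_o2_iwmmxt_tune = [53.703, 56.455, 56.513, 56.458, 56.456]  # -O2 ... -mtune=iwmmxt


def relative_difference(baseline, candidate):
    """Positive result means the candidate build is slower than the baseline."""
    base = mean(baseline)
    return (mean(candidate) - base) / base


diff = relative_difference(vc_default_o4, vc_o2_iwmmxt_tune)
print("mean -O4: %.3fs" % mean(vc_default_o4))
print("mean -O2: %.3fs" % mean(vc_o2_iwmmxt_tune))
print("difference: %+.2f%%" % (diff * 100))
```

With only five loops per build and a first loop that differs visibly from the rest, a difference well under one percent is probably within run-to-run noise.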
-
apologies for straying off topic- I'm actually interested in the mplayer work.
BUT--I followed the gentoo link in the last post. If progress is being made, it certainly deserves some attention. A mainstream distro like gentoo that compiles and runs on a Z (well optimized, etc) has been a sort of holy grail for quite a few zaurus users. By all means encourage the people working on the project to post here.
[div align=\"right\"][a href=\"index.php?act=findpost&pid=152749\"][{POST_SNAPBACK}][/a][/div]
Wouldn't it be better to create a new topic for discussing gentoo on zaurus? Otherwise we risk to turn this topic into a mess.
-
I'm not discussing it... And I'm not a developer, so I think the author (Mesk) must post about it. I've just posted basic info so that people know about the system I'm running now.
-
Some more mplayer related news: the mplayer port for maemo (https://garage.maemo.org/projects/mplayer/) should now be more or less usable on Nokia N800 (http://en.wikipedia.org/wiki/N800) (video freeze issues fixed by using video output code with direct framebuffer access, just like on Nokia 770). Once the adaptation to this new device is finished, code optimization activity will be resumed.
-
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code. I have done this for the results below.
Here are my benchmark results on a standard SL-C3200, not overclocked, running OpenZaurus:
BENCHMARKs: VC: 44.056s VO: 0.078s A: 0.000s Sys: 0.831s = 44.965s
BENCHMARK%: VC: 97.9787% VO: 0.1734% A: 0.0000% Sys: 1.8479% = 100.0000%
BENCHMARKs: VC: 43.234s VO: 0.079s A: 0.000s Sys: 0.816s = 44.128s
BENCHMARK%: VC: 97.9734% VO: 0.1785% A: 0.0000% Sys: 1.8481% = 100.0000%
BENCHMARKs: VC: 43.487s VO: 0.076s A: 0.000s Sys: 0.813s = 44.376s
BENCHMARK%: VC: 97.9957% VO: 0.1715% A: 0.0000% Sys: 1.8328% = 100.0000%
BENCHMARKs: VC: 43.669s VO: 0.076s A: 0.000s Sys: 0.820s = 44.565s
BENCHMARK%: VC: 97.9891% VO: 0.1712% A: 0.0000% Sys: 1.8398% = 100.0000%
BENCHMARKs: VC: 43.497s VO: 0.078s A: 0.000s Sys: 0.810s = 44.386s
BENCHMARK%: VC: 97.9976% VO: 0.1764% A: 0.0000% Sys: 1.8260% = 100.0000%
Tim
-
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.
Do you really need to change the code to use iwmmxt? Isn't it a simple matter of properly running configure?
Did you try using something similar to what I suggested in this thread before (https://www.oesf.org/forums/index.php?showtopic=22280&view=findpost&p=149264)?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make
-
Yes, you really do - the code gets compiled, but not used, as the code is only installed following a test like this:
if( mm_flags & MM_IWMMXT ) -> install dsp code.
It fills in mm_flags with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting this going. I wonder if this is related to the lavdopts somehow?
That's why the others only saw a 2% improvement (compiling with the better tune options), and I see a 30% or so improvement.
Tim
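The effect described in this post (optimized code that is compiled in but never installed) can be modelled in a few lines. This is a Python illustration of the dispatch pattern only, not ffmpeg code: the bit value chosen for MM_IWMMXT, the table layout and the function names are assumptions made for the example.

```python
MM_IWMMXT = 0x0100  # illustrative bit value, assumed for this sketch


def unquantize_c(block):
    return "generic C path"


def unquantize_iwmmxt(block):
    return "iwmmxt path"


def init_dsp(mm_flags):
    """Build a function table, installing the fast path only if the flag is set."""
    table = {"dct_unquantize_h263_intra": unquantize_c}
    if mm_flags & MM_IWMMXT:
        table["dct_unquantize_h263_intra"] = unquantize_iwmmxt
    return table


# Detection that always reports 0 leaves the generic code in place,
# even though the optimized function exists in the binary.
print(init_dsp(0)["dct_unquantize_h263_intra"](None))
print(init_dsp(MM_IWMMXT)["dct_unquantize_h263_intra"](None))
```

This is why changing compiler flags alone made almost no difference: the selection happens at run time and depends entirely on what ends up in mm_flags.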
-
if you pull latest source from svn, you can just use --enable-iwmmxt
-
Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instructions support for MPlayer configure, I could not verify that IWMMXT works as well (for an obvious reason, I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html (http://lists.mplayerhq.hu/pipermail/mplayer-dev-eng/2006-October/046537.html)
Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems with enabling iwmmxt, please try to make a clean fix and submit this patch upstream. If you check the first post in this thread, you will see that upstream developers are not very familiar with ARM platform. Only atty did some improvements for MPlayer at some time in the past, but he is unwilling to help upstream to integrate his fixes for whatever reason. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job. And upstream developers are not obliged to fix our problems.
PS. I'm sorry if it was me who created a false impression of IWMMXT being fully supported in MPlayer 1.0rc1.
edit: IWMMXT has some additional registers, so their save/restore on context switches probably has to be supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even though the cpu itself may support it? Anyway that was just a wild guess, I'm not familiar with XScale at all.
And thanks for actually digging into the code and checking if iwmmxt really works, the results posted in this thread were suspicious from the very start
-
I already did this stuff yesterday, before I saw your messages. Yes Meanie, even the latest SVN does not fix matters. I posted a patch to the ffmpeg dev mailing list, got some feedback and posted another patch. I am awaiting the response.
Yes, IWMMXT needs OS support, as well as having the right processor. Unfortunately I (and others) can not find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase. So either the version which was in use was pretty good (be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output), or the system is memory bound as others have suggested. It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache).
Here are the new results:
BENCHMARKs: VC: 43.497s
BENCHMARKs: VC: 42.813s
BENCHMARKs: VC: 43.040s
BENCHMARKs: VC: 43.269s
BENCHMARKs: VC: 43.090s
Thanks,
Tim
-
Yes, IWMMXT needs OS support, as well as having the right processor. Unfortunately I (and others) can not find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try to compile MPlayer with the use of Intel IPP (Integrated Performance Primitives) library and check if it helps to improve performance.
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.
This should not be a problem as dct_unquantize_h263_inter is not a performance critical function. But it is pretty much similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with the statistics about decoding this Doom video clip on Nokia 770: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html (http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/050363.html)
So either the version which was in use was pretty good
It was just not performance critical, I wonder why you even managed to see some improvement
(be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output),
Actually I find compiler generated code for ARM quite poorly optimized. The compiler can't make good use of conditionally executed instructions, can't use DSP instructions, and can't schedule code in an optimal way to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck in order to gain any visible overall performance improvement.
I prefer to always develop some simple performance and correctness tests for the performance critical functions I'm optimizing. So I can ensure that they really provide performance improvement and do not introduce stability issues.
Random assembly hacking is not a productive way of working for sure.
or the system is memory bound as others have suggested.
This particular function is run on fully cached data, so memory access time is not important here. I investigated mplayer memory access pattern using valgrind (callgrind tool) getting more or less precise information about cache misses.
Code that heavily depends on memory performance is in motion compensation functions and partially idct (cache write misses for destination buffer).
It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)
Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.
PS. In order to ensure that video is decoded not only fast, but also right, you can use '-vo md5' option. I noticed some really ugly video decoding artefacts when using standard ARM optimized IDCT (some vertical stripes on panning scenes), ARMv5TE optimized IDCT is identical to C implementation.
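The idea behind checking decoder output by checksum can be sketched as follows. This is not how MPlayer implements its md5 video output; it is only a generic illustration, assuming the decoded frames of a reference build and a candidate build are available as byte strings. All names here are hypothetical.

```python
import hashlib


def frame_digest(frame_bytes):
    """MD5 hex digest of one decoded frame."""
    return hashlib.md5(frame_bytes).hexdigest()


def first_mismatch(reference_frames, candidate_frames):
    """Index of the first frame whose digest differs, or None if all match.

    Note: zip() stops at the shorter sequence, so a differing frame count
    has to be checked separately.
    """
    for index, (ref, cand) in enumerate(zip(reference_frames, candidate_frames)):
        if frame_digest(ref) != frame_digest(cand):
            return index
    return None


reference = [b"frame0", b"frame1", b"frame2"]
bit_exact = [b"frame0", b"frame1", b"frame2"]
with_artifacts = [b"frame0", b"frameX", b"frame2"]

print(first_mismatch(reference, bit_exact))
print(first_mismatch(reference, with_artifacts))
```

A bit-exact IDCT gives identical digests for every frame, while an IDCT with rounding differences shows up at the first affected frame.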
-
Yes, IWMMXT needs OS support, as well as having the right processor. Unfortunately I (and others) can not find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try to compile MPlayer with the use of Intel IPP (Integrated Performance Primitives) library and check if it helps to improve performance.
I think it does, as I know the cacko mplayer-atty is faster again than "mine", and that uses the IPP stuff for idct. I was not really interested in trying it though, due to the license restrictions of IPP.
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.
This should not be a problem as dct_unquantize_h263_inter is not a performance critical function. But it is pretty much similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with the statistics about decoding this Doom video clip on Nokia 770:
One thing I'm going to do is compare the iwmmxt code against your armv5te code, performance wise.
Cheers,
Tim
-
actually, i think your new build is much faster than atty's in decoding speed.
here are the benchmark results of running atty's iwmmxt optimized build of mplayer on C3000 with pdaXrom
BENCHMARKs: VC: 40.385s VO: 0.068s A: 0.000s Sys: 0.863s = 41.315s
BENCHMARKs: VC: 47.495s VO: 0.067s A: 0.000s Sys: 0.860s = 48.421s
BENCHMARKs: VC: 45.600s VO: 0.067s A: 0.000s Sys: 0.843s = 46.509s
BENCHMARKs: VC: 45.629s VO: 0.068s A: 0.000s Sys: 0.865s = 46.562s
BENCHMARKs: VC: 45.820s VO: 0.068s A: 0.000s Sys: 0.859s = 46.748s
for comparison, here are the benchmark results of the SVN mplayer code with armv5te enabled and xscale tuning CC flags
BENCHMARKs: VC: 52.105s VO: 0.026s A: 0.000s Sys: 1.047s = 53.178s
BENCHMARKs: VC: 53.503s VO: 0.027s A: 0.000s Sys: 0.923s = 54.453s
BENCHMARKs: VC: 54.030s VO: 0.027s A: 0.000s Sys: 0.914s = 54.970s
BENCHMARKs: VC: 53.926s VO: 0.027s A: 0.000s Sys: 0.931s = 54.883s
BENCHMARKs: VC: 53.267s VO: 0.034s A: 0.000s Sys: 0.927s = 54.228s
-
On cacko on c1000, I see:
VC: 36.186
VC: 36.927
VC: 37.662
VC: 36.932
VC: 37.016
And similar figures for sys. Cacko uses atty's mplayer, which still seems to be the best by quite a margin!
At a guess this is due to IPP for IDCT.
Thanks,
Tim
-
You can try to override idct by using '-lavdopts idct=<some_number>' in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
-
Except that ARM has no immediate assignments and needs aligned data...
-
Except that ARM has no immediate assignments
MMX instruction set does not have immediate assignments either. In any case, that's not a big deal.
and needs aligned data...
FFmpeg takes special care of alignment; many functions have guaranteed alignment specified for the data they are processing (some SSE instructions require 16-byte alignment after all, so ARM is not the most strict in this respect). Input data for IDCT is also 16-byte aligned for example, that's more than enough for ARM.
Anyway, somebody just needs to give it a try. To encourage you more and prove that it might work, it looks like atty took the existing MMX implementation of dct_unquantize_h263_intra_mmx and converted it to dct_unquantize_h263_intra_iwmmxt. Probably he did not care about IDCT as he could just use IPP instead, so maybe doing a conversion from MMX to IWMMXT for IDCT is also possible with not so much work (everything is relative of course). I wonder what implementation would be faster? On one hand IPP is a library developed by professionals from Intel, on the other hand FFmpeg proved to be very well optimized, beating many other codecs on the x86 platform, and the default IDCT used in it is MMX optimized.
-
Right, o-hand ported the fbmmx layer in the xserver to iwmmx but it wasn't faster since you had to align the data by hand. Maybe ffmpeg can gain more.
-
You can try to override idct by using '-lavdopts idct=<some_number>' in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.
I did try it, and using the non-IPP IDCT produces results which are roughly comparable. atty's mplayer is still faster by 10% or so, so there are still a few more tweaks I need to sort out, but it was 40% better when using IPP.
Cheers,
Tim
-
Hi, I'm working on further optimizing ARMv5 IDCT for mplayer/ffmpeg. Older implementation from mplayer 1.0rc1 was only optimized for ARM9E cores. Now it should get noticeably faster on long pipeline cores such as XScale (Sharp Zaurus) and ARM11 (Nokia N800).
Can anybody compile and run the following test on XScale:
> svn checkout https://garage.maemo.org/svn/mplayer/trunk/libavcodec
> cd libavcodec/tests
> make test-idct
You may need to specify the name of your crosscompiler when running make (ex. 'CC="arm-softfloat-linux-gnueabi-gcc" make test-idct')
After that please copy the 'test-idct' binary to your device and run it, specifying the cpu clock frequency on the command line (for 416MHz Zaurus it would be './test-idct --freq=416')
For those who are curious, here are the results from running this test on Nokia 770:
> ./test-idct --freq=252
Assuming cpu clock frequency 252MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=886.0
simple_idct_put_armv5te cache=no, time=1062.2
simple_idct_put_armv5te cache=yes, time=1032.8
simple_idct_add_armv5te cache=no, time=1323.7
simple_idct_add_armv5te cache=yes, time=1186.2
simple_idct_armv5te_ref time=1041.8
simple_idct_put_armv5te_ref cache=no, time=1257.6
simple_idct_put_armv5te_ref cache=yes, time=1253.0
simple_idct_add_armv5te_ref cache=no, time=1561.9
simple_idct_add_armv5te_ref cache=yes, time=1445.6
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1423.4
simple_idct_put_armv5te cache=no, time=1665.7
simple_idct_put_armv5te cache=yes, time=1655.3
simple_idct_add_armv5te cache=no, time=1934.6
simple_idct_add_armv5te cache=yes, time=1783.8
simple_idct_armv5te_ref time=1698.6
simple_idct_put_armv5te_ref cache=no, time=1914.0
simple_idct_put_armv5te_ref cache=yes, time=1911.6
simple_idct_add_armv5te_ref cache=no, time=2221.2
simple_idct_add_armv5te_ref cache=yes, time=2098.9
Results for Nokia N800:
> ./test-idct --freq=330 --enable-armv6
Assuming cpu clock frequency 330MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=751.3
simple_idct_put_armv5te cache=no, time=947.7
simple_idct_put_armv5te cache=yes, time=866.9
simple_idct_add_armv5te cache=no, time=1099.2
simple_idct_add_armv5te cache=yes, time=937.6
simple_idct_armv5te_ref time=1084.5
simple_idct_put_armv5te_ref cache=no, time=1288.4
simple_idct_put_armv5te_ref cache=yes, time=1280.5
simple_idct_add_armv5te_ref cache=no, time=1538.2
simple_idct_add_armv5te_ref cache=yes, time=1397.9
simple_idct_armv6 time=762.4
simple_idct_put_armv6 cache=no, time=1034.9
simple_idct_put_armv6 cache=yes, time=765.4
simple_idct_add_armv6 cache=no, time=1063.2
simple_idct_add_armv6 cache=yes, time=903.2
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1220.0
simple_idct_put_armv5te cache=no, time=1413.3
simple_idct_put_armv5te cache=yes, time=1355.4
simple_idct_add_armv5te cache=no, time=1576.0
simple_idct_add_armv5te cache=yes, time=1417.2
simple_idct_armv5te_ref time=1872.0
simple_idct_put_armv5te_ref cache=no, time=2079.6
simple_idct_put_armv5te_ref cache=yes, time=2081.5
simple_idct_add_armv5te_ref cache=no, time=2342.7
simple_idct_add_armv5te_ref cache=yes, time=2190.1
simple_idct_armv6 time=1138.9
simple_idct_put_armv6 cache=no, time=1426.7
simple_idct_put_armv6 cache=yes, time=1144.8
simple_idct_add_armv6 cache=no, time=1444.1
simple_idct_add_armv6 cache=yes, time=1281.9
Test results from XScale are needed to check whether my assumptions are correct (I used the ARM9E, ARM11 and XScale manuals as references to write code that should work well on all these CPUs, but could only test it on Nokia 770 and N800). Theoretically, results from XScale should be very similar to the results from Nokia N800 (ARM11). Lower numbers are better (they are the time for running IDCT, in CPU cycles). Functions with the '_ref' suffix belong to the reference armv5te optimized idct implementation from mplayer 1.0rc1.
If anybody wants to build an optimized mplayer, you need to download this file (https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/armv4l/simple_idct_armv5te.S?root=mplayer&view=markup) and replace simple_idct_armv5te.S in your mplayer sources.
-
I'll see if I can give it a try.
How much is this likely to speed up MPlayer, or is that what you're trying to determine?
-
IDCT usually takes 20-40% of video decoding time. There will be no huge overall speedup, but the improvement should be quite noticeable (IDCT itself becomes up to 1.5x faster on ARM11). The goal is to reduce the performance gap to mplayer compiled with IPP (see tjchick's earlier post) and possibly beat it.
The best results can be achieved by using IWMMX instructions though. But some older cores do not support IWMMX (PXA255 for example) and a tweaked ARMv5TE IDCT would be handy there. Also, an IWMMX optimized IDCT still needs to be written, and this ARMv5TE IDCT can serve as a placeholder until then.
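As a rough check of the figures in this post, the overall gain can be estimated with Amdahl's law. This is only a sketch: it assumes the 20-40% IDCT share and the "up to 1.5x" IDCT speedup quoted above apply uniformly, which real clips will not match exactly. The function name is made up for illustration.

```python
def decode_time_saving(idct_share, idct_speedup):
    """Fraction of total decoding time saved when only the IDCT gets faster.

    idct_share   -- fraction of decoding time spent in IDCT (0.2 to 0.4 above)
    idct_speedup -- how many times faster the IDCT itself becomes (assumed 1.5)
    """
    return idct_share * (1.0 - 1.0 / idct_speedup)


# Best-case bounds for the IDCT shares mentioned in the post.
for share in (0.20, 0.40):
    saving = decode_time_saving(share, 1.5)
    print(f"IDCT share {share:.0%}: decoding time drops by {saving:.1%}")
```

Since 1.5x is an upper bound for the IDCT itself, the real overall gain should land below these numbers.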
-
pxa270, 416MHz (Zaurus C3100), Gentoo 2007.0, eabi.
Assuming cpu clock frequency 416MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=751.9
simple_idct_put_armv5te cache=no, time=1988.0
simple_idct_put_armv5te cache=yes, time=860.2
simple_idct_add_armv5te cache=no, time=1136.2
simple_idct_add_armv5te cache=yes, time=923.1
simple_idct_armv5te_ref time=1131.8
simple_idct_put_armv5te_ref cache=no, time=1297.1
simple_idct_put_armv5te_ref cache=yes, time=1281.0
simple_idct_add_armv5te_ref cache=no, time=1625.5
simple_idct_add_armv5te_ref cache=yes, time=1385.5
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1168.7
simple_idct_put_armv5te cache=no, time=2281.7
simple_idct_put_armv5te cache=yes, time=1277.0
simple_idct_add_armv5te cache=no, time=1595.2
simple_idct_add_armv5te cache=yes, time=1340.3
simple_idct_armv5te_ref time=1821.7
simple_idct_put_armv5te_ref cache=no, time=1988.0
simple_idct_put_armv5te_ref cache=yes, time=1981.6
simple_idct_add_armv5te_ref cache=no, time=2326.5
simple_idct_add_armv5te_ref cache=yes, time=2084.4
-
Thanks for running this test. Almost everything is just as I expected: the XScale pipeline is really very similar to ARM11. The number crunching part of IDCT is now ~1.5x faster ('simple_idct_armv5te' vs. 'simple_idct_armv5te_ref'). Also, everything is very fast if we don't take memory performance into account and all the memory accesses hit the cache.
But generally we are interested in the performance of the functions 'simple_idct_put_armv5te' and 'simple_idct_add_armv5te' when the results get stored into memory and that memory region is not in the cache. Everything is fine with 'simple_idct_add_armv5te', which really got quite a lot faster. But there seems to be an unexpected problem with 'simple_idct_put_armv5te'. Probably the write buffer (temporary storage in the CPU for memory writes that bypass the cache) overflows and the XScale pipeline stalls, resulting in very bad performance. When 'simple_idct_put_armv5te' stores results into a memory region which is in the cache, it works very fast. I'll try to tweak the code a bit and will ask you to rerun this test a bit later.
Thanks again for running the test. If we had not checked this code on XScale before its submission to ffmpeg, performance on XScale would not have been very good (I don't know how it would have affected overall results, as 'simple_idct_add_armv5te' would speed up and 'simple_idct_put_armv5te' would slow down).
Anyway, after the code gets fixed for XScale, I think we can expect something like a 5-10% overall video decoding improvement on it (depending on the video file).
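The ratios discussed in this reply can be recomputed from the PXA270 numbers posted above. The sketch below uses only the "random idct coefficients" cycle counts copied from that post; the dictionary keys are shortened labels chosen here for readability, not names from the test program.

```python
# Cycle counts copied from the PXA270 (Zaurus C3100) results, random coefficients.
new_cycles = {
    "simple_idct": 1168.7,
    "simple_idct_put, cache=no": 2281.7,
    "simple_idct_add, cache=no": 1595.2,
}
ref_cycles = {
    "simple_idct": 1821.7,
    "simple_idct_put, cache=no": 1988.0,
    "simple_idct_add, cache=no": 2326.5,
}


def speedup(ref, new):
    """How many times faster the new code is than the reference (>1 is better)."""
    return ref / new


speedups = {name: speedup(ref_cycles[name], new_cycles[name]) for name in new_cycles}
for name, ratio in speedups.items():
    verdict = "faster" if ratio > 1.0 else "slower"
    print(f"{name}: {ratio:.2f}x ({verdict} than the reference)")
```

The 'put' variant with cold cache comes out below 1.0x, which is the regression described above, while the pure IDCT and the 'add' variant are clearly faster.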
-
I'm sorry for the long delay in answering. Could you try to run this idct test on XScale again? I believe that the performance regression for 'simple_idct_put_armv5te' should be fixed now.
-
Any improvement at all is very much welcomed - I hope that these optimisations will make it into Angstrom as soon as proven and stable!
-
Well, I'm maintaining the mplayer package for maemo and already have some good stuff which I would like to contribute to ffmpeg. I'm only posting test code samples here to ensure that my submissions will not cause any regressions on XScale and will not hurt you. This code has already proven very useful on Nokia internet tablets, and most likely will be good for Zaurus too. But nobody knows for sure, so it is better to test everything (as the test done by Civil proved). Having an XScale device for testing would be useful for fine-tuning the code for better performance and probably even for trying IWMMX optimizations, but I'm not sure if I want to spend 400-500 euro on just one more toy. Maybe if somebody could lend me an XScale powered Linux PDA for a few weekends, everything would be much easier and faster.
By the way, here are the latest synthetic benchmarks of the ARMv5TE optimized IDCT (SVN revision 249) on Nokia N800, as its ARM11 cpu is similar to XScale:
$ ./test-idct --freq=330
Assuming cpu clock frequency 330MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te time=685.8
simple_idct_put_armv5te cache=no, time=780.4
simple_idct_put_armv5te cache=yes, time=770.0
simple_idct_add_armv5te cache=no, time=984.9
simple_idct_add_armv5te cache=yes, time=853.3
simple_idct_add_pf_pld_armv5te cache=no, time=940.9
simple_idct_add_pf_pld_armv5te cache=yes, time=863.1
simple_idct_add_pf_ldr_armv5te cache=no, time=958.3
simple_idct_add_pf_ldr_armv5te cache=yes, time=862.5
simple_idct_armv5te_ref time=1088.1
simple_idct_put_armv5te_ref cache=no, time=1286.2
simple_idct_put_armv5te_ref cache=yes, time=1282.9
simple_idct_add_armv5te_ref cache=no, time=1518.2
simple_idct_add_armv5te_ref cache=yes, time=1393.9
--- benchmarking with random idct coefficients ---
simple_idct_armv5te time=1147.0
simple_idct_put_armv5te cache=no, time=1240.9
simple_idct_put_armv5te cache=yes, time=1233.8
simple_idct_add_armv5te cache=no, time=1467.0
simple_idct_add_armv5te cache=yes, time=1317.2
simple_idct_add_pf_pld_armv5te cache=no, time=1403.5
simple_idct_add_pf_pld_armv5te cache=yes, time=1366.2
simple_idct_add_pf_ldr_armv5te cache=no, time=1438.8
simple_idct_add_pf_ldr_armv5te cache=yes, time=1341.3
simple_idct_armv5te_ref time=1872.6
simple_idct_put_armv5te_ref cache=no, time=2065.1
simple_idct_put_armv5te_ref cache=yes, time=2064.9
simple_idct_add_armv5te_ref cache=no, time=2308.4
simple_idct_add_armv5te_ref cache=yes, time=2179.2
Also here is a more realistic test with the matrixbench_normdivx_vbrmp3.avi video clip from http://samples.mplayerhq.hu/benchmark/testsuite1/
Benchmark with current IDCT:
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 135.127s VO: 0.163s A: 0.000s Sys: 1.387s = 136.677s
BENCHMARKs: VC: 132.337s VO: 0.153s A: 0.000s Sys: 1.382s = 133.872s
BENCHMARKs: VC: 133.986s VO: 0.148s A: 0.000s Sys: 1.351s = 135.485s
BENCHMARKs: VC: 134.576s VO: 0.174s A: 0.000s Sys: 1.351s = 136.102s
BENCHMARKs: VC: 132.979s VO: 0.161s A: 0.000s Sys: 1.387s = 134.527s
BENCHMARKs: VC: 132.987s VO: 0.145s A: 0.000s Sys: 1.408s = 134.539s
BENCHMARKs: VC: 132.945s VO: 0.150s A: 0.000s Sys: 1.394s = 134.489s
BENCHMARKs: VC: 132.248s VO: 0.152s A: 0.000s Sys: 1.353s = 133.753s
BENCHMARKs: VC: 131.673s VO: 0.152s A: 0.000s Sys: 1.366s = 133.191s
BENCHMARKs: VC: 132.138s VO: 0.149s A: 0.000s Sys: 1.370s = 133.656s
BENCHMARKs: VC: 132.536s VO: 0.144s A: 0.000s Sys: 1.364s = 134.044s
BENCHMARKs: VC: 132.332s VO: 0.148s A: 0.000s Sys: 1.329s = 133.810s
Benchmark with the new optimized IDCT (after replacing 'simple_idct_armv5te.S' and recompiling mplayer):
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s = 124.120s
BENCHMARKs: VC: 120.901s VO: 0.152s A: 0.000s Sys: 1.371s = 122.424s
BENCHMARKs: VC: 122.490s VO: 0.147s A: 0.000s Sys: 1.338s = 123.975s
BENCHMARKs: VC: 124.826s VO: 0.151s A: 0.000s Sys: 1.325s = 126.302s
BENCHMARKs: VC: 123.052s VO: 0.143s A: 0.000s Sys: 1.393s = 124.588s
BENCHMARKs: VC: 121.897s VO: 0.146s A: 0.000s Sys: 1.366s = 123.409s
BENCHMARKs: VC: 122.406s VO: 0.139s A: 0.000s Sys: 1.359s = 123.903s
BENCHMARKs: VC: 123.448s VO: 0.150s A: 0.000s Sys: 1.381s = 124.979s
BENCHMARKs: VC: 119.141s VO: 0.143s A: 0.000s Sys: 1.360s = 120.644s
BENCHMARKs: VC: 120.555s VO: 0.147s A: 0.000s Sys: 1.340s = 122.042s
BENCHMARKs: VC: 120.686s VO: 0.141s A: 0.000s Sys: 1.377s = 122.203s
BENCHMARKs: VC: 120.902s VO: 0.143s A: 0.000s Sys: 1.358s = 122.402s
It really confirms a video decoding speedup in the 5-10% range, as estimated earlier. It will be interesting to see how it works on XScale. It would also be very interesting to compare the performance of this IDCT implementation with the one from IPP, to check which one is faster now and by how much.
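For anyone who wants to compare runs without eyeballing the logs, here is a small sketch that averages the VC (video codec) times from the two N800 runs above. The log text is copied from this post; the helper name and the choice of the mean over all 12 loops are my own assumptions, not something mplayer provides.

```python
import re
from statistics import mean

# BENCHMARKs lines copied verbatim from the two N800 runs above.
CURRENT_IDCT = """
BENCHMARKs: VC: 135.127s VO: 0.163s A: 0.000s Sys: 1.387s = 136.677s
BENCHMARKs: VC: 132.337s VO: 0.153s A: 0.000s Sys: 1.382s = 133.872s
BENCHMARKs: VC: 133.986s VO: 0.148s A: 0.000s Sys: 1.351s = 135.485s
BENCHMARKs: VC: 134.576s VO: 0.174s A: 0.000s Sys: 1.351s = 136.102s
BENCHMARKs: VC: 132.979s VO: 0.161s A: 0.000s Sys: 1.387s = 134.527s
BENCHMARKs: VC: 132.987s VO: 0.145s A: 0.000s Sys: 1.408s = 134.539s
BENCHMARKs: VC: 132.945s VO: 0.150s A: 0.000s Sys: 1.394s = 134.489s
BENCHMARKs: VC: 132.248s VO: 0.152s A: 0.000s Sys: 1.353s = 133.753s
BENCHMARKs: VC: 131.673s VO: 0.152s A: 0.000s Sys: 1.366s = 133.191s
BENCHMARKs: VC: 132.138s VO: 0.149s A: 0.000s Sys: 1.370s = 133.656s
BENCHMARKs: VC: 132.536s VO: 0.144s A: 0.000s Sys: 1.364s = 134.044s
BENCHMARKs: VC: 132.332s VO: 0.148s A: 0.000s Sys: 1.329s = 133.810s
"""
NEW_IDCT = """
BENCHMARKs: VC: 122.543s VO: 0.162s A: 0.000s Sys: 1.416s = 124.120s
BENCHMARKs: VC: 120.901s VO: 0.152s A: 0.000s Sys: 1.371s = 122.424s
BENCHMARKs: VC: 122.490s VO: 0.147s A: 0.000s Sys: 1.338s = 123.975s
BENCHMARKs: VC: 124.826s VO: 0.151s A: 0.000s Sys: 1.325s = 126.302s
BENCHMARKs: VC: 123.052s VO: 0.143s A: 0.000s Sys: 1.393s = 124.588s
BENCHMARKs: VC: 121.897s VO: 0.146s A: 0.000s Sys: 1.366s = 123.409s
BENCHMARKs: VC: 122.406s VO: 0.139s A: 0.000s Sys: 1.359s = 123.903s
BENCHMARKs: VC: 123.448s VO: 0.150s A: 0.000s Sys: 1.381s = 124.979s
BENCHMARKs: VC: 119.141s VO: 0.143s A: 0.000s Sys: 1.360s = 120.644s
BENCHMARKs: VC: 120.555s VO: 0.147s A: 0.000s Sys: 1.340s = 122.042s
BENCHMARKs: VC: 120.686s VO: 0.141s A: 0.000s Sys: 1.377s = 122.203s
BENCHMARKs: VC: 120.902s VO: 0.143s A: 0.000s Sys: 1.358s = 122.402s
"""


def vc_times(log):
    """Return the VC time in seconds from every BENCHMARKs line in the log."""
    return [float(t) for t in re.findall(r"BENCHMARKs: VC:\s*([\d.]+)s", log)]


before = mean(vc_times(CURRENT_IDCT))
after = mean(vc_times(NEW_IDCT))
reduction = (before - after) / before
print(f"mean VC before: {before:.1f}s, after: {after:.1f}s, reduction: {reduction:.1%}")
```

On these logs the mean VC time drops from about 133.0 s to about 121.9 s, which is roughly an 8% reduction.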
-
A Zaurus C3200 (PXA27x)
Before new idct
mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 209.368s VO: 0.168s A: 0.000s Sys: 3.011s = 212.547s
BENCHMARKs: VC: 213.062s VO: 0.170s A: 0.000s Sys: 3.022s = 216.253s
BENCHMARKs: VC: 214.726s VO: 0.169s A: 0.000s Sys: 3.039s = 217.935s
BENCHMARKs: VC: 214.936s VO: 0.170s A: 0.000s Sys: 2.674s = 217.780s
BENCHMARKs: VC: 215.113s VO: 0.170s A: 0.000s Sys: 3.182s = 218.464s
BENCHMARKs: VC: 215.065s VO: 0.170s A: 0.000s Sys: 2.618s = 217.853s
BENCHMARKs: VC: 215.700s VO: 0.170s A: 0.000s Sys: 2.611s = 218.482s
BENCHMARKs: VC: 215.293s VO: 0.170s A: 0.000s Sys: 2.606s = 218.069s
BENCHMARKs: VC: 215.575s VO: 0.170s A: 0.000s Sys: 2.621s = 218.366s
BENCHMARKs: VC: 215.655s VO: 0.169s A: 0.000s Sys: 2.608s = 218.433s
BENCHMARKs: VC: 215.323s VO: 0.170s A: 0.000s Sys: 2.614s = 218.107s
BENCHMARKs: VC: 215.373s VO: 0.170s A: 0.000s Sys: 2.610s = 218.153s
After new idct
mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 203.236s VO: 0.169s A: 0.000s Sys: 2.651s = 206.056s
BENCHMARKs: VC: 207.844s VO: 0.170s A: 0.000s Sys: 2.641s = 210.654s
BENCHMARKs: VC: 207.917s VO: 0.171s A: 0.000s Sys: 2.633s = 210.722s
BENCHMARKs: VC: 207.760s VO: 0.170s A: 0.000s Sys: 2.634s = 210.564s
BENCHMARKs: VC: 207.879s VO: 0.172s A: 0.000s Sys: 2.617s = 210.668s
BENCHMARKs: VC: 207.367s VO: 0.170s A: 0.000s Sys: 2.635s = 210.172s
BENCHMARKs: VC: 208.025s VO: 0.170s A: 0.000s Sys: 2.629s = 210.824s
BENCHMARKs: VC: 207.421s VO: 0.170s A: 0.000s Sys: 2.623s = 210.213s
BENCHMARKs: VC: 207.879s VO: 0.170s A: 0.000s Sys: 2.618s = 210.667s
BENCHMARKs: VC: 207.960s VO: 0.171s A: 0.000s Sys: 2.635s = 210.765s
BENCHMARKs: VC: 207.909s VO: 0.170s A: 0.000s Sys: 2.628s = 210.707s
BENCHMARKs: VC: 207.877s VO: 0.170s A: 0.000s Sys: 2.627s = 210.675s
-
OK, thanks, so at least this IDCT optimization is useful on Zaurus too. I'll try to submit it upstream soon, so that we all have it in mplayer 1.0rc2 whenever it gets released.
But video performance on Zaurus looks quite bad according to this benchmark, hence the significantly lower relative effect of the IDCT optimization. The poor performance is partially caused by IWMMXT optimizations not getting enabled in the default mplayer 1.0rc1 sources because of a bug. Also, earlier in this thread we got benchmarks from atty's build of mplayer, and it had much better performance. A large part of this improvement was thought to come from the use of IPP. But IPP only provides IDCT acceleration, and IDCT looks to be quite fast already (if a 1.5x IDCT performance improvement results in 7-8 seconds of difference, the whole IDCT probably takes no more than 30 seconds of the total decoding time). Even if IPP magically reduced the IDCT overhead to zero, there would still be too much time wasted somewhere else. Maybe it is still a good idea to try to find the source of this performance bottleneck and fix it once and for all (submitting all the relevant patches to upstream mplayer/ffmpeg)?
There was an idea about slow memory causing performance problems. But memory performance (both bandwidth and latency) can be easily benchmarked.
Also, could I/O performance (reading from flash memory or HDD) affect video decoding time so much on Zaurus? In this case, putting the video clip on a ramdisk should eliminate this factor.
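The "no more than 30 seconds" estimate for IDCT in this post can be written out explicitly. This sketch assumes, as the post does, that the whole IDCT got a uniform 1.5x speedup and that the 7-8 second difference comes from IDCT alone; the function name is hypothetical.

```python
def idct_time_before(seconds_saved, idct_speedup):
    """IDCT time before the optimization, given the time it saved.

    If the old IDCT took T seconds, the new one takes T / idct_speedup,
    so seconds_saved = T * (1 - 1 / idct_speedup).
    """
    return seconds_saved / (1.0 - 1.0 / idct_speedup)


# The 7-8 second difference observed between the two Zaurus runs.
for saved in (7.0, 8.0):
    old = idct_time_before(saved, 1.5)
    print(f"{saved:.0f} s saved -> IDCT took about {old:.0f} s before, "
          f"{old - saved:.0f} s after")
```

Under these assumptions IDCT accounts for a little over 20 seconds of the ~215 second decode, so most of the time has to be going somewhere else.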
-
Could there be other factors affecting memory access: ensuring the right word size is used and not just aligning on the correct byte boundary, or reading the data in the right size chunks to make the best use of any ARM caching and pre-fetch logic in the CPU?
The Z has quite poor CF speed as (so I understand) it shares bus cycles with the main memory; the SD slot is faster because it's on a separate bus off the CPU.
-
Today's SVN mplayer with rev 257 of the IDCT code produces these benchmarks.
I'm thinking this is an improvement.
Machine is an SL-C3200 Zaurus
mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 186.320s VO: 0.064s A: 0.000s Sys: 2.718s = 189.103s
BENCHMARKs: VC: 188.632s VO: 0.065s A: 0.000s Sys: 3.130s = 191.827s
BENCHMARKs: VC: 188.897s VO: 0.065s A: 0.000s Sys: 2.742s = 191.704s
BENCHMARKs: VC: 189.111s VO: 0.065s A: 0.000s Sys: 2.710s = 191.886s
BENCHMARKs: VC: 188.934s VO: 0.065s A: 0.000s Sys: 2.699s = 191.698s
BENCHMARKs: VC: 189.177s VO: 0.064s A: 0.000s Sys: 2.727s = 191.968s
BENCHMARKs: VC: 188.932s VO: 0.064s A: 0.000s Sys: 2.725s = 191.721s
BENCHMARKs: VC: 189.237s VO: 0.064s A: 0.000s Sys: 2.705s = 192.007s
BENCHMARKs: VC: 188.937s VO: 0.066s A: 0.000s Sys: 2.707s = 191.709s
BENCHMARKs: VC: 189.076s VO: 0.065s A: 0.000s Sys: 2.717s = 191.857s
BENCHMARKs: VC: 189.161s VO: 0.065s A: 0.000s Sys: 2.713s = 191.939s
BENCHMARKs: VC: 189.101s VO: 0.065s A: 0.000s Sys: 2.721s = 191.887s
-
Did some benchmarks today in different environments.
Same command line as above, same video clip, mplayer 1.0 rc2 24587-r5, built by XorA (from Angstrom iwmmxt feed) gave about 181 seconds minimal time.
atty's mplayer on Cacko ROM gave about 162 seconds (same command line, same clip).
TCPMP on an iPAQ 4700 (624MHz PXA270 CPU) gave "228%" in benchmark mode, which translates, I think, to 187.64/2.28=82.3 seconds.
Serge, perhaps TCPMP is worth looking at as well? As far as I know, it is open-source.
-
Took a look at the tcpmp sources this evening. The ffmpeg sources were not modified (except for PalmOS-specific hacks), so I believe the big speed difference is just because the hx4700 is quite a fast device, or mplayer does something terribly wrong (?).
Aside from this, nothing interesting in tcpmp sources. Just a collection of codecs from various sources and the glue code. Lots of custom assembly for fast blitting and scaling.
-
Hello zap,
Please also try testing atty's build without '-lavdopts idct=16' option (it forces armv5te optimized idct from ffmpeg, but atty's build should be able to use a more efficient iwmmxt optimized idct from IPP).
Anyway, as already mentioned in this thread, there is something wrong with mplayer running on Zaurus devices (or the devices with XScale core). For example, even Nokia 770 with 252MHz ARM9E cpu appears to be faster than Zaurus when playing this matrix video clip (time for decoding it is ~158 seconds). Though intuitively everything should be quite the opposite: Zaurus has a lot higher cpu clock frequency and supports iwmmxt SIMD instructions in addition to armv5te.
TCPMP might be an interesting option (for somebody else to try), but I'm satisfied with mplayer/ffmpeg on Nokia 770 and N800 at the moment. Translating mplayer performance on Nokia 770 into 'TCPMP percents', it would be something like 118%, and if we try to estimate how it would theoretically run at 624MHz, that would be ~290%. I know that this approximation is wrong as memory speed also matters a lot, but anyway, it looks like TCPMP and ffmpeg should provide at least comparable performance.
In order to get optimal mplayer performance on Zaurus, somebody just needs to profile it there (doing it with gprof is quite simple), find performance bottlenecks and try to fix them. I might have a look at what's wrong if I got XScale device to experiment with (I had plans to buy some motorola EZX phone, A1200 or E6, but these plans are on hold now).
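The conversions between 'TCPMP percents' and mplayer decode times used in this post and in zap's can be sketched as follows. It assumes, as zap's calculation does, that the percentage is the clip duration (187.64 s) divided by the decode time, and that performance scales linearly with CPU clock, which is admittedly wrong once memory speed matters. All function names are made up for illustration.

```python
CLIP_SECONDS = 187.64  # clip duration used in zap's calculation earlier in the thread


def percent_to_seconds(percent):
    """Decode time implied by a TCPMP benchmark percentage (assumed meaning)."""
    return CLIP_SECONDS / (percent / 100.0)


def seconds_to_percent(decode_seconds):
    """TCPMP-style percentage implied by an mplayer decode time."""
    return 100.0 * CLIP_SECONDS / decode_seconds


def scale_by_clock(percent, from_mhz, to_mhz):
    """Naive linear scaling with CPU clock; ignores memory speed entirely."""
    return percent * to_mhz / from_mhz


hx4700_seconds = percent_to_seconds(228.0)      # TCPMP result on the iPAQ
nokia770_percent = seconds_to_percent(158.0)    # ~158 s decode on Nokia 770
nokia770_at_624 = scale_by_clock(nokia770_percent, 252, 624)

print(f"iPAQ at 228%: about {hx4700_seconds:.1f} s")
print(f"Nokia 770 at 158 s: about {nokia770_percent:.0f}%")
print(f"Nokia 770 scaled to 624MHz: about {nokia770_at_624:.0f}%")
```

The results are close to the 82.3 seconds, 118% and ~290% figures quoted in the thread; the small differences come from rounding in the posts.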
-
Please also try testing atty's build without '-lavdopts idct=16' option (it forces armv5te optimized idct from ffmpeg, but atty's build should be able to use a more efficient iwmmxt optimized idct from IPP).
I ran it without this option since I thought atty's mplayer uses the right idct transform by default. By the way, Angstrom's mplayer also seems to use the best idct transform by default; at least I haven't noticed any difference when running mplayer with and without this option.
Anyway, as already mentioned in this thread, there is something wrong with mplayer running on Zaurus devices (or the devices with XScale core). For example, even Nokia 770 with 252MHz ARM9E cpu appears to be faster than Zaurus when playing this matrix video clip (time for decoding it is ~158 seconds). Though intuitively everything should be quite the opposite: Zaurus has a lot higher cpu clock frequency and supports iwmmxt SIMD instructions in addition to armv5te.
Tried the same clip on TCPMP on my old Dell Axim X5 (400MHz PXA255, 64MB RAM). It shows 131.68%, so indeed it looks like something is wrong on Zaurus, because my Dell Axim has a 100MHz bus and the C3100 has AFAIK a 143MHz bus (i.e. faster RAM).
In order to get optimal mplayer performance on Zaurus, somebody just needs to profile it there (doing it with gprof is quite simple), find performance bottlenecks and try to fix them. I might have a look at what's wrong if I got XScale device to experiment with (I had plans to buy some motorola EZX phone, A1200 or E6, but these plans are on hold now).
I'll try to do that when time permits.
-
Just a quick update from me, mostly of interest to the angstrom people...
You may remember I hacked mplayer/ffmpeg to actually use iwmmxt rather than just compiling them.
I got VC times of approx. 43 seconds for the doom clip running on angstrom.
Now I am using *the same binary* and get VC times of 37 seconds on the latest Angstrom test images, so something has changed, maybe cache support or iwmmxt support in the kernel. Anyhow, my results are now about the same as my tests on cacko with atty's mplayer.
If I use the default mplayer included in the angstrom iwmmxt feeds, I see VC of 52 seconds. I'm going to take a look, and try the svn version.
Cheers,
Tim
-
good news indeed, anything which improves the media performance on the zaurus is great!