Dec 5 2006, 02:43 PM
Post
#1
|
|
|
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724 |
It is probably a good idea to consolidate efforts and try to submit some of the useful ARM-related patches upstream:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ust/014460.html
http://lists.mplayerhq.hu/pipermail/mplaye...ber/046207.html
I can only test MPlayer on the Nokia 770, so I can't be sure whether optimizations specific to ARM9E (the core used in the Nokia 770) are also good for the Zaurus. So people who are able to compile MPlayer from source and test it on a Zaurus are welcome in this thread. One example is the new armv5te-optimized idct in MPlayer 1.0rc1: can anybody benchmark it on a Zaurus?
Also, this is not really ARM-related, but the libmad-based decoder in MPlayer seems to have trouble with variable bitrate audio (it loses sync with the video). Some more details can be found here: http://lists.mplayerhq.hu/pipermail/mplaye...ust/045017.html and in the follow-up messages. Any volunteers to investigate this problem?
All in all, ffmpeg optimizations for ARM are not nearly as good as those for x86, so investing some time in them may yield some performance improvement. |
|
|
|
Dec 6 2006, 09:14 AM
Post
#2
|
|
|
Group: Members Posts: 10 Joined: 29-November 06 Member No.: 12,936 |
I second that; a better player would be great.
I'm a noob with Linux, but if I can help in one way or another I would be pleased to. |
|
|
|
Dec 7 2006, 09:34 AM
Post
#3
|
|
|
Group: Members Posts: 69 Joined: 16-May 06 From: France, Metz Member No.: 9,882 |
Hi!
Check atty sources, 99% of mplayer for the zaurus is optimized with iwmmx code. Cheers, Ludo. |
|
|
|
Dec 7 2006, 09:53 AM
Post
#4
|
|
Group: Members Posts: 1,014 Joined: 4-January 05 From: Enschede, The Netherlands Member No.: 6,107 |
|
|
|
|
Dec 7 2006, 10:54 AM
Post
#5
|
|
Group: Members Posts: 1,156 Joined: 5-January 05 From: Winnipeg, Manitoba Member No.: 6,127 |
|
|
|
|
Dec 7 2006, 11:06 AM
Post
#6
|
|
|
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724 |
QUOTE(ldrolez @ Dec 7 2006, 09:34 AM) Check atty sources, 99% of mplayer for the zaurus is optimized with iwmmx code.
Well, that's very good. Can anybody verify that this iwmmx code works correctly and submit everything that is usable upstream? If it is already there, can you confirm that it is really in good shape? I know that some of atty's code was committed to the upstream MPlayer source tree (you can check the SVN changelog), but I doubt that anyone tested it. The check for iwmmx availability was only added to the MPlayer configure script in the 1.0rc1 release, so until this latest release the code was not usable without additional patches.
Speaking of iwmmx optimizations, the idct code in MPlayer still does not use iwmmx at all, and it is one of the most performance-critical parts of the code. Only the latest MPlayer release got an armv5te-optimized idct, which was tuned according to http://www.arm.com/pdfs/DDI0222B_9EJS_r1p2.pdf (ARM9E instruction timings). As far as I know, it was developed and tested on the Nokia 770, where it improved mpeg4 decoding performance by about 10%. Most likely this code is not very good for XScale, as XScale has a much more complicated pipeline with lots of interlocks if the code is not arranged the way it likes (see http://download.intel.com/design/intelxscale/27347302.pdf). I wonder whether some 'blended' idct code can be developed, or whether it is better to have separate implementations for ARM9E and XScale. Either way, it needs to be benchmarked before any decisions are made.
In addition, Zaurus builds of MPlayer seem to use some additional modules for hardware-accelerated video output. I wonder if it is a good idea to contribute them upstream? MPlayer already has special video output code for some old 3dfx and Matrox video cards, so I doubt that Zaurus-specific video output code is any more exotic or any less worth supporting upstream. |
|
|
|
Dec 7 2006, 02:29 PM
Post
#7
|
|
Group: Members Posts: 1,014 Joined: 4-January 05 From: Enschede, The Netherlands Member No.: 6,107 |
|
|
|
|
Dec 7 2006, 02:45 PM
Post
#8
|
|
Group: Members Posts: 682 Joined: 26-December 05 From: Rochdale, Lancashire Member No.: 8,789 |
I'm very happy to learn that the ARM-specific parts of MPlayer are being actively developed, so please keep us updated on the progress, Serge.
Antikx: I agree! I don't think any event in the world of OSS and computer hardware can escape the all-pervading attention of the supreme tech oracle that is koen. Seriously! I think that man must have an RSS reader, email client and web browser embedded in his head that he can monitor and post from even when asleep. |
|
|
|
Dec 11 2006, 12:30 PM
Post
#9
|
|
|
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724 |
QUOTE(danboid @ Dec 7 2006, 02:45 PM) I'm very happy to learn that the ARM specific parts of mplayer are being actively developed so please keep us updated on its progress serge.
Well, 'actively developed' is a gross overestimation. Anyway, further decoder optimizations are still needed, at least if we want to make an attempt at proper playback support for unconverted video. |
|
|
|
Dec 25 2006, 02:30 AM
Post
#10
|
|
|
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724 |
Just to keep you informed, the work on implementing an MPlayer video output driver with hardware YUV support for the Nokia 770 is more or less finished. At least it is in a usable state now.
But in order to get good performance at any video resolution, an optimized YV12->YUY2 scaler is still needed on the Nokia 770. By the way, how does the Zaurus handle video scaling? Is it hardware accelerated, or is a software scaler used? If it is a software scaler, what YUV format is used for output?
Here is some MPlayer console log output from the Nokia 770 (the video is software scaled to 400x210 and then hardware pixel doubling is used to show it fullscreen as 800x420):
CODE
VO: [nokia770] 336x176 => 336x176 Planar YV12 [fs]
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: FAST_BILINEAR scaler, from yuv420p to yuyv422 using C
SwScaler: using FAST_BILINEAR C scaler for horizontal scaling
SwScaler: using 2-tap linear C scaler for vertical scaling (BGR)
SwScaler: 336x176 -> 400x210
What do you usually observe on your Zaurus? |
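To make the YV12->YUY2 step concrete, here is a minimal sketch of the format conversion alone, without any scaling. It is not the Nokia 770 driver code and not what SwScaler does internally; the function name `yv12_to_yuy2` is made up for illustration. It assumes the usual layouts: YV12 is a full-resolution Y plane followed by quarter-size V and U planes, and YUY2 packs each pixel pair as Y0 U Y1 V.

```python
def yv12_to_yuy2(frame: bytes, width: int, height: int) -> bytes:
    """Repack one planar YV12 frame as packed YUY2 (no scaling).

    Hypothetical helper for illustration only. Each chroma sample is
    shared by a 2x2 block of pixels in YV12, so every chroma row is
    simply reused for two output rows.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    y_size = width * height
    c_width = width // 2
    c_size = c_width * (height // 2)
    if len(frame) != y_size + 2 * c_size:
        raise ValueError("frame size does not match YV12 layout")

    y_plane = frame[:y_size]
    v_plane = frame[y_size:y_size + c_size]   # V comes before U in YV12
    u_plane = frame[y_size + c_size:]

    out = bytearray(width * height * 2)       # YUY2 uses 2 bytes per pixel
    for row in range(height):
        y_row = y_plane[row * width:(row + 1) * width]
        c_off = (row // 2) * c_width
        u_row = u_plane[c_off:c_off + c_width]
        v_row = v_plane[c_off:c_off + c_width]
        start = row * width * 2
        end = start + width * 2
        out[start:end:4] = y_row[0::2]        # Y0
        out[start + 1:end:4] = u_row          # U
        out[start + 2:end:4] = y_row[1::2]    # Y1
        out[start + 3:end:4] = v_row          # V
    return bytes(out)
```

A real implementation would do this in C or assembly and combine it with the scaling pass; the sketch only shows which bytes end up where.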
|
|
|
Dec 25 2006, 04:16 AM
Post
#11
|
|
Group: Members Posts: 1,014 Joined: 4-January 05 From: Enschede, The Netherlands Member No.: 6,107 |
QUOTE(Serge @ Dec 25 2006, 10:30 AM) By the way, how does Zaurus handle video scaling? Is it hardware accelerated or a software scaler is used? If it is software scaler, what YUV format is used for output?
That depends on the model, but basically:
* collie: no acceleration at all
* poodle: ditto
* c7x0: ATI Imageon w100, which can do limited scaling, YUV transforms and idct (http://libw100.sf.net/)
* cxxxx: pxa270fb, which doesn't do scaling AFAIK, but can do YUV transforms and has a small amount of SRAM for faster blitting when using QVGA.
The cxxxx models can also use iwmmxt instructions. A crude test showed that this only gives a ~2% improvement, but there's a lot of room for improvement. The c7x0 models would benefit from people helping the libw100 project. |
|
|
|
Dec 25 2006, 05:38 AM
Post
#12
|
|
Group: Members Posts: 1,014 Joined: 4-January 05 From: Enschede, The Netherlands Member No.: 6,107 |
QUOTE(koen @ Dec 25 2006, 12:16 PM) The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement. The c7x0 models would benefit from people helping the libw100 project. 'XorA' in #oe on irc.freenode.net is our resident mplayer guru and 'sirfred' the w100 guru. |
|
|
|
Dec 26 2006, 03:53 PM
Post
#13
|
|
|
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724 |
Some information about MPlayer benchmarking. It has a -benchmark option which measures the time spent on decoding video, displaying video (including scaling and color conversion) and audio.
One of the options that affects decoding performance is the idct implementation. It can be specified with -lavdopts idct=#, where # is a decimal number. The MPlayer man page contains the following information:
CODE
idct=<0-99>
IDCT algorithm
NOTE: To the best of our knowledge all these IDCTs do pass the IEEE1180 tests.
0 Automatically select a good one (default).
1 JPEG reference integer
2 simple
3 simplemmx
4 libmpeg2mmx (inaccurate, do not use for encoding with keyint >100)
5 ps2
6 mlib
7 arm
8 AltiVec
9 sh4
But the man page is a bit incomplete, and more information can be found in libavcodec/avcodec.h:
CODE
#define FF_IDCT_AUTO 0
#define FF_IDCT_INT 1
#define FF_IDCT_SIMPLE 2
#define FF_IDCT_SIMPLEMMX 3
#define FF_IDCT_LIBMPEG2MMX 4
#define FF_IDCT_PS2 5
#define FF_IDCT_MLIB 6
#define FF_IDCT_ARM 7
#define FF_IDCT_ALTIVEC 8
#define FF_IDCT_SH4 9
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_H264 11
#define FF_IDCT_VP3 12
#define FF_IDCT_IPP 13
#define FF_IDCT_XVIDMMX 14
#define FF_IDCT_CAVS 15
#define FF_IDCT_SIMPLEARMV5TE 16
The following idct implementations are of interest on ARM:
#define FF_IDCT_ARM 7 (default idct that was used for ARM)
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_SIMPLEARMV5TE 16 (recently added in mplayer 1.0rc1)
In order to benchmark video decoding I used the following video clip (10MB version, MD5=1d62b8819bf1433df0dc9b5257f4fc35). Direct link is here: http://trailers.divx.com/Universal/Doom.divx
It does not matter which video is used; my only concern was that it should be freely downloadable, so that results from different machines can be compared.
My setup is MPlayer 1.0rc1, Nokia 770 (ARM926EJS 250MHz), gcc version 3.4.4 (release) (CodeSourcery ARM 2005q3-2), configured with CFLAGS="-O4 -mcpu=arm926ej-s -fomit-frame-pointer -ffast-math"
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=7 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=10 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 69.828s VO: 0.072s A: 0.000s Sys: 0.605s = 70.506s
BENCHMARKs: VC: 71.838s VO: 0.073s A: 0.000s Sys: 0.629s = 72.539s
BENCHMARKs: VC: 71.903s VO: 0.074s A: 0.000s Sys: 0.634s = 72.611s
BENCHMARKs: VC: 72.563s VO: 0.073s A: 0.000s Sys: 0.626s = 73.262s
BENCHMARKs: VC: 72.373s VO: 0.073s A: 0.000s Sys: 0.653s = 73.099s
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
Here is also the result for FF_IDCT_SIMPLE (a plain C implementation with no assembly) for comparison:
# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=2 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 71.117s VO: 0.072s A: 0.000s Sys: 0.622s = 71.811s
BENCHMARKs: VC: 72.435s VO: 0.072s A: 0.000s Sys: 0.598s = 73.105s
BENCHMARKs: VC: 72.576s VO: 0.073s A: 0.000s Sys: 0.663s = 73.312s
BENCHMARKs: VC: 73.364s VO: 0.074s A: 0.000s Sys: 0.660s = 74.098s
BENCHMARKs: VC: 73.304s VO: 0.073s A: 0.000s Sys: 0.637s = 74.014s
So the fastest idct for the Nokia 770 is FF_IDCT_SIMPLEARMV5TE (number 16). It has some optimizations using armv5te dsp instructions (single-cycle 16 x 16 bit multiplication). It is also now the default in mplayer 1.0rc1 for any cpu that supports armv5te instructions. This code is the first revision and most likely can be optimized even more. Also, the overall difference between idct implementations may vary between video files: I observed a performance improvement of up to 10% (on high bitrate but low resolution movies). For this particular file we see that the improvement is only about 6%.
A strange thing in these benchmarks is that the results are a bit inconsistent, and decoding time slightly increases with each loop iteration.
It would be very interesting to see some benchmark results from a Zaurus to find out which idct works best for it. MPlayer and ffmpeg don't have any iwmmxt-optimized idct right now (and it could provide some improvement, as it should be able to do two 16 x 16 bit multiplications per cycle). So more benchmarks are welcome, preferably using the same test file. Or you can suggest some other sample for testing. After running these benchmarks, we can also see how big the performance difference between the Nokia 770 and Zaurus hardware is, which might be interesting to know. |
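The four runs above differ only in the idct number, so they are easy to script. The following is a minimal sketch, not part of MPlayer: the helper names `build_command` and `run_benchmark` are made up, and it assumes that `mplayer` is on the PATH and that Doom.divx is in the current directory. The command-line flags are exactly the ones used in the commands above.

```python
import subprocess

# idct numbers taken from libavcodec/avcodec.h as listed in this post
ARM_IDCTS = {
    2: "FF_IDCT_SIMPLE",
    7: "FF_IDCT_ARM",
    10: "FF_IDCT_SIMPLEARM",
    16: "FF_IDCT_SIMPLEARMV5TE",
}


def build_command(idct, clip="Doom.divx", loops=5, binary="mplayer"):
    """Return the benchmark command line for one idct implementation."""
    return [
        binary, "-loop", str(loops), "-quiet", "-benchmark",
        "-nosound", "-vo", "null",
        "-lavdopts", f"idct={idct}",
        clip,
    ]


def run_benchmark(idct, **kwargs):
    """Run mplayer once and keep only the BENCHMARKs lines (like grep)."""
    result = subprocess.run(
        build_command(idct, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines()
            if "BENCHMARKs" in line]
```

Looping over `ARM_IDCTS` and printing `run_benchmark(n)` for each number should reproduce the layout of the results above on any machine with the same clip.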
|
|
|
Dec 27 2006, 12:36 AM
Post
#14
|
|
Group: Members Posts: 682 Joined: 26-December 05 From: Rochdale, Lancashire Member No.: 8,789 |
Hi Serge!
I conducted a bunch of benchmark tests using a Zaurus C3000 running pdaXii13 build4 full, which includes Meanie's build of mplayer 1.0rc1 (he has named the binary mplayer3). I used the same Doom divx clip that you linked in all the tests, with the same command you used. For these first four sets of benchmarks the Z was running at the standard 416MHz setting and the commands were run under an X11 terminal:
------------------------------
idct7:
BENCHMARKs: VC: 58.484s VO: 0.088s A: 0.000s Sys: 2.460s = 61.032s
BENCHMARKs: VC: 57.614s VO: 0.070s A: 0.000s Sys: 0.848s = 58.531s
BENCHMARKs: VC: 57.865s VO: 0.075s A: 0.000s Sys: 0.842s = 58.781s
BENCHMARKs: VC: 57.753s VO: 0.078s A: 0.000s Sys: 0.851s = 58.682s
BENCHMARKs: VC: 57.837s VO: 0.074s A: 0.000s Sys: 0.835s = 58.746s
idct10:
BENCHMARKs: VC: 59.045s VO: 0.072s A: 0.000s Sys: 2.366s = 61.483s
BENCHMARKs: VC: 59.071s VO: 0.070s A: 0.000s Sys: 0.989s = 60.130s
BENCHMARKs: VC: 59.188s VO: 0.071s A: 0.000s Sys: 0.859s = 60.118s
BENCHMARKs: VC: 59.163s VO: 0.071s A: 0.000s Sys: 0.855s = 60.089s
BENCHMARKs: VC: 59.157s VO: 0.070s A: 0.000s Sys: 0.838s = 60.065s
idct16:
BENCHMARKs: VC: 54.462s VO: 0.124s A: 0.000s Sys: 2.615s = 57.201s
BENCHMARKs: VC: 57.047s VO: 0.078s A: 0.000s Sys: 2.020s = 59.145s
BENCHMARKs: VC: 56.930s VO: 0.072s A: 0.000s Sys: 1.586s = 58.588s
BENCHMARKs: VC: 53.739s VO: 0.072s A: 0.000s Sys: 0.859s = 54.670s
BENCHMARKs: VC: 53.948s VO: 0.070s A: 0.000s Sys: 1.672s = 55.690s
idct2:
BENCHMARKs: VC: 59.714s VO: 0.070s A: 0.000s Sys: 2.524s = 62.308s
BENCHMARKs: VC: 61.109s VO: 0.074s A: 0.000s Sys: 1.822s = 63.005s
BENCHMARKs: VC: 60.556s VO: 0.071s A: 0.000s Sys: 0.879s = 61.506s
BENCHMARKs: VC: 60.216s VO: 0.070s A: 0.000s Sys: 0.847s = 61.133s
BENCHMARKs: VC: 60.157s VO: 0.070s A: 0.000s Sys: 0.898s = 61.125s
----------------------------
For the next four sets of benchmarks I overclocked to 624MHz, quit out of X11 and ran the command from the console for max performance:
idct7:
BENCHMARKs: VC: 37.560s VO: 0.072s A: 0.000s Sys: 2.349s = 39.981s
BENCHMARKs: VC: 38.063s VO: 0.049s A: 0.000s Sys: 0.561s = 38.673s
BENCHMARKs: VC: 38.066s VO: 0.050s A: 0.000s Sys: 0.563s = 38.679s
BENCHMARKs: VC: 38.078s VO: 0.050s A: 0.000s Sys: 0.560s = 38.688s
BENCHMARKs: VC: 38.081s VO: 0.050s A: 0.000s Sys: 0.559s = 38.690s
idct10:
BENCHMARKs: VC: 36.988s VO: 0.050s A: 0.000s Sys: 0.562s = 37.600s
BENCHMARKs: VC: 38.759s VO: 0.049s A: 0.000s Sys: 0.559s = 39.368s
BENCHMARKs: VC: 38.770s VO: 0.050s A: 0.000s Sys: 0.563s = 39.382s
BENCHMARKs: VC: 38.718s VO: 0.050s A: 0.000s Sys: 0.560s = 39.328s
BENCHMARKs: VC: 38.736s VO: 0.049s A: 0.000s Sys: 0.559s = 39.344s
idct16:
BENCHMARKs: VC: 33.716s VO: 0.050s A: 0.000s Sys: 0.567s = 34.333s
BENCHMARKs: VC: 35.310s VO: 0.049s A: 0.000s Sys: 0.559s = 35.919s
BENCHMARKs: VC: 35.401s VO: 0.050s A: 0.000s Sys: 0.563s = 36.014s
BENCHMARKs: VC: 35.281s VO: 0.050s A: 0.000s Sys: 0.560s = 35.891s
BENCHMARKs: VC: 35.354s VO: 0.049s A: 0.000s Sys: 0.559s = 35.962s
idct2:
BENCHMARKs: VC: 37.474s VO: 0.050s A: 0.000s Sys: 0.565s = 38.088s
BENCHMARKs: VC: 39.184s VO: 0.049s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.344s VO: 0.050s A: 0.000s Sys: 0.564s = 39.957s
BENCHMARKs: VC: 39.183s VO: 0.050s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.253s VO: 0.049s A: 0.000s Sys: 0.560s = 39.863s
--------------------
So, just as on the 770, it would seem idct16 is clearly the fastest. |
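Comparing the sets by eye works, but the same comparison can be done by averaging the VC column. This is a rough sketch with made-up helper names (`parse_benchmarks`, `mean_vc`, `improvement_percent`); the regular expression assumes the BENCHMARKs line format shown in the results above and tolerates extra spaces between fields.

```python
import re
from statistics import mean

# Matches one "BENCHMARKs: VC: ...s VO: ...s A: ...s Sys: ...s = ...s" record
BENCH_RE = re.compile(
    r"BENCHMARKs:\s*VC:\s*([\d.]+)s\s*VO:\s*([\d.]+)s\s*"
    r"A:\s*([\d.]+)s\s*Sys:\s*([\d.]+)s\s*=\s*([\d.]+)s"
)


def parse_benchmarks(text):
    """Return a list of (vc, vo, audio, sys, total) tuples in seconds."""
    return [tuple(float(x) for x in m.groups())
            for m in BENCH_RE.finditer(text)]


def mean_vc(text):
    """Average video decoding time over all runs found in the text."""
    return mean(row[0] for row in parse_benchmarks(text))


def improvement_percent(baseline, candidate):
    """How much faster candidate is than baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline
```

Feeding the idct7 and idct16 blocks from a post into `mean_vc` and then into `improvement_percent` gives a single number per pair, which is easier to compare across machines than five raw lines each. Whether to drop the first run of each set (which shows a higher Sys time in some of the sets) is a judgment call that the sketch leaves to the caller.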
|
|
|
Dec 27 2006, 01:27 AM
Post
#15
|
|
Group: Members Posts: 1,014 Joined: 4-January 05 From: Enschede, The Netherlands Member No.: 6,107 |
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.
If that isn't the case, arm926 cores kick xscale ass.
CODE
root@h2200:/data# sh doom-test.sh
idct is 2
BENCHMARKs: VC: 82.432s VO: 0.071s A: 0.000s Sys: 1.293s = 83.796s
BENCHMARKs: VC: 80.798s VO: 0.066s A: 0.000s Sys: 0.916s = 81.780s
BENCHMARKs: VC: 80.758s VO: 0.067s A: 0.000s Sys: 0.912s = 81.737s
BENCHMARKs: VC: 80.676s VO: 0.070s A: 0.000s Sys: 0.897s = 81.643s
BENCHMARKs: VC: 80.649s VO: 0.067s A: 0.000s Sys: 0.950s = 81.665s
idct is 7
BENCHMARKs: VC: 75.593s VO: 0.069s A: 0.000s Sys: 0.902s = 76.564s
BENCHMARKs: VC: 78.993s VO: 0.069s A: 0.000s Sys: 0.903s = 79.965s
BENCHMARKs: VC: 79.248s VO: 0.066s A: 0.000s Sys: 0.933s = 80.246s
BENCHMARKs: VC: 79.242s VO: 0.067s A: 0.000s Sys: 0.931s = 80.239s
BENCHMARKs: VC: 79.080s VO: 0.066s A: 0.000s Sys: 0.904s = 80.050s
idct is 10
BENCHMARKs: VC: 77.020s VO: 0.067s A: 0.000s Sys: 0.905s = 77.992s
BENCHMARKs: VC: 80.152s VO: 0.066s A: 0.000s Sys: 0.905s = 81.124s
BENCHMARKs: VC: 80.219s VO: 0.181s A: 0.000s Sys: 0.903s = 81.303s
BENCHMARKs: VC: 80.238s VO: 0.066s A: 0.000s Sys: 1.024s = 81.328s
BENCHMARKs: VC: 80.359s VO: 0.066s A: 0.000s Sys: 0.906s = 81.331s
idct is 16
BENCHMARKs: VC: 73.140s VO: 0.068s A: 0.000s Sys: 0.916s = 74.124s
BENCHMARKs: VC: 76.616s VO: 0.066s A: 0.000s Sys: 1.014s = 77.695s
BENCHMARKs: VC: 76.927s VO: 0.066s A: 0.000s Sys: 0.905s = 77.899s
BENCHMARKs: VC: 76.992s VO: 0.069s A: 0.000s Sys: 0.906s = 77.966s
BENCHMARKs: VC: 77.157s VO: 0.067s A: 0.000s Sys: 0.940s = 78.165s |
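The per-core remark can be made a little more concrete by multiplying the mean idct=16 decode time by the CPU clock, which gives a crude "megacycles per clip" figure for each device. This is only a sketch: the numbers are copied from the three posts in this thread, the helper names are made up, and the figure ignores memory bus speed, compiler and build differences, so it cannot separate the two explanations given above.

```python
from statistics import mean

# (CPU clock in MHz, VC times in seconds for idct=16), copied from the posts.
RUNS = {
    "Nokia 770": (250, [64.130, 65.372, 65.493, 66.321, 66.202]),
    "Zaurus C3000": (416, [54.462, 57.047, 56.930, 53.739, 53.948]),
    "iPAQ h2200": (400, [73.140, 76.616, 76.927, 76.992, 77.157]),
}


def mega_cycles(clock_mhz, seconds):
    """Clock cycles spent, in millions (MHz * s = 1e6 cycles)."""
    return clock_mhz * seconds


def per_clock_cost(runs):
    """Map each device to the megacycles it needed to decode the clip."""
    return {name: mega_cycles(mhz, mean(times))
            for name, (mhz, times) in runs.items()}


if __name__ == "__main__":
    # Print devices from cheapest to most expensive per clip.
    for name, cost in sorted(per_clock_cost(RUNS).items(),
                             key=lambda kv: kv[1]):
        print(f"{name}: {cost:.0f} Mcycles")
```

By this measure the Nokia 770 needs the fewest cycles for the clip and the h2200 the most, which is consistent with the observation above but does not by itself say whether the core or the memory bus is responsible.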
|
|
|
|