Full Version: Mplayer Development And Optimization For Arm
Serge
Probably it is a good idea to consolidate efforts and try to submit some of the useful ARM related patches upstream:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ust/014460.html
http://lists.mplayerhq.hu/pipermail/mplaye...ber/046207.html

I can only test MPlayer on a Nokia 770, so I can't be sure whether optimizations specific to ARM9E (the core used in the Nokia 770) are also good for the Zaurus. People who are able to compile MPlayer from source and test it on a Zaurus are welcome in this thread. One example is the new armv5te-optimized idct in MPlayer 1.0rc1: can anybody benchmark it on a Zaurus?

Also, though this is not really ARM related, the libmad-based decoder in MPlayer seems to have trouble with variable bitrate audio (it loses sync with the video). Some more details can be found here http://lists.mplayerhq.hu/pipermail/mplaye...ust/045017.html and in the follow-up messages. Any volunteers to investigate this problem?

All in all, ffmpeg optimizations for ARM are not nearly as good as those for x86, so investing some time in them may provide some performance improvement.
washo
I second that; a better player would be great.
I'm a noob with Linux, but if I can help in one way or another I would be pleased to.

See you :D
ldrolez
Hi!
Check atty's sources: 99% of mplayer for the Zaurus is optimized with iwmmxt code.
Cheers,
Ludo.
koen
QUOTE(ldrolez @ Dec 7 2006, 05:34 PM)
Hi!
Check atty's sources: 99% of mplayer for the Zaurus is optimized with iwmmxt code.
Cheers,
  Ludo.

mpeg-video decoder isn't 99% of mplayer-atty, and those bits are in upstream mplayer as well.
Antikx
QUOTE(koen @ Dec 7 2006, 11:53 AM)
mpeg-video decoder isn't 99% of mplayer-atty, and those bits are in upstream mplayer as well.

I'm not being sarcastic... it's overwhelming how much you know. I hope you don't start using that power for evil one day. ;)
Serge
QUOTE(ldrolez @ Dec 7 2006, 09:34 AM)
Check atty's sources: 99% of mplayer for the Zaurus is optimized with iwmmxt code.

Well, that's very good. Can anybody verify that this iwmmxt code works correctly and submit everything usable upstream? If it is already there, can you confirm that it is really in good shape?

I know that some of atty's code was committed to the upstream MPlayer source tree (you can check the SVN changelog), but I doubt that anyone tested it. The check for iwmmxt availability was only added to the MPlayer configure script in the 1.0rc1 release, so up until this last release it was not usable without additional patches.

Speaking of iwmmxt optimizations, the idct code in MPlayer still does not use iwmmxt at all, and it is one of the most performance-critical parts of the code. Only the latest MPlayer release got an armv5te-optimized idct, which was tuned according to http://www.arm.com/pdfs/DDI0222B_9EJS_r1p2.pdf (ARM9E instruction timings). As far as I know, it was developed and tested for the Nokia 770, where it improved mpeg4 decoding performance by about 10%. Most likely this code is not very good for XScale, as XScale has a much more complicated pipeline with lots of interlocks if the code is not arranged the way it likes (see http://download.intel.com/design/intelxscale/27347302.pdf). I wonder whether some 'blended' idct code can be developed, or whether it is better to have separate implementations for ARM9E and XScale. Anyway, it needs to be benchmarked before making any decisions.

In addition, Zaurus builds of MPlayer seem to use some additional modules for hardware-accelerated video output. I wonder if it is a good idea to contribute them upstream? MPlayer already has special video output code for some old 3dfx and Matrox video cards, so I doubt that Zaurus-specific video output code is any more exotic or less worth supporting upstream :)
koen
QUOTE(Serge @ Dec 7 2006, 07:06 PM)
The check for iwmmxt availability was only added to the MPlayer configure script in the 1.0rc1 release, so up until this last release it was not usable without additional patches.

We had patches for that in OE :) I haven't tested it yet, though.
danboid
I'm very happy to learn that the ARM-specific parts of mplayer are being actively developed, so please keep us updated on the progress, Serge.

Antikx: I agree! I don't think any event in the world of OSS and computer hardware can escape the all-pervading attention of the supreme tech oracle that is koen, seriously! I think that man must have an RSS reader, email client and web browser embedded in his head that he can monitor and post from even when asleep :D
Serge
QUOTE(danboid @ Dec 7 2006, 02:45 PM)
I'm very happy to learn that the ARM-specific parts of mplayer are being actively developed, so please keep us updated on the progress, Serge.

Well, 'actively developed' is a gross overestimation :) I don't think anybody else is working on ARM optimizations for ffmpeg right now. And I have currently switched to developing hardware-accelerated video output code for the Nokia 770: http://maemo.org/pipermail/maemo-developer...ber/006646.html

Anyway, further decoder optimizations are still needed, at least if we want to attempt proper playback support for unconverted video :) Having to convert everything to 320x240 (or to 400x224 for 16:9) is not much fun. You are lucky to have a faster CPU in the Zaurus ;)
Serge
Just to keep you informed, the work on implementing an MPlayer video output driver with hardware YUV support for the Nokia 770 is more or less finished. At least it is in a usable state now.

But to get good performance at any video resolution, an optimized YV12->YUY2 scaler is still needed on the Nokia 770. By the way, how does the Zaurus handle video scaling? Is it hardware accelerated, or is a software scaler used? If it is a software scaler, what YUV format is used for output?

Here is some MPlayer console log output from the Nokia 770 (the video is software scaled to 400x210 and then hardware pixel doubling is used to show it fullscreen as 800x420):
CODE
VO: [nokia770] 336x176 => 336x176 Planar YV12 [fs]
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2
SwScaler: reducing / aligning filtersize 2 -> 2

SwScaler: FAST_BILINEAR scaler, from yuv420p to yuyv422 using C
SwScaler: using FAST_BILINEAR C scaler for horizontal scaling
SwScaler: using 2-tap linear C scaler for vertical scaling (BGR)
SwScaler: 336x176 -> 400x210

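The 400x210 target in the log is consistent with fitting the 336x176 source into half of the 800x480 screen while keeping the aspect ratio. A small sketch of that arithmetic follows. The helper name `fit_size`, the 400x240 bound and the rounding to even sizes are assumptions made for illustration, not something taken from the video output driver itself.

```python
def fit_size(src_w, src_h, max_w, max_h, align=2):
    """Largest size with the source aspect ratio that fits into max_w x max_h.

    Both dimensions are rounded to a multiple of `align` (an assumption;
    YUV formats with subsampled chroma usually want even sizes).
    """
    scale = min(max_w / src_w, max_h / src_h)

    def snap(value):
        return int(round(value / align)) * align

    return min(snap(src_w * scale), max_w), min(snap(src_h * scale), max_h)


# 336x176 source, target area 400x240 (half of the 800x480 screen).
w, h = fit_size(336, 176, 400, 240)
print("software scale to %dx%d" % (w, h))          # 400x210
print("after pixel doubling: %dx%d" % (2 * w, 2 * h))  # 800x420
```
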
What do you usually observe on your Zaurus?
koen
QUOTE(Serge @ Dec 25 2006, 10:30 AM)
By the way, how does the Zaurus handle video scaling? Is it hardware accelerated, or is a software scaler used? If it is a software scaler, what YUV format is used for output?

That depends on the model, but basically:

* collie: no acceleration at all
* poodle: ditto
* c7x0: ati imageon w100 which can do limited scaling, YUV transform and idct (http://libw100.sf.net/)
* cxxxx: pxa270fb, which doesn't do scaling AFAIK, but can do YUV transforms and has a small amount of SRAM to do faster blitting when using QVGA.

The cxxxx models can also use iwmmxt instructions. A crude test showed that this only gives a ~2% improvement, but there's a lot of room for improvement.
The c7x0 models would benefit from people helping the libw100 project.
koen
QUOTE(koen @ Dec 25 2006, 12:16 PM)
The cxxxx models can also use iwmmxt instructions. A crude test showed that this only gives a ~2% improvement, but there's a lot of room for improvement.
The c7x0 models would benefit from people helping the libw100 project.

'XorA' in #oe on irc.freenode.net is our resident mplayer guru and 'sirfred' the w100 guru.
Serge
Some information about MPlayer benchmarking: it has a -benchmark option which measures the time spent on decoding video, displaying video (including scaling and color conversion) and audio.

One of the options that affects decoding performance is the idct implementation. It can be selected with -lavdopts idct=#, where # is a decimal number. The MPlayer man page contains the following information:
CODE
      idct=<0-99>
             IDCT algorithm
             NOTE: To the best of our knowledge all these IDCTs do pass the IEEE1180 tests.
                0    Automatically select a good one (default).
                1    JPEG reference integer
                2    simple
                3    simplemmx
                4    libmpeg2mmx (inaccurate, do not use for encoding with keyint >100)
                5    ps2
                6    mlib
                7    arm
                8    AltiVec
                9    sh4


But the man page is a bit incomplete; more information can be found in libavcodec/avcodec.h:
CODE
#define FF_IDCT_AUTO         0
#define FF_IDCT_INT          1
#define FF_IDCT_SIMPLE       2
#define FF_IDCT_SIMPLEMMX    3
#define FF_IDCT_LIBMPEG2MMX  4
#define FF_IDCT_PS2          5
#define FF_IDCT_MLIB         6
#define FF_IDCT_ARM          7
#define FF_IDCT_ALTIVEC      8
#define FF_IDCT_SH4          9
#define FF_IDCT_SIMPLEARM    10
#define FF_IDCT_H264         11
#define FF_IDCT_VP3          12
#define FF_IDCT_IPP          13
#define FF_IDCT_XVIDMMX      14
#define FF_IDCT_CAVS         15
#define FF_IDCT_SIMPLEARMV5TE 16


The following idct implementations can be interesting on ARM:
#define FF_IDCT_ARM 7 (default idct that was used for ARM)
#define FF_IDCT_SIMPLEARM 10
#define FF_IDCT_SIMPLEARMV5TE 16 (recently added in mplayer 1.0rc1)

In order to benchmark video decoding I used the following video clip (10MB version, MD5=1d62b8819bf1433df0dc9b5257f4fc35). Direct link is here: http://trailers.divx.com/Universal/Doom.divx

It does not matter which video is used; my only concern was that it should be freely downloadable so that results from different machines can be compared.

My setup is MPlayer 1.0rc1, Nokia 770 (ARM926EJS 250MHz), gcc version 3.4.4 (release) (CodeSourcery ARM 2005q3-2), configured with CFLAGS="-O4 -mcpu=arm926ej-s -fomit-frame-pointer -ffast-math"

# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=7 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s

# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=10 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 69.828s VO: 0.072s A: 0.000s Sys: 0.605s = 70.506s
BENCHMARKs: VC: 71.838s VO: 0.073s A: 0.000s Sys: 0.629s = 72.539s
BENCHMARKs: VC: 71.903s VO: 0.074s A: 0.000s Sys: 0.634s = 72.611s
BENCHMARKs: VC: 72.563s VO: 0.073s A: 0.000s Sys: 0.626s = 73.262s
BENCHMARKs: VC: 72.373s VO: 0.073s A: 0.000s Sys: 0.653s = 73.099s

# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s

Here is also the result for FF_IDCT_SIMPLE (just C implementation with no assembly) for comparison:

# mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=2 Doom.divx | grep BENCHMARKs
BENCHMARKs: VC: 71.117s VO: 0.072s A: 0.000s Sys: 0.622s = 71.811s
BENCHMARKs: VC: 72.435s VO: 0.072s A: 0.000s Sys: 0.598s = 73.105s
BENCHMARKs: VC: 72.576s VO: 0.073s A: 0.000s Sys: 0.663s = 73.312s
BENCHMARKs: VC: 73.364s VO: 0.074s A: 0.000s Sys: 0.660s = 74.098s
BENCHMARKs: VC: 73.304s VO: 0.073s A: 0.000s Sys: 0.637s = 74.014s

So the fastest idct for the Nokia 770 is FF_IDCT_SIMPLEARMV5TE (number 16), which has some optimizations using armv5te DSP instructions (single-cycle 16 x 16 bit multiplication). It is also now the default in MPlayer 1.0rc1 for any CPU that supports armv5te instructions. This code is the first revision and most likely can be optimized even more. The overall difference between idct implementations may also vary between video files: I have observed performance improvements of up to 10% (on high-bitrate but low-resolution movies). For this particular file the improvement over the old default (idct=7) is only about 6%.
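The percentages can be checked directly from the BENCHMARKs lines posted above. Here is a minimal sketch that extracts the VC (video codec) times and compares the mean decode time of the implementations. The helper names `mean_vc` and `gain_percent` are made up for this example; the log lines are the Nokia 770 results from this post.

```python
import re

# Matches the "VC:" field of an mplayer -benchmark summary line.
BENCH_RE = re.compile(r"BENCHMARKs: VC:\s*([\d.]+)s")

IDCT7 = """
BENCHMARKs: VC: 67.369s VO: 0.075s A: 0.000s Sys: 0.600s = 68.043s
BENCHMARKs: VC: 69.296s VO: 0.075s A: 0.000s Sys: 0.630s = 70.001s
BENCHMARKs: VC: 69.346s VO: 0.075s A: 0.000s Sys: 0.622s = 70.044s
BENCHMARKs: VC: 70.332s VO: 0.074s A: 0.000s Sys: 0.674s = 71.080s
BENCHMARKs: VC: 70.067s VO: 0.074s A: 0.000s Sys: 0.617s = 70.758s
"""
IDCT16 = """
BENCHMARKs: VC: 64.130s VO: 0.074s A: 0.000s Sys: 0.641s = 64.845s
BENCHMARKs: VC: 65.372s VO: 0.074s A: 0.000s Sys: 0.665s = 66.111s
BENCHMARKs: VC: 65.493s VO: 0.075s A: 0.000s Sys: 0.640s = 66.208s
BENCHMARKs: VC: 66.321s VO: 0.076s A: 0.000s Sys: 0.629s = 67.026s
BENCHMARKs: VC: 66.202s VO: 0.075s A: 0.000s Sys: 0.642s = 66.919s
"""
IDCT2 = """
BENCHMARKs: VC: 71.117s VO: 0.072s A: 0.000s Sys: 0.622s = 71.811s
BENCHMARKs: VC: 72.435s VO: 0.072s A: 0.000s Sys: 0.598s = 73.105s
BENCHMARKs: VC: 72.576s VO: 0.073s A: 0.000s Sys: 0.663s = 73.312s
BENCHMARKs: VC: 73.364s VO: 0.074s A: 0.000s Sys: 0.660s = 74.098s
BENCHMARKs: VC: 73.304s VO: 0.073s A: 0.000s Sys: 0.637s = 74.014s
"""


def mean_vc(log_text):
    """Mean video decoding time (seconds) over all BENCHMARKs lines."""
    times = [float(t) for t in BENCH_RE.findall(log_text)]
    return sum(times) / len(times)


def gain_percent(baseline_log, candidate_log):
    """How much less decode time the candidate needs, in percent."""
    return 100.0 * (1.0 - mean_vc(candidate_log) / mean_vc(baseline_log))


print("idct=16 vs idct=7: %.1f%% faster" % gain_percent(IDCT7, IDCT16))
print("idct=16 vs idct=2: %.1f%% faster" % gain_percent(IDCT2, IDCT16))
```

On these numbers the gain works out to roughly 5.5% over idct=7 and 9.7% over the plain C idct=2.
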

A strange thing in these benchmarks is that the results are a bit inconsistent: decoding time slightly increases with each new loop iteration.

It would be very interesting to see some benchmark results from the Zaurus to find out which idct works best for it. MPlayer and ffmpeg don't have any iwmmxt-optimized idct right now (and it could provide some improvement, as it should be able to do two 16 x 16 bit multiplications per cycle).

So more benchmarks are welcome, preferably using the same test file. Or you can suggest some other sample for testing. After running these benchmarks, we will also see how big the performance difference between the Nokia 770 and Zaurus hardware is, which might be interesting to know :)
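For anyone repeating the runs on another device, the command lines can be generated instead of typed by hand. This is only a sketch: the `IDCT_CANDIDATES` table and the `benchmark_command` helper are names invented for this example, while the mplayer options and idct numbers are the ones used earlier in this post. The output still needs to be piped through `grep BENCHMARKs` as before.

```python
# idct numbers worth comparing on ARM, with their names from avcodec.h.
IDCT_CANDIDATES = {
    2: "FF_IDCT_SIMPLE",
    7: "FF_IDCT_ARM",
    10: "FF_IDCT_SIMPLEARM",
    16: "FF_IDCT_SIMPLEARMV5TE",
}


def benchmark_command(idct, clip, loops=5, player="mplayer"):
    """Argument list for one decode-only benchmark run (no audio, null video output)."""
    return [player, "-loop", str(loops), "-quiet", "-benchmark", "-nosound",
            "-vo", "null", "-lavdopts", "idct=%d" % idct, clip]


for idct in sorted(IDCT_CANDIDATES):
    print("# %s" % IDCT_CANDIDATES[idct])
    print(" ".join(benchmark_command(idct, "Doom.divx")) + " | grep BENCHMARKs")
```

The list form can also be passed to `subprocess.run` directly if the script runs on the device itself.
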
danboid
Hi Serge!

I conducted a bunch of benchmark tests on a Zaurus C3000 running pdaXii13 build4 full, which includes Meanie's build of mplayer 1.0rc1 (he has named the binary mplayer3). I used the same Doom divx clip that you linked in all the tests, with the same command you used.

For these first four sets of benchmarks the Z was running at the standard 416MHz setting and the commands were run in an X11 terminal:

------------------------------

idct7:

BENCHMARKs: VC: 58.484s VO: 0.088s A: 0.000s Sys: 2.460s = 61.032s
BENCHMARKs: VC: 57.614s VO: 0.070s A: 0.000s Sys: 0.848s = 58.531s
BENCHMARKs: VC: 57.865s VO: 0.075s A: 0.000s Sys: 0.842s = 58.781s
BENCHMARKs: VC: 57.753s VO: 0.078s A: 0.000s Sys: 0.851s = 58.682s
BENCHMARKs: VC: 57.837s VO: 0.074s A: 0.000s Sys: 0.835s = 58.746s

idct10:

BENCHMARKs: VC: 59.045s VO: 0.072s A: 0.000s Sys: 2.366s = 61.483s
BENCHMARKs: VC: 59.071s VO: 0.070s A: 0.000s Sys: 0.989s = 60.130s
BENCHMARKs: VC: 59.188s VO: 0.071s A: 0.000s Sys: 0.859s = 60.118s
BENCHMARKs: VC: 59.163s VO: 0.071s A: 0.000s Sys: 0.855s = 60.089s
BENCHMARKs: VC: 59.157s VO: 0.070s A: 0.000s Sys: 0.838s = 60.065s

idct16:

BENCHMARKs: VC: 54.462s VO: 0.124s A: 0.000s Sys: 2.615s = 57.201s
BENCHMARKs: VC: 57.047s VO: 0.078s A: 0.000s Sys: 2.020s = 59.145s
BENCHMARKs: VC: 56.930s VO: 0.072s A: 0.000s Sys: 1.586s = 58.588s
BENCHMARKs: VC: 53.739s VO: 0.072s A: 0.000s Sys: 0.859s = 54.670s
BENCHMARKs: VC: 53.948s VO: 0.070s A: 0.000s Sys: 1.672s = 55.690s

idct2:

BENCHMARKs: VC: 59.714s VO: 0.070s A: 0.000s Sys: 2.524s = 62.308s
BENCHMARKs: VC: 61.109s VO: 0.074s A: 0.000s Sys: 1.822s = 63.005s
BENCHMARKs: VC: 60.556s VO: 0.071s A: 0.000s Sys: 0.879s = 61.506s
BENCHMARKs: VC: 60.216s VO: 0.070s A: 0.000s Sys: 0.847s = 61.133s
BENCHMARKs: VC: 60.157s VO: 0.070s A: 0.000s Sys: 0.898s = 61.125s

----------------------------

For the next four sets of benchmarks I overclocked to 624MHz, quit X11 and ran the command from the console for maximum performance:

idct7:

BENCHMARKs: VC: 37.560s VO: 0.072s A: 0.000s Sys: 2.349s = 39.981s
BENCHMARKs: VC: 38.063s VO: 0.049s A: 0.000s Sys: 0.561s = 38.673s
BENCHMARKs: VC: 38.066s VO: 0.050s A: 0.000s Sys: 0.563s = 38.679s
BENCHMARKs: VC: 38.078s VO: 0.050s A: 0.000s Sys: 0.560s = 38.688s
BENCHMARKs: VC: 38.081s VO: 0.050s A: 0.000s Sys: 0.559s = 38.690s

idct10:

BENCHMARKs: VC: 36.988s VO: 0.050s A: 0.000s Sys: 0.562s = 37.600s
BENCHMARKs: VC: 38.759s VO: 0.049s A: 0.000s Sys: 0.559s = 39.368s
BENCHMARKs: VC: 38.770s VO: 0.050s A: 0.000s Sys: 0.563s = 39.382s
BENCHMARKs: VC: 38.718s VO: 0.050s A: 0.000s Sys: 0.560s = 39.328s
BENCHMARKs: VC: 38.736s VO: 0.049s A: 0.000s Sys: 0.559s = 39.344s

idct16:

BENCHMARKs: VC: 33.716s VO: 0.050s A: 0.000s Sys: 0.567s = 34.333s
BENCHMARKs: VC: 35.310s VO: 0.049s A: 0.000s Sys: 0.559s = 35.919s
BENCHMARKs: VC: 35.401s VO: 0.050s A: 0.000s Sys: 0.563s = 36.014s
BENCHMARKs: VC: 35.281s VO: 0.050s A: 0.000s Sys: 0.560s = 35.891s
BENCHMARKs: VC: 35.354s VO: 0.049s A: 0.000s Sys: 0.559s = 35.962s

idct2:

BENCHMARKs: VC: 37.474s VO: 0.050s A: 0.000s Sys: 0.565s = 38.088s
BENCHMARKs: VC: 39.184s VO: 0.049s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.344s VO: 0.050s A: 0.000s Sys: 0.564s = 39.957s
BENCHMARKs: VC: 39.183s VO: 0.050s A: 0.000s Sys: 0.560s = 39.793s
BENCHMARKs: VC: 39.253s VO: 0.049s A: 0.000s Sys: 0.560s = 39.863s

--------------------

So, just as on the 770, it would seem idct16 is clearly the fastest.
koen
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.
If that isn't the case, arm926 cores kick xscale ass :)

CODE
root@h2200:/data# sh doom-test.sh
idct is 2
BENCHMARKs: VC:  82.432s VO:   0.071s A:   0.000s Sys:   1.293s =   83.796s
BENCHMARKs: VC:  80.798s VO:   0.066s A:   0.000s Sys:   0.916s =   81.780s
BENCHMARKs: VC:  80.758s VO:   0.067s A:   0.000s Sys:   0.912s =   81.737s
BENCHMARKs: VC:  80.676s VO:   0.070s A:   0.000s Sys:   0.897s =   81.643s
BENCHMARKs: VC:  80.649s VO:   0.067s A:   0.000s Sys:   0.950s =   81.665s
idct is 7
BENCHMARKs: VC:  75.593s VO:   0.069s A:   0.000s Sys:   0.902s =   76.564s
BENCHMARKs: VC:  78.993s VO:   0.069s A:   0.000s Sys:   0.903s =   79.965s
BENCHMARKs: VC:  79.248s VO:   0.066s A:   0.000s Sys:   0.933s =   80.246s
BENCHMARKs: VC:  79.242s VO:   0.067s A:   0.000s Sys:   0.931s =   80.239s
BENCHMARKs: VC:  79.080s VO:   0.066s A:   0.000s Sys:   0.904s =   80.050s
idct is 10
BENCHMARKs: VC:  77.020s VO:   0.067s A:   0.000s Sys:   0.905s =   77.992s
BENCHMARKs: VC:  80.152s VO:   0.066s A:   0.000s Sys:   0.905s =   81.124s
BENCHMARKs: VC:  80.219s VO:   0.181s A:   0.000s Sys:   0.903s =   81.303s
BENCHMARKs: VC:  80.238s VO:   0.066s A:   0.000s Sys:   1.024s =   81.328s
BENCHMARKs: VC:  80.359s VO:   0.066s A:   0.000s Sys:   0.906s =   81.331s
idct is 16
BENCHMARKs: VC:  73.140s VO:   0.068s A:   0.000s Sys:   0.916s =   74.124s
BENCHMARKs: VC:  76.616s VO:   0.066s A:   0.000s Sys:   1.014s =   77.695s
BENCHMARKs: VC:  76.927s VO:   0.066s A:   0.000s Sys:   0.905s =   77.899s
BENCHMARKs: VC:  76.992s VO:   0.069s A:   0.000s Sys:   0.906s =   77.966s
BENCHMARKs: VC:  77.157s VO:   0.067s A:   0.000s Sys:   0.940s =   78.165s
Serge
Thanks for running the benchmarks. They show that these armv5te optimizations for idct are useful on xscale too. I was just unsure whether it is possible to develop shared code that runs well on both arm926 and xscale, or whether two different versions have to be implemented. I'll try to optimize this idct further, primarily for arm926, but will keep in mind that the code is also useful on xscale :) Anyway, an iwmmxt implementation of idct specifically optimized for xscale may be a better choice (idct takes quite a noticeable fraction of decoding time, so it would be useful at least for machines like the Zaurus C3000). If anybody skilled in arm assembly would like to try it, I could help with information (but I don't have any machine that can run iwmmxt code).

QUOTE(koen @ Dec 27 2006, 01:27 AM)
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.

That's interesting. If memory performance is really that important for mplayer, it should be possible to find the parts of the code with heavy memory use and optimize their memory access patterns for better cache and memory bus utilization. Some time ago I did some tests to figure out how to make the best use of memory bandwidth on the Nokia 770: http://maemo.org/pipermail/maemo-developer...ber/006579.html

This information may turn out to be very useful for further optimizations :)

QUOTE
If that isn't the case, arm926 cores kick xscale ass :)

Well, the arm926 core should be somewhat faster per clock. Here are some links to optimization docs for different arm flavours: http://www.internettablettalk.com/forums/s...read.php?t=2406

But I expected that 416MHz should still be a lot faster because of the higher cpu clock frequency. Maybe memory performance really is the limiting factor here, and it brings the performance of all these chips closer to each other.
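One rough way to compare the chips per clock is to convert decode time into CPU cycles (seconds × MHz). The sketch below does that for the idct=16 VC times posted in this thread. It takes the nominal clock rates at face value and ignores that the builds, compilers and memory systems differ, so it only shows the size of the gap, not its cause. The `gigacycles` helper is a name made up for this example.

```python
# (nominal MHz, VC seconds for idct=16) copied from the results in this thread.
RUNS = {
    "Nokia 770, ARM926EJ-S": (250.0, [64.130, 65.372, 65.493, 66.321, 66.202]),
    "Zaurus C3000, PXA270": (416.0, [54.462, 57.047, 56.930, 53.739, 53.948]),
    "iPAQ h2200, PXA255": (400.0, [73.140, 76.616, 76.927, 76.992, 77.157]),
}


def gigacycles(mhz, vc_times):
    """Mean decode time converted to billions of CPU cycles."""
    mean_seconds = sum(vc_times) / len(vc_times)
    return mean_seconds * mhz / 1000.0


baseline = gigacycles(*RUNS["Nokia 770, ARM926EJ-S"])
for name, (mhz, times) in RUNS.items():
    spent = gigacycles(mhz, times)
    print("%-24s %6.2f Gcycles (%.2fx the 770)" % (name, spent, spent / baseline))
```

With these inputs the PXA270 spends about 1.4 times and the PXA255 about 1.9 times as many cycles as the 770 on the same clip.
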

Another possible explanation could be a suboptimal set of optimization options or an older version of gcc for the Zaurus builds of mplayer. It should be relatively easy to test mplayer with a different set of optimization options. You can take the upstream mplayer 1.0rc1 tarball and compile it using:
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

It may have some problems with video/audio output drivers if compiled without the Zaurus-specific patches, but this should not be a problem for testing decoding capabilities only :)
Serge
QUOTE(koen @ Dec 25 2006, 04:16 AM)
The cxxxx models can also use iwmmxt instructions. A crude test showed that this only gives a ~2% improvement, but there's a lot of room for improvement.

That seems a bit too low; I wonder if mplayer was configured and compiled correctly. The point is that the motion compensation code in mplayer is currently much better optimized for iwmmxt (all of that work was done by atty). You can just look into the mplayer sources.

Here is the code used for ARM without iwmmxt (libavcodec/armv4l/dsputil_arm.c):
CODE
/*     c->put_pixels_tab[0][0] = put_pixels16_arm; */ // NG!
   c->put_pixels_tab[0][1] = put_pixels16_x2_arm; //OK!
   c->put_pixels_tab[0][2] = put_pixels16_y2_arm; //OK!
/*     c->put_pixels_tab[0][3] = put_pixels16_xy2_arm; /\* NG *\/ */
/*     c->put_no_rnd_pixels_tab[0][0] = put_pixels16_arm; */
   c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_arm; // OK
   c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_arm; //OK
/*     c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_arm; //NG */
   c->put_pixels_tab[1][0] = put_pixels8_arm; //OK
   c->put_pixels_tab[1][1] = put_pixels8_x2_arm; //OK
/*     c->put_pixels_tab[1][2] = put_pixels8_y2_arm; //NG */
/*     c->put_pixels_tab[1][3] = put_pixels8_xy2_arm; //NG */
   c->put_no_rnd_pixels_tab[1][0] = put_pixels8_arm;//OK
   c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_arm; //OK
   c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_arm; //OK
/*     c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_arm;//NG */


Compare it with the following (libavcodec/armv4l/dsputil_iwmmxt.c):
CODE
   c->put_pixels_tab[0][0] = put_pixels16_iwmmxt;
   c->put_pixels_tab[0][1] = put_pixels16_x2_iwmmxt;
   c->put_pixels_tab[0][2] = put_pixels16_y2_iwmmxt;
   c->put_pixels_tab[0][3] = put_pixels16_xy2_iwmmxt;
   c->put_no_rnd_pixels_tab[0][0] = put_pixels16_iwmmxt;
   c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_iwmmxt;
   c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_iwmmxt;
   c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_iwmmxt;

   c->put_pixels_tab[1][0] = put_pixels8_iwmmxt;
   c->put_pixels_tab[1][1] = put_pixels8_x2_iwmmxt;
   c->put_pixels_tab[1][2] = put_pixels8_y2_iwmmxt;
   c->put_pixels_tab[1][3] = put_pixels8_xy2_iwmmxt;
   c->put_no_rnd_pixels_tab[1][0] = put_pixels8_iwmmxt;
   c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_iwmmxt;
   c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_iwmmxt;
   c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_iwmmxt;

   c->avg_pixels_tab[0][0] = avg_pixels16_iwmmxt;
   c->avg_pixels_tab[0][1] = avg_pixels16_x2_iwmmxt;
   c->avg_pixels_tab[0][2] = avg_pixels16_y2_iwmmxt;
   c->avg_pixels_tab[0][3] = avg_pixels16_xy2_iwmmxt;
   c->avg_no_rnd_pixels_tab[0][0] = avg_pixels16_iwmmxt;
   c->avg_no_rnd_pixels_tab[0][1] = avg_no_rnd_pixels16_x2_iwmmxt;
   c->avg_no_rnd_pixels_tab[0][2] = avg_no_rnd_pixels16_y2_iwmmxt;
   c->avg_no_rnd_pixels_tab[0][3] = avg_no_rnd_pixels16_xy2_iwmmxt;

   c->avg_pixels_tab[1][0] = avg_pixels8_iwmmxt;
   c->avg_pixels_tab[1][1] = avg_pixels8_x2_iwmmxt;
   c->avg_pixels_tab[1][2] = avg_pixels8_y2_iwmmxt;
   c->avg_pixels_tab[1][3] = avg_pixels8_xy2_iwmmxt;
   c->avg_no_rnd_pixels_tab[1][0] = avg_no_rnd_pixels8_iwmmxt;
   c->avg_no_rnd_pixels_tab[1][1] = avg_no_rnd_pixels8_x2_iwmmxt;
   c->avg_no_rnd_pixels_tab[1][2] = avg_no_rnd_pixels8_y2_iwmmxt;
   c->avg_no_rnd_pixels_tab[1][3] = avg_no_rnd_pixels8_xy2_iwmmxt;


As you see, machines that support iwmmxt have all the motion compensation related functions implemented in hand optimized assembly. It is strange that it only results in about 2% improvement.
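For readers who have not looked at this code before, here is a plain Python model of what the functions in these tables compute. As far as I understand the convention, the first table index selects the block width (0 = 16 pixels, 1 = 8 pixels) and the second selects the half-pixel case (0 = none, 1 = x, 2 = y, 3 = both); treat that reading, and the function names below, as assumptions of this sketch rather than a copy of the libavcodec code.

```python
def avg2(a, b, rnd):
    """Average of two pixels, rounding up on ties unless rnd is False."""
    return (a + b + (1 if rnd else 0)) >> 1


def avg4(a, b, c, d, rnd):
    """Average of four pixels, with the 'no_rnd' variant using a smaller bias."""
    return (a + b + c + d + (2 if rnd else 1)) >> 2


def put_pixels(src, x, y, w, h, dxy, rnd=True):
    """Fetch a w x h block at (x, y); dxy picks the half-pel interpolation."""
    out = []
    for j in range(y, y + h):
        row = []
        for i in range(x, x + w):
            p = src[j][i]
            if dxy == 1:      # half a pixel to the right
                p = avg2(p, src[j][i + 1], rnd)
            elif dxy == 2:    # half a pixel down
                p = avg2(p, src[j + 1][i], rnd)
            elif dxy == 3:    # half a pixel in both directions
                p = avg4(p, src[j][i + 1], src[j + 1][i], src[j + 1][i + 1], rnd)
            row.append(p)
        out.append(row)
    return out


frame = [[10, 13, 20],
         [11, 14, 30],
         [0, 0, 0]]
print(put_pixels(frame, 0, 0, 2, 2, dxy=1))             # [[12, 17], [13, 22]]
print(put_pixels(frame, 0, 0, 2, 2, dxy=1, rnd=False))  # [[11, 16], [12, 22]]
```

The assembly versions do the same arithmetic on many pixels at once, which is why having all sixteen table slots filled in matters for decoding speed.
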

QUOTE
The c7x0 models would benefit from people helping the libw100 project.

I see, but I can't provide any help here as I don't have any hardware but the Nokia 770. More people interested in improving mplayer performance on different ARM devices are welcome here :)

I can only do assembly optimizations for ffmpeg using the armv5te instruction set (including the fast single-cycle multiply DSP instructions).

Concerning current progress, I have made some modifications to valgrind (the callgrind part) so that it simulates read-allocate cache behaviour (arm926 uses such a cache), and I now have some information about the parts of the code that cause many cache misses and do lots of work with memory.

Things that may need optimization and could provide some improvement are:
  • idct
  • motion compensation (for non-iwmmxt devices)
  • the dct_unquantize_h263_intra function (according to callgrind statistics it accounts for almost 7% of the instructions executed for this Doom video fragment; in addition, it contains lots of multiplications which can be accelerated using DSP instructions). One more sign that it needs optimizing is that the x86 code also contains an mmx version of this function :)

Also, I can prepare some small test programs for synthetic benchmarking of all these parts of the code (idct, motion compensation, unquantize) so that it is easier to see whether an optimization has any effect. It is hard to notice a substantial effect from any one of these optimizations when just monitoring full video decoding time, but they are cumulative and all together can provide quite a visible improvement. I have already done something like this when I tried to optimize the idct code (not a very successful attempt, because it focused on code that was not the real bottleneck: row processing in idct generally takes much less time than column processing):
http://lists.mplayerhq.hu/pipermail/ffmpeg...ber/045837.html

Would anyone like to try running these benchmarks, or take a more active part in optimizing mplayer/ffmpeg? ;)

PS. By the way, is it possible to watch that Doom video clip without (many) frame drops on a non-overclocked Zaurus?
danboid
Hi Serge!

I'm willing to do some more benchmarking if it will assist mplayer ARM development.
Civil
QUOTE
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math"

There is no separate "-O4" level in GCC; the maximum optimization level is -O3 (higher numbers are treated as -O3). And be careful with it: sometimes -O2 or even -Os gives better performance, because the more optimization you ask for, the larger the binary grows. Also, -fomit-frame-pointer is already enabled by -O, -O2, -O3 and -Os.
In the ARM version of GCC there is a small difference (according to man gcc) between -mcpu=iwmmxt and -mtune=iwmmxt, so for maximum performance it is good to use both.
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...ptimize-Options
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...tml#ARM-Options

QUOTE
-mtune=name
    This option is very similar to the -mcpu= option, except that instead of specifying the actual target processor type, and hence restricting which instructions can be used, it specifies that GCC should tune the performance of the code as if the target were of the type specified in this option, but still choosing the instructions that it will generate based on the cpu specified by a -mcpu= option. For some ARM implementations better performance can be obtained by using this option.
Serge
civil: http://www.hpc.ru/board/viewtopic.php?t=99079&start=10
Please read my old reply to this same question of yours, in Russian. I tried an online web translator, but the result is not very readable: http://www.online-translator.com/url/tran_...=0&psubmit2.y=0

Anyway, the summary is the following: suggestions for better compiler optimization options are very welcome if they are confirmed by benchmark results. Unfortunately you did not provide any benchmarks even after you were asked for them. I would appreciate it if we keep the discussion here constructive and friendly and don't start debating theoretical matters about how gcc is supposed to work. Thanks.
danboid
Yeah Civil, be civil

(Sorry, couldn't resist :P )
Civil
Serge
They were just comments... I don't know English well enough to write correct sentences, so I write as I can...

QUOTE
Anyway, the summary is the following: suggestions for better compiler optimization options are very much welcome if they are confirmed by benchmark results.

I'll try to compile mplayer 1.0rc1 with different options:
1) -O2 -mtune=iwmmxt -mcpu=iwmmxt
2) -O3 -mtune=iwmmxt -mcpu=iwmmxt
3) -O3 -mtune=iwmmxt -mcpu=iwmmxt -fomit-frame-pointer
and maybe with others. It depends on how long it takes to compile mplayer on the Z. Then I'll post the benchmark results here, in this post, followed by the results I got using mplayer from cacko.
Serge
I wrote a patch for the 'dct_unquantize_h263_intra' function today:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050356.html

It should be useful for armv5te devices which do not have iwmmxt support (the Nokia 770 and probably XScale chips older than PXA27x). This dct_unquantize_h263_intra function takes about 7% of decoding for the Doom.divx trailer, so optimizing it provides a visible performance improvement, at least for this particular video file :)

It can probably be optimized even more, and a better final version of this patch will be available a bit later.
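To show why this function is multiplication heavy, here is a simplified Python model of H.263-style inverse quantization of the AC coefficients, as I understand the plain C version to work. It is an illustration under assumptions, not the ffmpeg code or the patch: the name `unquantize_h263_ac` is made up, the DC coefficient (index 0) is left untouched because it is scaled separately, and the whole block is processed rather than stopping at the last non-zero coefficient.

```python
def unquantize_h263_ac(block, qscale, aic=False):
    """Inverse-quantize AC coefficients: level * 2*qscale, pushed away from zero by qadd."""
    qmul = qscale << 1
    # With advanced intra coding no offset is added; otherwise an odd offset is used.
    qadd = 0 if aic else (qscale - 1) | 1
    out = list(block)
    for i in range(1, len(out)):      # index 0 is the DC coefficient, handled elsewhere
        level = out[i]
        if level < 0:
            out[i] = level * qmul - qadd
        elif level > 0:
            out[i] = level * qmul + qadd
        # zero coefficients stay zero
    return out


print(unquantize_h263_ac([5, 3, -2, 0], qscale=4))  # [5, 27, -19, 0]
```

Each non-zero coefficient costs one small multiplication plus a sign-dependent add, which is the kind of work the 16 x 16 bit DSP multiply instructions are suited to.
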
Serge
OK, I committed the 'dct_unquantize_h263_intra' optimization to maemo mplayer svn. It would be interesting to see the results of running the 'test-unquantize' test program to find out how it behaves on XScale. Some details about the results from the Nokia 770 are here: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

Here are some step-by-step instructions:
1. Check out maemo mplayer svn: 'svn co https://garage.maemo.org/svn/mplayer/trunk maemo-mplayer'
2. Go to 'maemo-mplayer/libavcodec/tests'
3. Compile the test program using the supplied makefile (you will need to set the CC and CFLAGS variables according to the name of your compiler and your preferred optimization settings). You can check 'build-tests-n770.sh' for an example of the settings used to compile this test program for the Nokia 770 (using a cross compiler from gentoo crossdev)
4. Run the test program on your device and post the results here :)

This optimization may be useful for PXA255 or other XScale chips that do not have iwmmxt support (do I understand that correctly?). The 'dct_unquantize_h263' function also has an iwmmxt-optimized implementation in mplayer, which should be used on the latest xscale chips (and the SIMD instructions from iwmmxt should be much better for this kind of code). By the way, the absence of iwmmxt support could also explain the very poor results from koen's PXA255 box. Can somebody investigate what's going on, as not everything is clear yet?
Serge
Well, some more optimizations for the h263 unquantizer. I think this is the final version and it is hardly possible to optimize it more (for armv5te) :)

Test from Nokia 770:
CODE
/media/mmc1 $ ./test-unquantize
no cpu clock frequency specified, trying to autodetect it...
... detected as 251.2MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.07063 usec per element, or 17.7 cycles (251.2MHz)
dct_unquantize_h263_special_helper_armv5te time=0.02692 usec per element, or 6.8 cycles (251.2MHz)


I wonder how it performs on XScale per clock, as loads are now done 64 bits at a time using the LDRD instruction (see my previous post for details on how to run the test).

PS. Thanks to koen for running previous benchmark, it showed that assembly optimized code for dct_unquantize_h263 is also roughly 2x faster than gcc generated code on XScale. But it would be interesting to see some results with this final patch.

Edit: Result for 400MHz XScale cpu (from koen):
CODE
root@h2200:/data/site/mplayer/libavcodec/tests# ./test-unquantize 400; ./test-unquantize
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04329 usec per element, or 17.3 cycles (400.0MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01671 usec per element, or 6.7 cycles (400.0MHz)
no cpu clock frequency specified, trying to autodetect it...
... detected as 376.7MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04277 usec per element, or 16.1 cycles (376.7MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01655 usec per element, or 6.2 cycles (376.7MHz)
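
The cycle counts printed in these logs follow directly from the per-element time and the clock frequency (cycles = usec per element multiplied by MHz). A minimal sketch in C that reproduces the figures; the function names are made up for this illustration and are not part of test-unquantize:

```c
#include <assert.h>

/* Hypothetical helper, not part of test-unquantize:
   microseconds per element times clock in MHz gives cycles per element. */
double sketch_cycles_per_element(double usec_per_element, double clock_mhz)
{
    return usec_per_element * clock_mhz;
}

/* Ratio of the C helper's time to the ARMv5TE helper's time. */
double sketch_speedup(double c_usec, double asm_usec)
{
    return c_usec / asm_usec;
}

/* Checks that the logged values are consistent with each other
   (numbers copied from the Nokia 770 log in this thread). */
void sketch_check_nokia770(void)
{
    double c_cycles = sketch_cycles_per_element(0.07063, 251.2);
    double asm_cycles = sketch_cycles_per_element(0.02692, 251.2);
    assert(c_cycles > 17.65 && c_cycles < 17.85);   /* log prints 17.7 */
    assert(asm_cycles > 6.70 && asm_cycles < 6.85); /* log prints 6.8 */
}
```

By this measure the Nokia 770 and the 400MHz XScale both land at roughly 16 to 18 cycles per element for the C helper and under 7 for the ARMv5TE one, so the gain per clock looks similar on both cores.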
Serge
Just for additional statistics, 'Doom benchmark' for Nokia N800 (keep in mind that MPlayer is not optimized for ARMv6 SIMD instructions at all right now, so there is good potential for improving these results):
CODE
mplayer -benchmark -lavdopts idct=16 -nosound -vo null -loop 5 -quiet Doom.divx
BENCHMARKs: VC:  47.556s VO:   0.069s A:   0.000s Sys:   0.634s =   48.259s
BENCHMARKs: VC:  48.413s VO:   0.071s A:   0.000s Sys:   0.618s =   49.101s
BENCHMARKs: VC:  48.561s VO:   0.073s A:   0.000s Sys:   0.593s =   49.228s
BENCHMARKs: VC:  48.731s VO:   0.072s A:   0.000s Sys:   0.624s =   49.427s
BENCHMARKs: VC:  49.398s VO:   0.072s A:   0.000s Sys:   0.633s =   50.102s
Serge
Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml

As for the other news. The optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for scaler, another option is to try making some use of C55x DSP core). Maybe I'll also try to do some optimizations for motion compensation code. Anyway, there are still lots of things that can be optimized smile.gif
Meanie
QUOTE(Serge @ Jan 18 2007, 09:37 AM)
Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml

As for the other news. The optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for scaler, another option is to try making some use of C55x DSP core). Maybe I'll also try to do some optimizations for motion compensation code. Anyway, there are still lots of things that can be optimized smile.gif
*


There are several flavours of Zaurus OS which all have different hard/soft float requirements. The default Sharp ROM (and also Cacko ROM) use hardfloat. The pdaXrom distribution for Zaurus uses softvfp. OZ (OpenZaurus) uses yet another variant of softfloat...
The latest builds of mplayer rc1 were mainly built for pdaXrom.
Serge
Here is a new progress update report smile.gif I have implemented an initial version of JIT accelerated scaler for planar YUV420 -> packed YUV422 color format. It provides a very nice performance improvement for Nokia 770 already in a new mplayer build for maemo: mplayer_1.0rc1-maemo.8

I will try to get this code integrated into upstream ffmpeg library so that other ARM devices (such as PXA270?) could make use of it and have all the performance problems with scaling solved. Here is a link with some more information, it also includes benchmark results (using the same Doom video clip): http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/051209.html
lardman
Serge,

I'll build your comparison benchmarks for the PXA255 (and SA1110 if it's of interest) once I've got over some minor (I hope) OE build issues.

Si
Civil
QUOTE
Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you.

Just haven't got enough time for tests (exams...).
Default compiler options ( -O4 -pipe -ffast-math -fomit-frame-pointer ):
BENCHMARKs: VC: 52.561s VO: 0.065s A: 0.000s Sys: 0.793s = 53.419s
BENCHMARKs: VC: 56.284s VO: 0.066s A: 0.000s Sys: 0.795s = 57.145s
BENCHMARKs: VC: 56.476s VO: 0.065s A: 0.000s Sys: 0.797s = 57.339s
BENCHMARKs: VC: 56.319s VO: 0.065s A: 0.000s Sys: 0.796s = 57.180s
BENCHMARKs: VC: 56.434s VO: 0.065s A: 0.000s Sys: 0.799s = 57.290s

-O2 -pipe -march=iwmmxt -mcpu=iwmmxt -mtune=iwmmxt -msoft-float:
BENCHMARKs: VC: 53.703s VO: 0.066s A: 0.000s Sys: 0.915s = 54.685s
BENCHMARKs: VC: 56.455s VO: 0.066s A: 0.000s Sys: 0.803s = 57.324s
BENCHMARKs: VC: 56.513s VO: 0.066s A: 0.000s Sys: 0.799s = 57.377s
BENCHMARKs: VC: 56.458s VO: 0.065s A: 0.000s Sys: 0.798s = 57.322s
BENCHMARKs: VC: 56.456s VO: 0.065s A: 0.000s Sys: 0.800s = 57.321s

P.S. mplayer compiled without iwmmxt support. System is running at 416MHz (PXA270). Kernel 2.6.19.2, system compiled with eabi and with -march, -mtune and -mcpu=iwmmxt. GCC 4.1.1, Glibc 2.5 (Gentoo 2006.1). If anyone is interested ( I don't know why Mesk doesn't want to post about his progress with gentoo for zaurus here...)
P.S.S. Tested with: mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx
P.S.S.S. Later I'll add benchmarks with other CFLAGS. It took a lot of time to recompile mplayer on zaurus...
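
To compare runs like the two above without eyeballing them, the VC column can be averaged. A small sketch in C, assuming the line layout shown in this thread (the helper names are invented for this example):

```c
#include <assert.h>
#include <stdio.h>

/* Extracts the VC (video codec) seconds from one "BENCHMARKs:" line.
   Returns 1 on success, 0 if the line does not match
   (for example a "BENCHMARK%:" line). */
int sketch_parse_vc(const char *line, double *vc_seconds)
{
    assert(line != NULL && vc_seconds != NULL);
    return sscanf(line, "BENCHMARKs: VC: %lfs", vc_seconds) == 1;
}

/* Mean VC time over n lines; lines that do not parse are skipped.
   Returns -1.0 when no line could be parsed. */
double sketch_mean_vc(const char *const *lines, int n)
{
    double sum = 0.0;
    int used = 0;
    for (int i = 0; i < n; i++) {
        double vc;
        if (sketch_parse_vc(lines[i], &vc)) {
            sum += vc;
            used++;
        }
    }
    return used ? sum / used : -1.0;
}
```

In both listings above the first pass is noticeably faster than the following four, so it may be worth reporting it separately from the mean.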
Serge
QUOTE(Civil @ Jan 28 2007, 11:58 AM)
P.S. mplayer compiled without iwmmxt support. System is running at 416MHz (PXA270). Kernel 2.6.19.2, system compiled with eabi and with -march, -mtune and -mcpu=iwmmxt. GCC 4.1.1, Glibc 2.5 (Gentoo 2006.1). If anyone is interested ( I don't know why Mesk doesn't want to post about his progress with gentoo for zaurus here...)

Thanks for running these tests. They show that the results for -O3 (-O4) are pretty much the same as for -O2. It would be interesting to compare them against -Os, as this option is most commonly used on embedded devices.

By the way, why was iwmmxt not used? It should provide quite a noticeable improvement, at least theoretically smile.gif

QUOTE
P.S.S. Tested with: mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx
P.S.S.S. Later I'll add benchmarks with other CFLAGS. It took a lot of time to recompile mplayer on zaurus...
*

Thanks, I'm looking forward to more test results. Compiler optimization options are unlikely to provide a big improvement, but every little bit helps.
Serge
QUOTE(adf @ Jan 28 2007, 12:36 PM)
apologies for straying off topic- I'm actually interested in the mplayer work.
BUT--I followed the gentoo link in the last post. if progress is being made, it certainly deserves some attention. A mainstream distro like gentoo that compiles and runs on a Z (well optimized, etc) has been a sort of holy grail for quite a few zaurus users. By all means encourage the people working on the project to post here
*

Wouldn't it be better to create a new topic for discussing gentoo on zaurus? smile.gif Otherwise we risk turning this topic into a mess.
Civil
QUOTE
Wouldn't it be better to create a new topic for discussing gentoo on zaurus? smile.gif Otherwise we risk turning this topic into a mess.

I'm not discussing... And I'm not a developer, so I think the author (Mesk) should post about it. I've just posted basic info so people know what system I'm running now.
Serge
Some more mplayer related news: the mplayer port for maemo should now be more or less usable on Nokia N800 (video freeze issues were fixed by using video output code with direct framebuffer access, just like on Nokia 770). Once adaptation to this new device is finished, code optimization activity will resume smile.gif
tjchick
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code. I have done this for the results below.

Here are my benchmark results on a standard SL-C3200, not overclocked, running OpenZaurus:

BENCHMARKs: VC: 44.056s VO: 0.078s A: 0.000s Sys: 0.831s = 44.965s
BENCHMARK%: VC: 97.9787% VO: 0.1734% A: 0.0000% Sys: 1.8479% = 100.0000%
BENCHMARKs: VC: 43.234s VO: 0.079s A: 0.000s Sys: 0.816s = 44.128s
BENCHMARK%: VC: 97.9734% VO: 0.1785% A: 0.0000% Sys: 1.8481% = 100.0000%
BENCHMARKs: VC: 43.487s VO: 0.076s A: 0.000s Sys: 0.813s = 44.376s
BENCHMARK%: VC: 97.9957% VO: 0.1715% A: 0.0000% Sys: 1.8328% = 100.0000%
BENCHMARKs: VC: 43.669s VO: 0.076s A: 0.000s Sys: 0.820s = 44.565s
BENCHMARK%: VC: 97.9891% VO: 0.1712% A: 0.0000% Sys: 1.8398% = 100.0000%
BENCHMARKs: VC: 43.497s VO: 0.078s A: 0.000s Sys: 0.810s = 44.386s
BENCHMARK%: VC: 97.9976% VO: 0.1764% A: 0.0000% Sys: 1.8260% = 100.0000%

Tim
Serge
QUOTE(tjchick @ Mar 14 2007, 07:39 AM)
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.

Do you really need to change the code to use iwmmx? Isn't it a simple matter of properly running configure?

Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make
tjchick
QUOTE(Serge @ Mar 14 2007, 05:14 PM)
QUOTE(tjchick @ Mar 14 2007, 07:39 AM)
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.

Do you really need to change the code to use iwmmx? Isn't it a simple matter of properly running configure?

Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make
*



Yes, you really do - the code gets compiled, but not used, as the code is only installed following a test like this:
if( mm_flags & MM_IWMMXT ) -> install dsp code.

It fills in mm_flags with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting this going. I wonder if this is related to the lavdopts somehow?

That's why the others only saw a 2% improvement (compiling with the better tune options), and I see a 30% or so improvement.
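
To illustrate the problem described here: if the detected flags are always 0, the optimized routine gets compiled but is never selected unless a force mask is applied. A simplified sketch in C; the SKETCH_* constants, their values, and all function names are hypothetical and do not reproduce the actual FFmpeg code:

```c
#include <assert.h>

/* Hypothetical bit values chosen for this sketch only. */
#define SKETCH_MM_IWMMXT 0x0100u
#define SKETCH_MM_FORCE  0x80000000u

typedef const char *(*sketch_impl_fn)(void);

const char *sketch_impl_c(void)      { return "generic C"; }
const char *sketch_impl_iwmmxt(void) { return "iwmmxt"; }

/* Stand-in for CPU detection that always reports no features. */
unsigned sketch_detect_flags(void)
{
    return 0u;
}

/* When the force bit is set in the mask, its low 16 bits are ORed
   into the detected flags; otherwise the mask is ignored here. */
unsigned sketch_effective_flags(unsigned detected, unsigned dsp_mask)
{
    if (dsp_mask & SKETCH_MM_FORCE)
        return detected | (dsp_mask & 0xffffu);
    return detected;
}

/* The optimized routine is installed only if its flag bit is set. */
sketch_impl_fn sketch_select_impl(unsigned flags)
{
    return (flags & SKETCH_MM_IWMMXT) ? sketch_impl_iwmmxt : sketch_impl_c;
}

void sketch_self_check(void)
{
    unsigned detected = sketch_detect_flags();
    unsigned forced = sketch_effective_flags(
        detected, SKETCH_MM_FORCE | SKETCH_MM_IWMMXT);
    assert(sketch_select_impl(detected) == sketch_impl_c);
    assert(sketch_select_impl(forced) == sketch_impl_iwmmxt);
}
```

With detection returning 0, only the forced path ever reaches the iwmmxt routine, which would match the behaviour reported in this post.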

Tim
Meanie
QUOTE(tjchick @ Mar 15 2007, 02:29 AM)
QUOTE(Serge @ Mar 14 2007, 05:14 PM)
QUOTE(tjchick @ Mar 14 2007, 07:39 AM)
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.

Do you really need to change the code to use iwmmx? Isn't it a simple matter of properly running configure?

Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make
*



Yes, you really do - the code gets compiled, but not used, as the code is only installed following a test like this:
if( mm_flags & MM_IWMMXT ) -> install dsp code.

It fills in mm_flags with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting this going. I wonder if this is related to the lavdopts somehow?

That's why the others only saw a 2% improvement (compiling with the better tune options), and I see a 30% or so improvement.

Tim
*



if you pull latest source from svn, you can just use --enable-iwmmxt
Serge
QUOTE(tjchick @ Mar 14 2007, 08:29 AM)
Yes, you really do - the code gets compiled, but not used, as the code is only installed following a test like this:
if( mm_flags & MM_IWMMXT ) -> install dsp code.

It fills in mm_flags with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting this going. I wonder if this is related to the lavdopts somehow?

That's why the others only saw a 2% improvement (compiling with the better tune options), and I see a 30% or so improvement.

Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instructions support for MPlayer configure, I could not verify that IWMMXT works as well (for an obvious reason, I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html

Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems with enabling iwmmxt, please try to make a clean fix and submit this patch upstream. If you check the first post in this thread, you will see that upstream developers are not very familiar with ARM platform. Only atty did some improvements for MPlayer at some time in the past, but he is unwilling to help upstream to integrate his fixes for whatever reason. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job. And upstream developers are not obliged to fix our problems.

PS. I'm sorry if it was me who created a false impression of IWMMXT being fully supported in MPlayer 1.0.rc1 sad.gif

edit: IWMMX has some additional registers, so their save/restore on context switches should be probably supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even though cpu itself may support them? Anyway that was just a wild guess, I'm not familiar with XScale at all.

And thanks for actually digging into the code and checking if iwmmxt really works, the results posted in this thread were suspicious from the very start smile.gif
tjchick
QUOTE(Serge @ Mar 14 2007, 06:32 PM)
Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instructions support for MPlayer configure, I could not verify that IWMMXT works as well (for an obvious reason, I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html

Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems with enabling iwmmxt, please try to make a clean fix and submit this patch upstream.
*

I already did this stuff yesterday, before I saw your messages. Yes Meanie, even latest SVN does not fix matters. I posted a patch to the ffmpeg dev mailing list, got some feedback and posted another patch. Am awaiting the response.

QUOTE
If you check the first post in this thread, you will see that upstream developers are not very familiar with ARM platform. Only atty did some improvements for MPlayer at some time in the past, but he is unwilling to help upstream to integrate his fixes for whatever reason. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job. And upstream developers are not obliged to fix our problems.

PS. I'm sorry if it was me who created a false impression of IWMMXT being fully supported in MPlayer 1.0.rc1 sad.gif

edit: IWMMX has some additional registers, so their save/restore on context switches should be probably supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even though cpu itself may support them? Anyway that was just a wild guess, I'm not familiar with XScale at all.

And thanks for actually digging into the code and checking if iwmmxt really works, the results posted in this thread were suspicious from the very start smile.gif
*

Yes, IWMMX needs OS support, as well as having the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.

I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase. So either the version which was in use was pretty good (be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output), or the system is memory bound as others have suggested. It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)

Here are the new results:
BENCHMARKs: VC: 43.497s
BENCHMARKs: VC: 42.813s
BENCHMARKs: VC: 43.040s
BENCHMARKs: VC: 43.269s
BENCHMARKs: VC: 43.090s

Thanks,
Tim
Serge
QUOTE(tjchick @ Mar 15 2007, 01:51 AM)
Yes, IWMMX needs OS support, as well as having the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.

That's probably fine. By the way, you can also try to compile MPlayer with the use of Intel IPP (Integrated Performance Primitives) library and check if it helps to improve performance.

QUOTE
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.

This should not be a problem as dct_unquantize_h263_inter is not a performance critical function. But it is pretty much similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with the statistics about decoding this Doom video clip on Nokia 770: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

QUOTE
So either the version which was in use was pretty good

It was just not performance critical, I wonder why you even managed to see some improvement wink.gif

QUOTE
(be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output),

Actually I find compiler generated code for ARM quite poorly optimized. It can't make good use of conditionally executed instructions, can't use DSP instructions, and can't schedule code in an optimal way to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck if you want any visible overall performance improvement.

I prefer to always develop some simple performance and correctness tests for the performance critical functions I'm optimizing. So I can ensure that they really provide performance improvement and do not introduce stability issues.

Random assembly hacking is not a productive way of working for sure smile.gif
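
A correctness test of the kind described above can be as simple as running the reference and the optimized routine on the same pseudo-random blocks and comparing the output. A sketch in C, assuming a simplified H.263-style dequantization loop written for this example (it is not the FFmpeg source, and the branch-free variant merely stands in for hand-optimized code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK 64

/* Reference: plain C loop (simplified, illustrative dequantizer). */
void sketch_unquantize_ref(int16_t *block, int qscale)
{
    int qmul = qscale << 1;
    int qadd = (qscale - 1) | 1;
    for (int i = 0; i < SKETCH_BLOCK; i++) {
        int level = block[i];
        if (level) {
            if (level < 0) level = level * qmul - qadd;
            else           level = level * qmul + qadd;
            block[i] = (int16_t)level;
        }
    }
}

/* Candidate: branch-free variant standing in for optimized code. */
void sketch_unquantize_opt(int16_t *block, int qscale)
{
    int qmul = qscale << 1;
    int qadd = (qscale - 1) | 1;
    for (int i = 0; i < SKETCH_BLOCK; i++) {
        int level = block[i];
        int sign = (level > 0) - (level < 0); /* -1, 0 or +1 */
        block[i] = (int16_t)(level * qmul + sign * qadd);
    }
}

/* Small deterministic generator so that failures are reproducible. */
static uint32_t sketch_next(uint32_t *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Returns the number of blocks on which the two versions disagree. */
int sketch_count_mismatches(uint32_t seed, int rounds)
{
    int mismatches = 0;
    for (int r = 0; r < rounds; r++) {
        int16_t a[SKETCH_BLOCK], b[SKETCH_BLOCK];
        int qscale = 1 + (int)(sketch_next(&seed) % 31u);
        /* Levels stay in [-256, 255] so results fit in int16_t. */
        for (int i = 0; i < SKETCH_BLOCK; i++)
            a[i] = (int16_t)((int)(sketch_next(&seed) % 512u) - 256);
        memcpy(b, a, sizeof a);
        sketch_unquantize_ref(a, qscale);
        sketch_unquantize_opt(b, qscale);
        if (memcmp(a, b, sizeof a) != 0)
            mismatches++;
    }
    return mismatches;
}

void sketch_self_check(void)
{
    assert(sketch_count_mismatches(1u, 1000) == 0);
}
```

A timing loop around the same two functions would cover the performance half; it is left out here to keep the sketch short.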

QUOTE
or the system is memory bound as others have suggested.

This particular function is run on fully cached data, so memory access time is not important here. I investigated mplayer memory access pattern using valgrind (callgrind tool) getting more or less precise information about cache misses.

Code that heavily depends on memory performance is in motion compensation functions and partially idct (cache write misses for destination buffer).

QUOTE
It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)

Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.

PS. In order to ensure that video is decoded not only fast, but also correctly, you can use the '-vo md5' option. I noticed some really ugly video decoding artefacts when using the standard ARM optimized IDCT (some vertical stripes on panning scenes), while the ARMv5TE optimized IDCT produces output identical to the C implementation.
tjchick
QUOTE(Serge @ Mar 15 2007, 07:52 PM)
QUOTE(tjchick @ Mar 15 2007, 01:51 AM)
Yes, IWMMX needs OS support, as well as having the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.

That's probably fine. By the way, you can also try to compile MPlayer with the use of Intel IPP (Integrated Performance Primitives) library and check if it helps to improve performance.

I think it does, as I know the cacko mplayer-atty is faster again than "mine", and that uses the IPP stuff for idct. I was not really interested in trying it though, due to the license restrictions of IPP.

QUOTE
QUOTE
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.

This should not be a problem as dct_unquantize_h263_inter is not a performance critical function. But it is pretty much similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with the statistics about decoding this Doom video clip on Nokia 770:



One thing I'm going to do is compare the iwmmxt code against your armv5te code, performance wise.

Cheers,
Tim
Meanie
actually, i think your new build is much faster than atty's in decoding speed.

here are the benchmark results of running atty's iwmmxt optimized build of mplayer on C3000 with pdaXrom

BENCHMARKs: VC: 40.385s VO: 0.068s A: 0.000s Sys: 0.863s = 41.315s
BENCHMARKs: VC: 47.495s VO: 0.067s A: 0.000s Sys: 0.860s = 48.421s
BENCHMARKs: VC: 45.600s VO: 0.067s A: 0.000s Sys: 0.843s = 46.509s
BENCHMARKs: VC: 45.629s VO: 0.068s A: 0.000s Sys: 0.865s = 46.562s
BENCHMARKs: VC: 45.820s VO: 0.068s A: 0.000s Sys: 0.859s = 46.748s

for comparison, here are the benchmark results of the SVN mplayer code with armv5te enabled and xscale tuning CC flags

BENCHMARKs: VC: 52.105s VO: 0.026s A: 0.000s Sys: 1.047s = 53.178s
BENCHMARKs: VC: 53.503s VO: 0.027s A: 0.000s Sys: 0.923s = 54.453s
BENCHMARKs: VC: 54.030s VO: 0.027s A: 0.000s Sys: 0.914s = 54.970s
BENCHMARKs: VC: 53.926s VO: 0.027s A: 0.000s Sys: 0.931s = 54.883s
BENCHMARKs: VC: 53.267s VO: 0.034s A: 0.000s Sys: 0.927s = 54.228s
tjchick
On cacko on c1000, I see:
VC: 36.186
VC: 36.927
VC: 37.662
VC: 36.932
VC: 37.016

And similar figures for sys. Cacko uses atty's mplayer, which still seems to be the best by quite a margin!

At a guess this is due to IPP for IDCT.

Thanks,
Tim
Serge
You can try to override idct by using '-lavdopts idct=<some_number>' in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.

By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
koen
QUOTE(Serge @ Mar 21 2007, 04:26 PM)
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
*


Except that ARM has no immediate assignments and needs aligned data...
Serge
QUOTE(koen @ Mar 21 2007, 08:42 AM)
QUOTE(Serge @ Mar 21 2007, 04:26 PM)
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
*

Except that ARM has no immediate assignments

MMX instruction set does not have immediate assignments either wink.gif In any case, that's not a big deal.

QUOTE
and needs aligned data...

FFmpeg takes special care of alignment: many functions have a guaranteed alignment specified for the data they process (some SSE instructions require 16-byte alignment after all, so ARM is not the most strict in this respect). Input data for IDCT is also 16-byte aligned for example, that's more than enough for ARM smile.gif
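
As a small illustration of the alignment guarantee mentioned above, here is a sketch in C11 that requests 16-byte alignment for a coefficient block and checks a pointer for it. The type and function names are invented for this example, and this is not how FFmpeg itself declares its buffers:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if ptr is aligned to `alignment` bytes
   (alignment must be a power of two). */
int sketch_is_aligned(const void *ptr, uintptr_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* A 64-coefficient block with 16-byte alignment requested
   through the C11 _Alignas specifier. */
typedef struct {
    _Alignas(16) int16_t coeffs[64];
} sketch_dct_block;

/* Debug check that could run before handing the block to SIMD code. */
void sketch_check_block(const sketch_dct_block *b)
{
    assert(sketch_is_aligned(b->coeffs, 16));
}
```

A check like this costs next to nothing in a debug build and catches misaligned buffers before they reach instructions that require alignment.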

Anyway, somebody just needs to give it a try. To encourage you more and prove that it might work, looks like atty took the existing MMX implementation of dct_unquantize_h263_intra_mmx and converted it to dct_unquantize_h263_intra_iwmmxt smile.gif Probably he did not care about IDCT as he could just use IPP instead, so maybe doing a conversion from MMX to IWMMXT for IDCT is also possible with not so much work (everything is relative of course). I wonder what implementation would be faster? On one hand IPP is a library developed by professionals from Intel, on the other hand FFmpeg proved to be very well optimized beating many other codecs on x86 platform and default IDCT used in it is MMX optimized.
koen
QUOTE(Serge @ Mar 22 2007, 06:56 PM)
QUOTE(koen @ Mar 21 2007, 08:42 AM)
QUOTE(Serge @ Mar 21 2007, 04:26 PM)
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
*

Except that ARM has no immediate assignments

MMX instruction set does not have immediate assignments either wink.gif In any case, that's not a big deal.

QUOTE
and needs aligned data...

FFmpeg takes special care of alignment: many functions have a guaranteed alignment specified for the data they process (some SSE instructions require 16-byte alignment after all, so ARM is not the most strict in this respect). Input data for IDCT is also 16-byte aligned for example, that's more than enough for ARM smile.gif

*



Right, o-hand ported the fbmmx layer in the xserver to iwmmx but it wasn't faster since you had to align the data by hand. Maybe ffmpeg can gain more.
tjchick
QUOTE(Serge @ Mar 21 2007, 05:26 PM)
You can try to override idct by using '-lavdopts idct=<some_number>'  in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.


I did try it, and using the non-IPP IDCT produces results which are comparable ish. atty mplayer is still faster by 10% or so, so there are still a few more tweaks I need to sort out, but it was 40% better when using ipp.

Cheers,
Tim