Yes, IWMMXT needs OS support as well as the right processor. Unfortunately, I (and others) cannot find a simple, portable method for detecting this. So the only option is to try to use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try compiling MPlayer against the Intel IPP (Integrated Performance Primitives) library and check whether it improves performance.
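By the way, since there is no portable runtime check, the usual pattern is to select the implementation entirely at build time. A minimal sketch of that pattern, with made-up macro and function names (HAVE_IWMMXT, clear_block_iwmmxt) just for illustration:

```c
#include <stdio.h>

/* Illustration only: the macro and function names here (HAVE_IWMMXT,
 * clear_block_iwmmxt, ...) are made up for this sketch. */
typedef struct {
    void (*clear_block)(short *block);
} MyDSPContext;

static void clear_block_c(short *block)
{
    for (int i = 0; i < 64; i++)
        block[i] = 0;
}

#ifdef HAVE_IWMMXT
/* The real thing would be written with WMMX instructions and is only
 * compiled when the toolchain is given the right -mcpu/-march switches.
 * This stub just keeps the sketch compilable. */
static void clear_block_iwmmxt(short *block)
{
    clear_block_c(block);
}
#endif

static void dsp_init(MyDSPContext *c)
{
    c->clear_block = clear_block_c;        /* portable default */
#ifdef HAVE_IWMMXT
    c->clear_block = clear_block_iwmmxt;   /* all-or-nothing: no runtime check */
#endif
}

int main(void)
{
    short block[64];
    MyDSPContext c;

    dsp_init(&c);
    c.clear_block(block);
    printf("block[0] = %d\n", block[0]);
    return 0;
}
```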
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.
This should not be a problem, as dct_unquantize_h263_inter is not a performance-critical function. But it is very similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with statistics about decoding this Doom video clip on the Nokia 770:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

So either the version which was in use was pretty good
It was just not performance critical; I wonder how you even managed to see any improvement.
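For reference, the inter and intra variants really are almost the same small loop. A simplified C sketch of the idea (not the exact FFmpeg code; the intra variant mainly differs in how the DC coefficient is treated):

```c
#include <stdint.h>

/* Simplified sketch of H.263-style dequantization: every nonzero
 * coefficient is scaled by 2*qscale and biased away from zero.
 * Not the exact FFmpeg implementation. */
void dct_unquantize_h263_sketch(int16_t *block, int n_coeffs, int qscale)
{
    int qmul = qscale << 1;
    int qadd = (qscale - 1) | 1;

    for (int i = 0; i < n_coeffs; i++) {
        int level = block[i];
        if (level) {
            if (level < 0)
                level = level * qmul - qadd;
            else
                level = level * qmul + qadd;
            block[i] = (int16_t)level;
        }
    }
}
```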
(be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output),
Actually I find compiler-generated code for ARM quite poorly optimized. It can't make good use of conditionally executed instructions, can't use the DSP instructions, and doesn't schedule code optimally to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck if you want any visible overall performance improvement.
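As a small illustration of the DSP instruction point (assuming an ARMv5TE-capable target; the inline assembly is just a sketch): a saturating add written in portable C becomes a compare-and-branch sequence, while ARMv5TE has a single qadd instruction for it, which the compiler will not emit on its own.

```c
#include <stdio.h>
#include <limits.h>

/* Portable saturating add: the compiler typically turns this into
 * compares and branches. */
static int sat_add_c(int a, int b)
{
    long long s = (long long)a + b;
    if (s > INT_MAX) return INT_MAX;
    if (s < INT_MIN) return INT_MIN;
    return (int)s;
}

#if defined(__arm__)
/* The same operation as a single ARMv5TE DSP instruction (qadd).
 * Needs a target with the DSP extension, e.g. -march=armv5te. */
static int sat_add_qadd(int a, int b)
{
    int r;
    __asm__ ("qadd %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}
#endif

int main(void)
{
    printf("%d\n", sat_add_c(INT_MAX, 100)); /* clamps to INT_MAX */
#if defined(__arm__)
    printf("%d\n", sat_add_qadd(INT_MAX, 100));
#endif
    return 0;
}
```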
I prefer to always develop some simple performance and correctness tests for the performance-critical functions I'm optimizing, so I can ensure that they really provide a performance improvement and do not introduce stability issues.
Random assembly hacking is certainly not a productive way of working.
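A minimal sketch of the kind of harness I mean, using the unquantize loop as the example; the "optimized" function here is just a stand-in so the code compiles, in practice it would be the assembler routine:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

typedef int16_t DCTELEM;

/* Reference C implementation: the known-good baseline. */
static void unquant_ref(DCTELEM *block, int n, int qmul, int qadd)
{
    for (int i = 0; i < n; i++) {
        int level = block[i];
        if (level)
            block[i] = (DCTELEM)(level < 0 ? level * qmul - qadd
                                           : level * qmul + qadd);
    }
}

/* Candidate under test: in real use this would be the hand-written
 * assembler routine; here it is just a stand-in for the sketch. */
static void unquant_opt(DCTELEM *block, int n, int qmul, int qadd)
{
    unquant_ref(block, n, qmul, qadd);
}

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    DCTELEM src[64], a[64], b[64];

    /* Correctness: many random blocks, results must match bit-exactly. */
    for (int iter = 0; iter < 10000; iter++) {
        for (int i = 0; i < 64; i++)
            src[i] = (DCTELEM)(rand() % 512 - 256);
        memcpy(a, src, sizeof(a));
        memcpy(b, src, sizeof(b));
        unquant_ref(a, 64, 10, 5);
        unquant_opt(b, 64, 10, 5);
        if (memcmp(a, b, sizeof(a)) != 0) {
            printf("mismatch at iteration %d\n", iter);
            return 1;
        }
    }

    /* Performance: time many calls on the same (cached) block; the
     * memcpy restores the input and adds a small constant overhead. */
    double t = now_sec();
    for (int iter = 0; iter < 1000000; iter++) {
        memcpy(a, src, sizeof(a));
        unquant_opt(a, 64, 10, 5);
    }
    printf("ok, %.3f s for 1M calls\n", now_sec() - t);
    return 0;
}
```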
or the system is memory bound as others have suggested.
This particular function runs on fully cached data, so memory access time is not important here. I investigated MPlayer's memory access patterns using valgrind (the callgrind tool), getting more or less precise information about cache misses.
The code that depends heavily on memory performance is in the motion compensation functions and, partially, the IDCT (cache write misses for the destination buffer).
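To illustrate why motion compensation is memory bound: stripped of the interpolation/averaging variants, the basic operation is just copying a block of pixels between two buffers, so there is almost no computation per byte touched (a simplified sketch, not the actual FFmpeg put_pixels code):

```c
#include <stdint.h>

/* Simplified sketch of an 8-pixel-wide block copy as used in motion
 * compensation: one load and one store per pixel and almost no
 * arithmetic, so speed is dictated by memory/cache behaviour. */
void copy_block8(uint8_t *dst, const uint8_t *src,
                 int dst_stride, int src_stride, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = src[x];
        dst += dst_stride;
        src += src_stride;
    }
}
```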
It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)
Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.
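For example, the same kind of block copy can join the accesses into 32-bit words and add a prefetch hint for the next line; this is only a sketch and assumes the pointers and strides are 4-byte aligned:

```c
#include <stdint.h>

/* Sketch of the 8-pixel copy with accesses joined into 32-bit words and
 * a prefetch hint for the next source line. Assumes src/dst and both
 * strides are 4-byte aligned; __builtin_prefetch maps to the ARM 'pld'
 * instruction on cores that have it (ARMv5TE and later). */
void copy_block8_words(uint8_t *dst, const uint8_t *src,
                       int dst_stride, int src_stride, int h)
{
    for (int y = 0; y < h; y++) {
        const uint32_t *s = (const uint32_t *)src;
        uint32_t *d = (uint32_t *)dst;

        __builtin_prefetch(src + src_stride);  /* start fetching next line */
        d[0] = s[0];
        d[1] = s[1];

        dst += dst_stride;
        src += src_stride;
    }
}
```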
PS. In order to ensure that video is decoded not only fast but also correctly, you can use the '-vo md5' option. I noticed some really ugly video decoding artefacts when using the standard ARM-optimized IDCT (vertical stripes in panning scenes); the ARMv5TE-optimized IDCT is identical to the C implementation.