Author Topic: Mplayer Development And Optimization For Arm (Read 91452 times)

Serge · « **Reply #15 on:** December 27, 2006, 05:16:21 pm »

Thanks for running benchmarks. They show that these armv5te optimizations for idct are useful for xscale too. I was unsure whether it is possible to develop shared code that runs well on both arm926 and xscale, or whether two different versions would have to be implemented. I'll try to optimize this idct as far as possible, primarily for arm926, but will keep in mind that the code is also useful on xscale and take this into account. Anyway, an iwmmxt implementation of idct specifically optimized for xscale may be a better choice (idct takes quite a noticeable fraction of decoding time, so it is at least useful for some machines like zaurus C3000). If anybody skilled with arm assembly would like to try it, I could help with information (but I don't have any machine that can run iwmmxt code anyway).

Quote

I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.

That's interesting. If memory performance is really very important for mplayer, it should probably be possible to find the parts of code with heavy memory use and optimize their memory access patterns for better cache and memory bus utilization. Some time ago I ran tests to figure out how to make the best use of memory bandwidth on Nokia 770: http://maemo.org/pipermail/maemo-developer...ber/006579.html

This information can turn out to be very useful for further optimizations.

Quote

If that isn't the case, arm926 cores kick xscale ass

Well, the arm926 core should be somewhat faster per clock. Here are some links to optimization docs for different arm flavours: http://www.internettablettalk.com/forums/s...read.php?t=2406

But I expected that 416MHz should be still a lot faster because of higher cpu clock frequency. Maybe memory performance is really a limiting factor here and it makes performance of all these chips closer to each other.

Another possible explanation could be a nonoptimal set of optimization options or an older version of gcc for zaurus builds of mplayer. It should be relatively easy to test mplayer with a different set of optimization options. You can take the upstream mplayer 1.0rc1 tarball and compile it using:
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

It may have some problems with video/audio output drivers if compiled without zaurus specific patches, but this should not be a problem for testing decoding capabilities only.

Serge · « **Reply #16 on:** December 31, 2006, 03:40:12 pm »

Quote

The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.

That seems a bit too low, I wonder if mplayer was configured and compiled correctly. The point is that motion compensation code in mplayer is currently much better optimized for iwmmxt (all that work was done by atty). You can just look into the mplayer sources.

Here is the code used for ARM without iwmmxt (libavcodec/armv4l/dsputil_arm.c):

Code:

/*     c->put_pixels_tab[0][0] = put_pixels16_arm; */ // NG!
    c->put_pixels_tab[0][1] = put_pixels16_x2_arm; //OK!
    c->put_pixels_tab[0][2] = put_pixels16_y2_arm; //OK!
/*     c->put_pixels_tab[0][3] = put_pixels16_xy2_arm; /\* NG *\/ */
/*     c->put_no_rnd_pixels_tab[0][0] = put_pixels16_arm; */
    c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_arm; // OK
    c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_arm; //OK
/*     c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_arm; //NG */
    c->put_pixels_tab[1][0] = put_pixels8_arm; //OK
    c->put_pixels_tab[1][1] = put_pixels8_x2_arm; //OK
/*     c->put_pixels_tab[1][2] = put_pixels8_y2_arm; //NG */
/*     c->put_pixels_tab[1][3] = put_pixels8_xy2_arm; //NG */
    c->put_no_rnd_pixels_tab[1][0] = put_pixels8_arm;//OK
    c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_arm; //OK
    c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_arm; //OK
/*     c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_arm;//NG */

Compare it with the following (libavcodec/armv4l/dsputil_iwmmxt.c):

Code:

    c->put_pixels_tab[0][0] = put_pixels16_iwmmxt;
    c->put_pixels_tab[0][1] = put_pixels16_x2_iwmmxt;
    c->put_pixels_tab[0][2] = put_pixels16_y2_iwmmxt;
    c->put_pixels_tab[0][3] = put_pixels16_xy2_iwmmxt;
    c->put_no_rnd_pixels_tab[0][0] = put_pixels16_iwmmxt;
    c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_iwmmxt;
    c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_iwmmxt;
    c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_iwmmxt;

    c->put_pixels_tab[1][0] = put_pixels8_iwmmxt;
    c->put_pixels_tab[1][1] = put_pixels8_x2_iwmmxt;
    c->put_pixels_tab[1][2] = put_pixels8_y2_iwmmxt;
    c->put_pixels_tab[1][3] = put_pixels8_xy2_iwmmxt;
    c->put_no_rnd_pixels_tab[1][0] = put_pixels8_iwmmxt;
    c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_iwmmxt;
    c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_iwmmxt;
    c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_iwmmxt;

    c->avg_pixels_tab[0][0] = avg_pixels16_iwmmxt;
    c->avg_pixels_tab[0][1] = avg_pixels16_x2_iwmmxt;
    c->avg_pixels_tab[0][2] = avg_pixels16_y2_iwmmxt;
    c->avg_pixels_tab[0][3] = avg_pixels16_xy2_iwmmxt;
    c->avg_no_rnd_pixels_tab[0][0] = avg_pixels16_iwmmxt;
    c->avg_no_rnd_pixels_tab[0][1] = avg_no_rnd_pixels16_x2_iwmmxt;
    c->avg_no_rnd_pixels_tab[0][2] = avg_no_rnd_pixels16_y2_iwmmxt;
    c->avg_no_rnd_pixels_tab[0][3] = avg_no_rnd_pixels16_xy2_iwmmxt;

    c->avg_pixels_tab[1][0] = avg_pixels8_iwmmxt;
    c->avg_pixels_tab[1][1] = avg_pixels8_x2_iwmmxt;
    c->avg_pixels_tab[1][2] = avg_pixels8_y2_iwmmxt;
    c->avg_pixels_tab[1][3] = avg_pixels8_xy2_iwmmxt;
    c->avg_no_rnd_pixels_tab[1][0] = avg_no_rnd_pixels8_iwmmxt;
    c->avg_no_rnd_pixels_tab[1][1] = avg_no_rnd_pixels8_x2_iwmmxt;
    c->avg_no_rnd_pixels_tab[1][2] = avg_no_rnd_pixels8_y2_iwmmxt;
    c->avg_no_rnd_pixels_tab[1][3] = avg_no_rnd_pixels8_xy2_iwmmxt;

As you see, machines that support iwmmxt have all the motion compensation related functions implemented in hand optimized assembly. It is strange that it only results in about 2% improvement.
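The difference between the two listings can be sketched with a toy dispatch table. This is only an illustration under assumptions: `toy_dsp_context`, `put_pixels8_fast` and the two init functions are hypothetical names, not the real ffmpeg API, and the "fast" routine is a plain memcpy placeholder rather than assembly. The generic init fills every slot with a C fallback, and the CPU specific init overrides only the slots it has a working routine for, which is what the commented-out "NG" lines in dsputil_arm.c amount to.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the put_pixels_tab rows shown above. */
typedef void (*pixel_op)(uint8_t *dst, const uint8_t *src, int stride, int h);

typedef struct {
    pixel_op put_pixels8_tab[2]; /* [0] = plain copy, [1] = horizontal half-pel (x2) */
} toy_dsp_context;

/* Portable C fallback: copy an 8 pixel wide block. */
void put_pixels8_c(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = src[y * stride + x];
}

/* Portable C fallback: average each pixel with its right neighbour, rounding up. */
void put_pixels8_x2_c(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] =
                (uint8_t)((src[y * stride + x] + src[y * stride + x + 1] + 1) >> 1);
}

/* Placeholder for a hand optimized routine (assumption: real code would be assembly). */
void put_pixels8_fast(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 8);
}

/* Generic init: every slot gets a C implementation. */
void toy_dsp_init_c(toy_dsp_context *c)
{
    assert(c != NULL);
    c->put_pixels8_tab[0] = put_pixels8_c;
    c->put_pixels8_tab[1] = put_pixels8_x2_c;
}

/* CPU specific init: override only the slots with a working optimized routine.
   Slot [1] is left alone, so it keeps the slower C fallback. */
void toy_dsp_init_fast(toy_dsp_context *c)
{
    assert(c != NULL);
    c->put_pixels8_tab[0] = put_pixels8_fast;
}
```

Calling `toy_dsp_init_c` and then `toy_dsp_init_fast` leaves slot [0] optimized and slot [1] on the C path. Every slot left on the C path is decoding time that an iwmmxt build would spend in assembly instead.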

Quote

The c7x0 models would benefit from people helping the libw100 project.

I see, but I can't provide any help here as I don't have any hardware but Nokia 770. More people interested in improving mplayer performance on different ARM devices are welcome here.

I can only do assembly optimizations for ffmpeg using armv5te instruction set (including fast single cycle multiply dsp instructions).

Concerning the current progress, I have made some modifications to valgrind (the callgrind part) to make it simulate read-allocate cache behaviour (arm926 uses such a cache), and now have some information about the parts of code that cause many cache misses and do lots of work with memory.
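To illustrate what "read-allocate" means here, a toy model is sketched below. It is not the valgrind/callgrind modification itself (that code is not shown in this thread), and the geometry is an assumption chosen for simplicity: a direct-mapped cache with 32 byte lines, whereas a real arm926 cache is set associative. The point it shows is that a read miss brings the line into the cache, while a write miss goes straight to memory and allocates nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry: 512 lines of 32 bytes, direct-mapped. */
enum { LINE_SIZE = 32, NUM_LINES = 512 };

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned long read_misses;
    unsigned long write_misses;
} toy_cache;

void toy_cache_reset(toy_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Returns 1 on hit; always reports which index/tag the address maps to. */
static int toy_cache_lookup(const toy_cache *c, uint32_t addr,
                            uint32_t *index, uint32_t *tag)
{
    uint32_t line = addr / LINE_SIZE;
    *index = line % NUM_LINES;
    *tag = line / NUM_LINES;
    return c->valid[*index] && c->tag[*index] == *tag;
}

/* Read miss: count it and allocate the line. */
void toy_cache_read(toy_cache *c, uint32_t addr)
{
    uint32_t index, tag;
    if (!toy_cache_lookup(c, addr, &index, &tag)) {
        c->read_misses++;
        c->valid[index] = 1;
        c->tag[index] = tag;
    }
}

/* Write miss: count it, but do NOT allocate (read-allocate policy). */
void toy_cache_write(toy_cache *c, uint32_t addr)
{
    uint32_t index, tag;
    if (!toy_cache_lookup(c, addr, &index, &tag))
        c->write_misses++;
}
```

With this policy, code that writes to a buffer it has not read first misses on every store, which is the kind of pattern such a simulation makes visible.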

Things that may need optimizations and provide some improvement are:

idct
motion compensation (for non iwmmxt devices)
dct_unquantize_h263_intra function (according to callgrind statistics it accounts for almost 7% of the instructions executed for this Doom video fragment, and it contains lots of multiplications which can be accelerated using dsp instructions; another sign that it is worth optimizing is that the x86 code also contains an mmx version of this function)

Also I can prepare some small test programs for synthetic benchmarking of all these parts of code (idct, motion compensation, unquantize) so that it will be easier to see if the optimizations have any effect. It is hard to notice a substantial effect from any single one of these optimizations when just monitoring full video decoding time, but they are cumulative and all together can provide quite a visible improvement. I have already done something like this when I tried to optimize the idct code (not a very successful attempt, because it focused on code that was not the real bottleneck: row processing in idct generally takes much less time than column processing):
http://lists.mplayerhq.hu/pipermail/ffmpeg...ber/045837.html
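A synthetic benchmark of this kind can be as small as the sketch below. This is not the test program from the link above; `kernel_fn`, `toy_kernel_double` and `bench_usec_per_element` are hypothetical names, and the kernel is a trivial stand-in for idct or unquantize. It uses the standard `clock()` call, which is coarse, so a real run needs enough iterations to take a measurable amount of time.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef void (*kernel_fn)(int16_t *data, int n);

/* Hypothetical kernel standing in for idct / unquantize / motion compensation. */
void toy_kernel_double(int16_t *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] = (int16_t)(data[i] * 2);
}

/* Runs fn over the buffer `iterations` times and returns the average
   CPU time in microseconds per processed element. */
double bench_usec_per_element(kernel_fn fn, int16_t *data, int n, int iterations)
{
    assert(fn != NULL && data != NULL && n > 0 && iterations > 0);

    clock_t start = clock();
    for (int it = 0; it < iterations; it++)
        fn(data, n);
    clock_t end = clock();

    double total_usec = (double)(end - start) * 1e6 / CLOCKS_PER_SEC;
    return total_usec / ((double)n * (double)iterations);
}
```

Running the C version and the optimized version of a function through the same harness gives two per-element times that can be compared directly, independent of everything else the decoder does.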

Would anyone want to try running these benchmarks, or take some more active part in optimizing mplayer/ffmpeg?

PS. By the way, is it possible to watch that Doom video clip without (much) framedrops on nonoverclocked Zaurus?

danboid · « **Reply #17 on:** January 01, 2007, 03:29:50 am »

Hi Serge!

I'm willing to do some more benchmarking if it will assist mplayer ARM development.

Civil · « **Reply #18 on:** January 01, 2007, 05:24:43 am »

Quote

CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math"

There is no "-O4". Maximum optimization is -O3. And be careful with it. Sometimes it is better to use -O2 or even -Os for performance... The more optimization you apply, the larger the binary grows... And -fomit-frame-pointer is enabled in -O, -O2, -O3, -Os
In the ARM version of GCC there is a small difference (according to man gcc) between -mcpu=iwmmxt and -mtune=iwmmxt, so for maximum performance it is good to use both.
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...ptimize-Options
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...tml#ARM-Options

Quote

-mtune=name
This option is very similar to the -mcpu= option, except that instead of specifying the actual target processor type, and hence restricting which instructions can be used, it specifies that GCC should tune the performance of the code as if the target were of the type specified in this option, but still choosing the instructions that it will generate based on the cpu specified by a -mcpu= option. For some ARM implementations better performance can be obtained by using this option.

Serge · « **Reply #19 on:** January 01, 2007, 06:00:04 am »

civil: http://www.hpc.ru/board/viewtopic.php?t=99079&start=10
Please read my old reply (in Russian) to the same question when you asked it before. I tried to use an online web translator, but the result is not very readable: http://www.online-translator.com/url/tran_...=0&psubmit2.y=0

Anyway, the summary is the following: suggestions for better compiler optimization options are very welcome if they are confirmed by benchmark results. Unfortunately you did not provide any benchmarks even after you were asked for them. I would appreciate it if we keep the discussion constructive and friendly here and don't start discussing theoretical matters about how gcc is supposed to work. Thanks.

danboid · « **Reply #20 on:** January 01, 2007, 06:29:53 am »

Yeah Civil, be civil

(Sorry, couldn't resist )

Civil · « **Reply #21 on:** January 01, 2007, 06:47:10 am »

Serge
Those were just comments... I don't know English well enough to write correct sentences, so I write as I can...

Quote

Anyway, the summary is the following: suggestions for better compiler optimization options are very much welcome if they are confirmed by benchmark results.

I'll try to compile mplayer 1.0 rc1 with different options:
1) -O2 -mtune=iwmmxt -mcpu=iwmmxt
2) -O3 -mtune=iwmmxt -mcpu=iwmmxt
3) -O3 -mtune=iwmmxt -mcpu=iwmmxt -fomit-frame-pointer
and maybe with others. It depends on how long it takes to compile mplayer on the Z. Then I'll post benchmark results here, in this post, followed by the results I got using mplayer from cacko.

Serge · « **Reply #22 on:** January 01, 2007, 09:37:26 pm »

Wrote a patch for the 'dct_unquantize_h263_intra' function today:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050356.html

It should be useful for armv5te devices which do not have iwmmxt support (for Nokia 770 and probably for XScale chips older than PXA27x). This dct_unquantize_h263_intra function takes about 7% of decoding time for the Doom.xvid trailer, so optimizing it provides a visible performance improvement, at least for this particular video file.

Probably it can be optimized even more and a better final version of this patch will be available a bit later.

Serge · « **Reply #23 on:** January 02, 2007, 12:32:06 pm »

OK, committed 'dct_unquantize_h263_intra' optimization to maemo mplayer svn. It would be interesting to see the results of running 'test-unquantize' test program to benchmark how it behaves on XScale. Some details about the results from Nokia 770 are here: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

Here are some step by step instructions:
1. Checkout maemo mplayer svn: 'svn co https://garage.maemo.org/svn/mplayer/trunk maemo-mplayer'
2. Go to 'maemo-mplayer/libavcodec/tests'
3. Compile the test program using supplied makefile (you will need to set CC and CFLAGS variables according to the name of your compiler and preferred optimizations settings), you can check 'build-tests-n770.sh' as an example of settings for compiling this test program for Nokia 770 (using crosscompiler from gentoo crossdev)
4. Run test program on your device and post the results here

This optimization may be useful for PXA255 or other XScale chips that do not have iwmmxt support (do I understand that correctly?). This 'dct_unquantize_h263' function also has an iwmmxt optimized implementation in mplayer, which should be used on the latest xscale chips (SIMD instructions from iwmmxt should be much better for this kind of code). By the way, absence of iwmmxt support could also explain the very poor results from the PXA255 box provided by koen. Can somebody investigate what's the matter, as not everything is clear yet?

Serge · « **Reply #24 on:** January 06, 2007, 11:24:13 am »

Well, here are some more optimizations for the h263 unquantizer. I think it is a final version and it is hardly possible to optimize it more (for armv5te).

Test from Nokia 770:

Code:

/media/mmc1 $ ./test-unquantize
no cpu clock frequency specified, trying to autodetect it...
... detected as 251.2MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.07063 usec per element, or 17.7 cycles (251.2MHz)
dct_unquantize_h263_special_helper_armv5te time=0.02692 usec per element, or 6.8 cycles (251.2MHz)

I wonder how it performs on XScale per clock, as loads are now done 64 bits at a time using the LDRD instruction (see my previous post for details on how to run the test).
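For comparing results across devices with different clocks, the per-clock figure is the useful one. The cycle counts printed by the test follow from the time per element multiplied by the clock frequency, since 1 MHz is one cycle per microsecond. The small sketch below (function names are hypothetical, not taken from the test program) redoes that arithmetic for the Nokia 770 numbers above.

```c
#include <assert.h>

/* usec per element * clock in MHz = cycles per element. */
double cycles_per_element(double usec_per_element, double cpu_mhz)
{
    assert(usec_per_element >= 0.0 && cpu_mhz > 0.0);
    return usec_per_element * cpu_mhz;
}

/* How many times faster the optimized version is than the baseline. */
double speedup(double baseline_usec, double optimized_usec)
{
    assert(baseline_usec >= 0.0 && optimized_usec > 0.0);
    return baseline_usec / optimized_usec;
}
```

With the figures from the listing, `cycles_per_element(0.07063, 251.2)` is about 17.7 and `cycles_per_element(0.02692, 251.2)` is about 6.8, matching the printed values, and `speedup(0.07063, 0.02692)` is roughly 2.6.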

PS. Thanks to koen for running previous benchmark, it showed that assembly optimized code for dct_unquantize_h263 is also roughly 2x faster than gcc generated code on XScale. But it would be interesting to see some results with this final patch.

Edit: Result for 400MHz XScale cpu (from koen):

Code:

root@h2200:/data/site/mplayer/libavcodec/tests# ./test-unquantize 400; ./test-unquantize 
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04329 usec per element, or 17.3 cycles (400.0MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01671 usec per element, or 6.7 cycles (400.0MHz)
no cpu clock frequency specified, trying to autodetect it...
... detected as 376.7MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04277 usec per element, or 16.1 cycles (376.7MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01655 usec per element, or 6.2 cycles (376.7MHz)

Serge · « **Reply #25 on:** January 08, 2007, 05:29:56 pm »

Just for additional statistics, here is the 'Doom benchmark' for Nokia N800 (keep in mind that MPlayer is not optimized for ARMv6 SIMD instructions at all right now, so these results have good potential for improvement):

Code:

mplayer -benchmark -lavdopts idct=16 -nosound -vo null -loop 5 -quiet Doom.divx
BENCHMARKs: VC:  47.556s VO:   0.069s A:   0.000s Sys:   0.634s =   48.259s
BENCHMARKs: VC:  48.413s VO:   0.071s A:   0.000s Sys:   0.618s =   49.101s
BENCHMARKs: VC:  48.561s VO:   0.073s A:   0.000s Sys:   0.593s =   49.228s
BENCHMARKs: VC:  48.731s VO:   0.072s A:   0.000s Sys:   0.624s =   49.427s
BENCHMARKs: VC:  49.398s VO:   0.072s A:   0.000s Sys:   0.633s =   50.102s

Serge · « **Reply #26 on:** January 17, 2007, 06:37:05 pm »

Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml

As for other news: the optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for the scaler; another option is to try making some use of the C55x DSP core). Maybe I'll also try to do some optimizations for the motion compensation code. Anyway, there are still lots of things that can be optimized.

Meanie · « **Reply #27 on:** January 17, 2007, 07:40:39 pm »

Quote

Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml

As for other news: the optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for the scaler; another option is to try making some use of the C55x DSP core). Maybe I'll also try to do some optimizations for the motion compensation code. Anyway, there are still lots of things that can be optimized.

There are several flavours of Zaurus OS which all have different hard/soft float requirements. The default Sharp ROM (and also Cacko ROM) use hardfloat. The pdaXrom distribution for Zaurus uses softvfp. OZ (OpenZaurus) uses yet another variant of softfloat...
The latest builds of mplayer rc1 were mainly built for pdaXrom.

Serge · « **Reply #28 on:** January 22, 2007, 05:30:37 pm »

Here is a new progress update. I have implemented an initial version of a JIT accelerated scaler that converts planar YUV420 to the packed YUV422 color format. It already provides a very nice performance improvement for Nokia 770 in a new mplayer build for maemo: mplayer_1.0rc1-maemo.8

I will try to get this code integrated into upstream ffmpeg library so that other ARM devices (such as PXA270?) could make use of it and have all the performance problems with scaling solved. Here is a link with some more information, it also includes benchmark results (using the same Doom video clip): http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/051209.html

lardman · « **Reply #29 on:** January 22, 2007, 05:55:49 pm »

Serge,

I'll build your comparison benchmarks for the PXA255 (and SA1110 if it's of interest) once I've got over some minor (I hope) OE build issues.

Si
