Dec 27 2006, 02:16 PM
Post #16
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
Thanks for running the benchmarks. They show that these armv5te optimizations for idct are useful on xscale too. I was just unsure whether it is possible to develop shared code that runs well on both arm926 and xscale, or whether two different versions have to be implemented. I'll try to optimize this idct as far as possible, primarily for arm926, but I'll keep in mind that the code is also useful on xscale and take that into account.

QUOTE(koen @ Dec 27 2006, 01:27 AM)
I ran the benchmark on my ipaq h2200 (400MHz pxa255) and I can see that the memory bus is a bottleneck, since the 770 and pxa270 machines run the bus at a higher speed.

That's interesting. If memory performance is really that important for mplayer, it should be possible to find the parts of the code with heavy memory use and optimize their memory access patterns for better cache and memory bus utilization. Some time ago I ran some tests to figure out how to make the best use of memory bandwidth on the Nokia 770:
http://maemo.org/pipermail/maemo-developer...ber/006579.html
This information may turn out to be very useful for further optimizations.

QUOTE
If that isn't the case, arm926 cores kick xscale ass

Well, the arm926 core should be somewhat faster per clock; here are some links to optimization docs for different arm flavours:
http://www.internettablettalk.com/forums/s...read.php?t=2406
But I expected 416MHz to still be a lot faster because of the higher cpu clock frequency. Maybe memory performance really is the limiting factor here and it brings the performance of all these chips closer together. Another possible explanation could be a nonoptimal set of optimization options or an older version of gcc for the zaurus builds of mplayer. It should be relatively easy to test mplayer with a different set of optimization options. You can take the upstream mplayer 1.0rc1 tarball and compile it using:

CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

It may have some problems with the video/audio output drivers if compiled without the zaurus specific patches, but this should not matter when testing decoding performance only.

Dec 31 2006, 12:40 PM
Post #17
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
QUOTE(koen @ Dec 25 2006, 04:16 AM)
The cxxx models can also use iwmmxt instructions, but a crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.

That seems a bit too low; I wonder whether mplayer was configured and compiled correctly. The point is that the motion compensation code in mplayer is currently much better optimized for iwmmxt (all of that work was done by atty). You can just look into the mplayer sources. Here is the code used for ARM without iwmmxt (libavcodec/armv4l/dsputil_arm.c):

CODE
/* c->put_pixels_tab[0][0] = put_pixels16_arm; */ // NG!
c->put_pixels_tab[0][1] = put_pixels16_x2_arm; //OK!
c->put_pixels_tab[0][2] = put_pixels16_y2_arm; //OK!
/* c->put_pixels_tab[0][3] = put_pixels16_xy2_arm; /\* NG *\/ */
/* c->put_no_rnd_pixels_tab[0][0] = put_pixels16_arm; */
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_arm; // OK
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_arm; //NG */
c->put_pixels_tab[1][0] = put_pixels8_arm; //OK
c->put_pixels_tab[1][1] = put_pixels8_x2_arm; //OK
/* c->put_pixels_tab[1][2] = put_pixels8_y2_arm; //NG */
/* c->put_pixels_tab[1][3] = put_pixels8_xy2_arm; //NG */
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_arm;//OK
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_arm; //OK
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_arm;//NG */

Compare it with the following (libavcodec/armv4l/dsputil_iwmmxt.c):

CODE
c->put_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_pixels_tab[0][1] = put_pixels16_x2_iwmmxt;
c->put_pixels_tab[0][2] = put_pixels16_y2_iwmmxt;
c->put_pixels_tab[0][3] = put_pixels16_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_iwmmxt;
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_iwmmxt;
c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_iwmmxt;
c->put_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_pixels_tab[1][1] = put_pixels8_x2_iwmmxt;
c->put_pixels_tab[1][2] = put_pixels8_y2_iwmmxt;
c->put_pixels_tab[1][3] = put_pixels8_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_iwmmxt;
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_iwmmxt;
c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_iwmmxt;
c->avg_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_pixels_tab[0][1] = avg_pixels16_x2_iwmmxt;
c->avg_pixels_tab[0][2] = avg_pixels16_y2_iwmmxt;
c->avg_pixels_tab[0][3] = avg_pixels16_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_no_rnd_pixels_tab[0][1] = avg_no_rnd_pixels16_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][2] = avg_no_rnd_pixels16_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][3] = avg_no_rnd_pixels16_xy2_iwmmxt;
c->avg_pixels_tab[1][0] = avg_pixels8_iwmmxt;
c->avg_pixels_tab[1][1] = avg_pixels8_x2_iwmmxt;
c->avg_pixels_tab[1][2] = avg_pixels8_y2_iwmmxt;
c->avg_pixels_tab[1][3] = avg_pixels8_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][0] = avg_no_rnd_pixels8_iwmmxt;
c->avg_no_rnd_pixels_tab[1][1] = avg_no_rnd_pixels8_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][2] = avg_no_rnd_pixels8_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][3] = avg_no_rnd_pixels8_xy2_iwmmxt;

As you can see, machines that support iwmmxt have all the motion compensation related functions implemented in hand optimized assembly. It is strange that this only results in about a 2% improvement.

QUOTE
The c7x0 models would benefit from people helping the libw100 project.

I see, but I can't provide any help here as I don't have any hardware other than a Nokia 770. More people interested in improving mplayer performance on different ARM devices are welcome here. I can only do assembly optimizations for ffmpeg using the armv5te instruction set (including the fast single cycle multiply dsp instructions).
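The two tables quoted above differ in how many slots of a function-pointer table get overridden: the plain ARM file enables 9 of the 16 put/put_no_rnd entries and has no avg entries, while the iwmmxt file fills every slot. Below is a minimal sketch of that dispatch pattern. All names in it (`DemoContext`, `demo_init_generic`, `demo_init_arm`, `copy8_c`, `copy8_fast`) are hypothetical and are not the ffmpeg API, and it assumes that a generic initializer fills every slot with a C routine before the platform-specific one runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical signature, loosely modelled on a block-copy routine. */
typedef void (*pixel_op)(uint8_t *dst, const uint8_t *src, int stride, int h);

/* Hypothetical context: [size][subpel]; subpel 0 = full-pel,
 * 1 = x half-pel, 2 = y half-pel, 3 = xy half-pel. */
typedef struct DemoContext {
    pixel_op put_pixels_tab[2][4];
} DemoContext;

/* Portable C fallback: plain copy of an 8-pixel-wide block. */
static void copy8_c(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    for (int y = 0; y < h; y++)
        memcpy(dst + y * stride, src + y * stride, 8);
}

/* Stand-in for a hand-optimized routine; here it simply does the same work. */
static void copy8_fast(uint8_t *dst, const uint8_t *src, int stride, int h)
{
    copy8_c(dst, src, stride, h);
}

/* Step 1: every slot gets a C implementation (simplified: the same
 * routine is used for all slots, which a real codec would not do). */
static void demo_init_generic(DemoContext *c)
{
    assert(c != NULL);
    for (int size = 0; size < 2; size++)
        for (int subpel = 0; subpel < 4; subpel++)
            c->put_pixels_tab[size][subpel] = copy8_c;
}

/* Step 2: the platform init overrides only the slots it trusts.
 * Slots left alone (the commented-out "NG" lines in the quoted code)
 * keep whatever step 1 installed. */
static void demo_init_arm(DemoContext *c)
{
    assert(c != NULL);
    c->put_pixels_tab[1][0] = copy8_fast;
}

/* Count how many slots still point at the generic C routine. */
static int demo_count_fallbacks(const DemoContext *c)
{
    int n = 0;
    for (int size = 0; size < 2; size++)
        for (int subpel = 0; subpel < 4; subpel++)
            if (c->put_pixels_tab[size][subpel] == copy8_c)
                n++;
    return n;
}
```

Calling `demo_init_generic` and then `demo_init_arm` leaves seven of the eight slots on the C fallback, which is the situation the non-iwmmxt table is in for every commented-out entry.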
Concerning the current progress: I have made some modifications to valgrind (the callgrind part) so that it simulates read-allocate cache behaviour (arm926 uses such a cache), and now have some information about the parts of the code that cause many cache misses and do a lot of work with memory. Things that may need optimization and could provide some improvement are:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ber/045837.html

Would anyone want to try running these benchmarks, or take a more active part in optimizing mplayer/ffmpeg?

PS. By the way, is it possible to watch that Doom video clip without (many) framedrops on a nonoverclocked Zaurus?
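For readers unfamiliar with the term: in a read-allocate cache a line is filled only on a read miss, while a write miss goes to memory without bringing the line into the cache. The toy direct-mapped model below illustrates just that policy. The sizes and every name in it (`ToyCache`, `toy_cache_read`, `toy_cache_write`) are illustrative assumptions; they are neither the real arm926 cache parameters nor the callgrind modification mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u   /* bytes per cache line (illustrative) */
#define NUM_LINES 64u   /* direct-mapped, illustrative size */

typedef struct ToyCache {
    bool     valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned read_misses;
    unsigned write_misses;
} ToyCache;

static void toy_cache_reset(ToyCache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Split an address into line index and tag; report whether it hits. */
static bool toy_cache_lookup(const ToyCache *c, uint32_t addr,
                             unsigned *idx, uint32_t *tag)
{
    uint32_t line = addr / LINE_SIZE;
    *idx = line % NUM_LINES;
    *tag = line / NUM_LINES;
    return c->valid[*idx] && c->tag[*idx] == *tag;
}

/* Read: on a miss the line is allocated, so later accesses hit. */
static bool toy_cache_read(ToyCache *c, uint32_t addr)
{
    unsigned idx;
    uint32_t tag;
    if (toy_cache_lookup(c, addr, &idx, &tag))
        return true;
    c->read_misses++;
    c->valid[idx] = true;
    c->tag[idx] = tag;
    return false;
}

/* Write: on a miss nothing is allocated, so repeated writes to the
 * same uncached line miss every time. */
static bool toy_cache_write(ToyCache *c, uint32_t addr)
{
    unsigned idx;
    uint32_t tag;
    if (toy_cache_lookup(c, addr, &idx, &tag))
        return true;
    c->write_misses++;
    return false;
}
```

In this model, code that only writes to a buffer keeps missing until something reads from the same line, which is one reason a simulator that assumes write-allocate could give misleading miss counts for such a cache.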

Jan 1 2007, 12:29 AM
Post #18
Group: Members Posts: 682 Joined: 26-December 05 From: Rochdale, Lancashire Member No.: 8,789
Hi Serge!
I'm willing to do some more benchmarking if it will assist mplayer ARM development.

Jan 1 2007, 02:24 AM
Post #19
Group: Members Posts: 103 Joined: 22-August 05 From: Moscow, Russia. Member No.: 7,924
QUOTE
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math"

There is no "-O4". The maximum optimization level is -O3. And be careful with it: sometimes it is better to use -O2 or even -Os for performance. The more optimization you enable, the larger the binary grows. And -fomit-frame-pointer is enabled in -O, -O2, -O3, -Os.
In the ARM version of GCC there is a small difference (according to man gcc) between -mcpu=iwmmxt and -mtune=iwmmxt, so for maximum performance it is good to use both.
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...ptimize-Options
http://gcc.gnu.org/onlinedocs/gcc-3.4.6/gc...tml#ARM-Options

QUOTE
-mtune=name
This option is very similar to the -mcpu= option, except that instead of specifying the actual target processor type, and hence restricting which instructions can be used, it specifies that GCC should tune the performance of the code as if the target were of the type specified in this option, but still choosing the instructions that it will generate based on the cpu specified by a -mcpu= option. For some ARM implementations better performance can be obtained by using this option.

Jan 1 2007, 03:00 AM
Post #20
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
civil: http://www.hpc.ru/board/viewtopic.php?t=99079&start=10
Please read my earlier reply (in Russian) to the same question when you asked it before. I tried an online web translator, but the result is not very readable: http://www.online-translator.com/url/tran_...=0&psubmit2.y=0

Anyway, the summary is the following: suggestions for better compiler optimization options are very welcome if they are confirmed by benchmark results. Unfortunately you did not provide any benchmarks even after you had been asked for them. I would appreciate it if we kept the discussion here constructive and friendly and did not start debating theoretical matters about how gcc is supposed to work. Thanks.

Jan 1 2007, 03:29 AM
Post #21
Group: Members Posts: 682 Joined: 26-December 05 From: Rochdale, Lancashire Member No.: 8,789
Yeah Civil, be civil
(Sorry, couldn't resist)

Jan 1 2007, 03:47 AM
Post #22
Group: Members Posts: 103 Joined: 22-August 05 From: Moscow, Russia. Member No.: 7,924
Serge
Those were just comments... I don't know English well enough to make correct sentences, so I write as I can...

QUOTE
Anyway, the summary is the following: suggestions for better compiler optimization options are very much welcome if they are confirmed by benchmark results.

I'll try to compile mplayer 1.0 rc1 with different options:
1) -O2 -mtune=iwmmxt -mcpu=iwmmxt
2) -O3 -mtune=iwmmxt -mcpu=iwmmxt
3) -O3 -mtune=iwmmxt -mcpu=iwmmxt -fomit-frame-pointer
and maybe with others. It depends on how much time it takes to compile mplayer on the Z. Then I'll post the benchmark results here, in this post, followed by the results I got using mplayer from cacko.

Jan 1 2007, 06:37 PM
Post #23
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
I made a patch for the 'dct_unquantize_h263_intra' function today:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050356.html
It should be useful for armv5te devices which do not have iwmmxt support (the Nokia 770 and probably XScale chips older than PXA27x). This dct_unquantize_h263_intra function takes about 7% of the decoding time for the Doom.xvid trailer, so optimizing it provides a visible performance improvement, at least for this particular video file. It can probably be optimized even more, and a better final version of this patch will be available a bit later.
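For context, here is a rough C sketch of what an H.263-style intra dequantizer does, as I understand the scheme: scale the DC coefficient, then for every nonzero AC coefficient multiply by 2*qscale and push the result away from zero by an odd offset. The function names and the `dc_scale` parameter are hypothetical, and this is neither the ffmpeg source nor the patch linked above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: dequantize n coefficients in place.
 * Nonzero levels become level * 2*qscale, moved away from zero by
 * qadd, where qadd is qscale forced to an odd value. */
static void demo_unquantize_h263(int16_t *block, int n, int qscale)
{
    assert(qscale >= 1 && qscale <= 31);
    const int qmul = qscale << 1;
    const int qadd = (qscale - 1) | 1;

    for (int i = 0; i < n; i++) {
        int level = block[i];
        if (level == 0)
            continue;               /* zero coefficients stay zero */
        if (level < 0)
            level = level * qmul - qadd;
        else
            level = level * qmul + qadd;
        block[i] = (int16_t)level;
    }
}

/* Hypothetical intra variant: the DC coefficient uses its own scale
 * factor, the remaining coefficients go through the helper above. */
static void demo_unquantize_h263_intra(int16_t *block, int n,
                                       int qscale, int dc_scale)
{
    assert(n >= 1);
    block[0] = (int16_t)(block[0] * dc_scale);
    demo_unquantize_h263(block + 1, n - 1, qscale);
}
```

With `qscale = 5` and `dc_scale = 8`, the block `{10, 3, -2, 0, 1}` becomes `{80, 35, -25, 0, 15}`. The loop body is a multiply, a sign-dependent add and a zero test per coefficient, which is the kind of short, regular work that hand-written armv5te assembly can schedule tightly.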

Jan 2 2007, 09:32 AM
Post #24
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
OK, I committed the 'dct_unquantize_h263_intra' optimization to maemo mplayer svn. It would be interesting to see the results of running the 'test-unquantize' test program to benchmark how it behaves on XScale. Some details about the results from Nokia 770 are here: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

Here are some step by step instructions:
1. Check out maemo mplayer svn: 'svn co https://garage.maemo.org/svn/mplayer/trunk maemo-mplayer'
2. Go to 'maemo-mplayer/libavcodec/tests'
3. Compile the test program using the supplied makefile (you will need to set the CC and CFLAGS variables according to the name of your compiler and your preferred optimization settings). You can check 'build-tests-n770.sh' as an example of settings for compiling this test program for Nokia 770 (using a crosscompiler from gentoo crossdev)
4. Run the test program on your device and post the results here

This optimization may be useful for PXA255 or other XScale chips that do not have iwmmxt support (do I understand that correctly?). The 'dct_unquantize_h263' function also has an iwmmxt optimized implementation in mplayer, which should be used on the latest xscale chips (and the SIMD instructions from iwmmxt should be much better for this kind of code). By the way, the absence of iwmmxt support could also explain the very poor results from the PXA255 box provided by koen. Can somebody investigate what the matter is, as not everything is clear yet?

Jan 6 2007, 08:24 AM
Post #25
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
Well, some more optimizations for the h263 unquantizer. I think this is the final version and it is hardly possible to optimize it further (for armv5te).

Test from Nokia 770:

CODE
/media/mmc1 $ ./test-unquantize
no cpu clock frequency specified, trying to autodetect it...
... detected as 251.2MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.07063 usec per element, or 17.7 cycles (251.2MHz)
dct_unquantize_h263_special_helper_armv5te time=0.02692 usec per element, or 6.8 cycles (251.2MHz)

I wonder how it performs on XScale per clock, as loads are now done 64 bits at a time using the LDRD instruction (see my previous post for details on how to run the test).

PS. Thanks to koen for running the previous benchmark; it showed that the assembly optimized code for dct_unquantize_h263 is also roughly 2x faster than gcc generated code on XScale. But it would be interesting to see some results with this final patch.

Edit: Result for a 400MHz XScale cpu (from koen):

CODE
root@h2200:/data/site/mplayer/libavcodec/tests# ./test-unquantize 400; ./test-unquantize
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04329 usec per element, or 17.3 cycles (400.0MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01671 usec per element, or 6.7 cycles (400.0MHz)
no cpu clock frequency specified, trying to autodetect it...
... detected as 376.7MHz
running correctness tests...
running performance tests...
dct_unquantize_h263_helper_c time=0.04277 usec per element, or 16.1 cycles (376.7MHz)
dct_unquantize_h263_special_helper_armv5te time=0.01655 usec per element, or 6.2 cycles (376.7MHz)
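The "cycles" figures in these logs follow directly from the measured time and the clock: microseconds per element multiplied by MHz gives cycles per element. A small sketch of that arithmetic, with hypothetical helper names that are not part of the test program:

```c
#include <assert.h>

/* usec per element times cycles per usec (MHz) = cycles per element. */
static double cycles_per_element(double usec_per_element, double mhz)
{
    assert(usec_per_element > 0.0 && mhz > 0.0);
    return usec_per_element * mhz;
}

/* Ratio of reference time to optimized time. */
static double speedup(double t_reference, double t_optimized)
{
    assert(t_reference > 0.0 && t_optimized > 0.0);
    return t_reference / t_optimized;
}
```

Plugging in the posted numbers: 0.07063 usec at 251.2MHz is about 17.7 cycles and 0.02692 usec is about 6.8 cycles, a speedup of roughly 2.6x on the Nokia 770. The 400MHz XScale run gives 0.04329 / 0.01671, again roughly 2.6x, so per clock the two cores behave very similarly on this function.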

Jan 8 2007, 02:29 PM
Post #26
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
Just for additional statistics, here is the 'Doom benchmark' for the Nokia N800 (keep in mind that MPlayer is not optimized for ARMv6 SIMD instructions at all right now, so these results have good potential for improvement):

CODE
mplayer -benchmark -lavdopts idct=16 -nosound -vo null -loop 5 -quiet Doom.divx
BENCHMARKs: VC: 47.556s VO: 0.069s A: 0.000s Sys: 0.634s = 48.259s
BENCHMARKs: VC: 48.413s VO: 0.071s A: 0.000s Sys: 0.618s = 49.101s
BENCHMARKs: VC: 48.561s VO: 0.073s A: 0.000s Sys: 0.593s = 49.228s
BENCHMARKs: VC: 48.731s VO: 0.072s A: 0.000s Sys: 0.624s = 49.427s
BENCHMARKs: VC: 49.398s VO: 0.072s A: 0.000s Sys: 0.633s = 50.102s
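When comparing builds it helps to reduce the five loop iterations to a single number. The sketch below parses lines in the exact format shown above and averages the totals. The struct and function names are hypothetical, and the format string is only an assumption based on the five lines quoted in this post.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed "BENCHMARKs:" line, all values in seconds. */
typedef struct BenchLine {
    double vc, vo, a, sys, total;
} BenchLine;

/* Returns 1 if the line matches the format quoted above, 0 otherwise. */
static int parse_bench_line(const char *line, BenchLine *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line,
                   "BENCHMARKs: VC: %lfs VO: %lfs A: %lfs Sys: %lfs = %lfs",
                   &out->vc, &out->vo, &out->a, &out->sys, &out->total);
    return n == 5;
}

/* Arithmetic mean of the total time over n runs. */
static double mean_total(const BenchLine *runs, int n)
{
    assert(runs != NULL && n > 0);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += runs[i].total;
    return sum / n;
}
```

For the five N800 runs above the mean total is about 49.2 seconds, almost all of it video decoding (the VC column), with the slow upward drift between iterations left unexplained by the log itself.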

Jan 17 2007, 03:37 PM
Post #27
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance with benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example, for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about the possible choices of architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml

As for the other news: the optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for the scaler; another option is to try making some use of the C55x DSP core). Maybe I'll also try to do some optimizations for the motion compensation code. Anyway, there are still lots of things that can be optimized.

Jan 17 2007, 04:40 PM
Post #28
Group: Members Posts: 2,808 Joined: 21-March 05 From: Sydney, Australia Member No.: 6,686
QUOTE(Serge @ Jan 18 2007, 09:37 AM)
Hello again. I guess the benchmarks of -Os vs. -O2 and -O3 on zaurus for mplayer are not going anywhere. Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you. I only need to know what configuration is needed for crossdev to build binaries for zaurus. For example for Nokia 770 it is arm-softfloat-linux-gnueabi. More details about possible choices for architecture and abi can be read here: http://www.gentoo.org/proj/en/base/embedde...development.xml
As for the other news. The optimized dequantizer has been committed upstream, so it will be included in mplayer-1.0rc2 or whatever version gets released next. I'm currently trying to do some additional optimizations to color conversion and scaling for Nokia 770 (probably using JIT generated code for scaler, another option is to try making some use of C55x DSP core). Maybe I'll also try to do some optimizations for motion compensation code. Anyway, there are still lots of things that can be optimized

There are several flavours of Zaurus OS, and they all have different hard/soft float requirements. The default Sharp ROM (and also the Cacko ROM) uses hardfloat. The pdaXrom distribution for Zaurus uses softvfp. OZ (OpenZaurus) uses yet another variant of softfloat... The latest builds of mplayer rc1 were mainly built for pdaXrom.

Jan 22 2007, 02:30 PM
Post #29
Group: Members Posts: 51 Joined: 8-October 06 Member No.: 11,724
Here is a new progress update report.
I will try to get this code integrated into the upstream ffmpeg library so that other ARM devices (such as PXA270?) can make use of it and have all the performance problems with scaling solved. Here is a link with some more information; it also includes benchmark results (using the same Doom video clip): http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/051209.html

Jan 22 2007, 02:55 PM
Post #30
Group: Members Posts: 4,515 Joined: 25-October 03 From: Bath, UK Member No.: 464
Serge,
I'll build your comparison benchmarks for the PXA255 (and SA1110 if it's of interest) once I've got over some minor (I hope) OE build issues.

Si