Post #31 by Civil, Jan 28 2007, 11:58 AM
Group: Members; Posts: 103; Joined: 22-August 05; From: Moscow, Russia; Member No.: 7,924
QUOTE
Do you need any assistance in benchmarking? I could probably build some mplayer binaries with different optimization options for zaurus if it is too hard for you.

I just haven't had enough time for tests (exams...).

Default compiler options (-O4 -pipe -ffast-math -fomit-frame-pointer):
BENCHMARKs: VC: 52.561s VO: 0.065s A: 0.000s Sys: 0.793s = 53.419s
BENCHMARKs: VC: 56.284s VO: 0.066s A: 0.000s Sys: 0.795s = 57.145s
BENCHMARKs: VC: 56.476s VO: 0.065s A: 0.000s Sys: 0.797s = 57.339s
BENCHMARKs: VC: 56.319s VO: 0.065s A: 0.000s Sys: 0.796s = 57.180s
BENCHMARKs: VC: 56.434s VO: 0.065s A: 0.000s Sys: 0.799s = 57.290s

-O2 -pipe -march=iwmmxt -mcpu=iwmmxt -mtune=iwmmxt -msoft-float:
BENCHMARKs: VC: 53.703s VO: 0.066s A: 0.000s Sys: 0.915s = 54.685s
BENCHMARKs: VC: 56.455s VO: 0.066s A: 0.000s Sys: 0.803s = 57.324s
BENCHMARKs: VC: 56.513s VO: 0.066s A: 0.000s Sys: 0.799s = 57.377s
BENCHMARKs: VC: 56.458s VO: 0.065s A: 0.000s Sys: 0.798s = 57.322s
BENCHMARKs: VC: 56.456s VO: 0.065s A: 0.000s Sys: 0.800s = 57.321s

P.S. mplayer was compiled without iwmmxt support. The system is running at 416MHz (PXA270). Kernel 2.6.19.2, system compiled with EABI and with -march, -mtune and -mcpu=iwmmxt. GCC 4.1.1, Glibc 2.5 (Gentoo 2006.1), if anyone is interested (I don't know why Mesk doesn't want to post about his progress with Gentoo for Zaurus here...).
P.P.S. Tested with: mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx
P.P.P.S. Later I'll add benchmarks with other CFLAGS. It took a lot of time to recompile mplayer on the Zaurus...
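
Each loop of a -benchmark run prints one "BENCHMARKs:" line like the ones above. If you want to collect these timings programmatically instead of copying them by hand, here is a minimal C sketch. It assumes the exact "VC: ...s VO: ...s A: ...s Sys: ...s = ...s" layout shown in this thread; the struct and function names are hypothetical and not part of MPlayer.

```c
#include <assert.h>
#include <stdio.h>

/* Timings, in seconds, from one MPlayer "BENCHMARKs:" line. */
struct bench_line {
    double vc;    /* video codec (decode) time */
    double vo;    /* video output time */
    double a;     /* audio time */
    double sys;   /* system time */
    double total; /* value printed after '=' */
};

/* Parse a line such as
 *   "BENCHMARKs: VC: 52.561s VO: 0.065s A: 0.000s Sys: 0.793s = 53.419s"
 * Returns 1 if all five values were read, 0 otherwise (for example for
 * the "BENCHMARK%:" percentage lines, which fail on the 's' literal). */
int parse_bench_line(const char *line, struct bench_line *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, " BENCHMARKs: VC: %lfs VO: %lfs A: %lfs Sys: %lfs = %lfs",
                   &out->vc, &out->vo, &out->a, &out->sys, &out->total);
    return n == 5;
}
```

Running every output line through this and keeping the vc field gives the per-run decode times, which is the number the builds in this thread are compared on.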

Post #32 by Serge, Jan 28 2007, 12:53 PM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
QUOTE(Civil @ Jan 28 2007, 11:58 AM)
P.S. mplayer was compiled without iwmmxt support. The system is running at 416MHz (PXA270). Kernel 2.6.19.2, system compiled with EABI and with -march, -mtune and -mcpu=iwmmxt. GCC 4.1.1, Glibc 2.5 (Gentoo 2006.1), if anyone is interested (I don't know why Mesk doesn't want to post about his progress with Gentoo for Zaurus here...).

Thanks for running these tests. They show that the results for -O3 (-O4) are pretty much the same as for -O2. It would be interesting to compare them against -Os, as this option is the one most commonly used on embedded devices. By the way, why wasn't iwmmxt used? It should provide quite a noticeable improvement, at least theoretically.

QUOTE
P.P.S. Tested with: mplayer -loop 5 -quiet -benchmark -nosound -vo null -lavdopts idct=16 Doom.divx
P.P.P.S. Later I'll add benchmarks with other CFLAGS. It took a lot of time to recompile mplayer on the Zaurus...

Thanks, I'm looking forward to more test results. Compiler optimization options are unlikely to provide a big improvement, but every little bit helps.
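
Averaging the five VC (decode) times from each of Civil's run sets puts a number on the "-O4 vs -O2" comparison. A small C sketch; the figures are copied from Civil's post, and the function and array names are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n timing samples. */
double mean(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* VC (decode) times in seconds from Civil's two run sets. */
const double vc_o4[] = { 52.561, 56.284, 56.476, 56.319, 56.434 };
const double vc_o2[] = { 53.703, 56.455, 56.513, 56.458, 56.456 };
```

With these inputs the means come out at about 55.61 s for the -O4 build and 55.92 s for the -O2 build, a difference of roughly 0.5%. That is smaller than the run-to-run spread within either set (about 2.8 s to 3.9 s between the fastest and slowest run), so the two builds are effectively indistinguishable on this test.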

Post #33 by Serge, Jan 28 2007, 12:55 PM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
QUOTE(adf @ Jan 28 2007, 12:36 PM)
Apologies for straying off topic; I'm actually interested in the mplayer work. BUT I followed the gentoo link in the last post. If progress is being made, it certainly deserves some attention. A mainstream distro like gentoo that compiles and runs on a Z (well optimized, etc.) has been a sort of holy grail for quite a few Zaurus users. By all means encourage the people working on the project to post here.

Wouldn't it be better to create a new topic for discussing gentoo on zaurus?

Post #34 by Civil, Jan 28 2007, 01:04 PM
Group: Members; Posts: 103; Joined: 22-August 05; From: Moscow, Russia; Member No.: 7,924
QUOTE
Wouldn't it be better to create a new topic for discussing gentoo on zaurus? Otherwise we risk turning this topic into a mess.

I'm not discussing it... And I'm not a developer, so I think the author (Mesk) should post about it. I've only posted basic info so people know which system I'm running now.

Post #35 by Serge, Feb 14 2007, 01:57 PM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
Some more mplayer-related news: the mplayer port for maemo should now be more or less usable on the Nokia N800 (the video freeze issues were fixed by using video output code with direct framebuffer access, just like on the Nokia 770). Once adapting to this new device is finished, code optimization work will resume.

Post #36 by tjchick, Mar 14 2007, 07:39 AM
Group: Members; Posts: 14; Joined: 9-May 06; Member No.: 9,821
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code. I have done this for the results below.
Here are my benchmark results on a standard SL-C3200, not overclocked, running OpenZaurus:
BENCHMARKs: VC: 44.056s VO: 0.078s A: 0.000s Sys: 0.831s = 44.965s
BENCHMARK%: VC: 97.9787% VO: 0.1734% A: 0.0000% Sys: 1.8479% = 100.0000%
BENCHMARKs: VC: 43.234s VO: 0.079s A: 0.000s Sys: 0.816s = 44.128s
BENCHMARK%: VC: 97.9734% VO: 0.1785% A: 0.0000% Sys: 1.8481% = 100.0000%
BENCHMARKs: VC: 43.487s VO: 0.076s A: 0.000s Sys: 0.813s = 44.376s
BENCHMARK%: VC: 97.9957% VO: 0.1715% A: 0.0000% Sys: 1.8328% = 100.0000%
BENCHMARKs: VC: 43.669s VO: 0.076s A: 0.000s Sys: 0.820s = 44.565s
BENCHMARK%: VC: 97.9891% VO: 0.1712% A: 0.0000% Sys: 1.8398% = 100.0000%
BENCHMARKs: VC: 43.497s VO: 0.078s A: 0.000s Sys: 0.810s = 44.386s
BENCHMARK%: VC: 97.9976% VO: 0.1764% A: 0.0000% Sys: 1.8260% = 100.0000%
Tim

Post #37 by Serge, Mar 14 2007, 08:14 AM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
QUOTE(tjchick @ Mar 14 2007, 07:39 AM)
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.

Do you really need to change the code to use iwmmxt? Isn't it a simple matter of properly running configure? Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

Post #38 by tjchick, Mar 14 2007, 08:29 AM
Group: Members; Posts: 14; Joined: 9-May 06; Member No.: 9,821
QUOTE(Serge @ Mar 14 2007, 05:14 PM)
QUOTE(tjchick @ Mar 14 2007, 07:39 AM)
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.
Do you really need to change the code to use iwmmxt? Isn't it a simple matter of properly running configure? Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

Yes, you really do: the code gets compiled but not used, because it is only installed after a test like this:
if (mm_flags & MM_IWMMXT) -> install dsp code
and mm_flags is filled in with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting it to work. I wonder if this is related to the lavdopts somehow? That's why the others only saw a 2% improvement (from compiling with better tuning options), while I see an improvement of 30% or so.
Tim
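
The gating Tim describes can be illustrated with a small, self-contained C sketch of capability-flag dispatch. All names here (CPU_FLAG_IWMMXT, FORCE_FLAGS, dsp_init, add_pixels_c, add_pixels_simd) are hypothetical stand-ins rather than FFmpeg's identifiers, and the override behaviour (force bits on, otherwise mask them off) is an assumption made for illustration. The point is only that when detection leaves the flags at 0, the optimized routine is compiled in but never installed unless a force mask says otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits, standing in for FFmpeg's MM_* flags. */
#define CPU_FLAG_IWMMXT 0x0100u
#define FORCE_FLAGS     0x80000000u  /* "force" bit in a user-supplied mask */

typedef void (*add_fn)(unsigned char *dst, const unsigned char *src, size_t n);

/* Portable C version: saturating add of src into dst. */
static void add_pixels_c(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned s = (unsigned)dst[i] + src[i];
        dst[i] = (unsigned char)(s > 255u ? 255u : s);
    }
}

/* Placeholder for a SIMD version; it reuses the C code so the sketch
 * runs on any machine. */
static void add_pixels_simd(unsigned char *dst, const unsigned char *src, size_t n)
{
    add_pixels_c(dst, src, n);
}

struct dsp {
    add_fn add_pixels;
    int using_simd;
};

/* detected:  flags from run-time CPU detection (0 if detection is missing)
 * user_mask: optional override (assumed semantics): with FORCE_FLAGS set,
 *            its low bits are added to the detected flags; otherwise they
 *            are removed from them. */
void dsp_init(struct dsp *c, unsigned detected, unsigned user_mask)
{
    assert(c != NULL);
    unsigned flags = detected;
    if (user_mask) {
        if (user_mask & FORCE_FLAGS)
            flags |= user_mask & 0xffffu;
        else
            flags &= ~(user_mask & 0xffffu);
    }
    /* Always install the portable version first... */
    c->add_pixels = add_pixels_c;
    c->using_simd = 0;
    /* ...and replace it only if the capability bit survived. */
    if (flags & CPU_FLAG_IWMMXT) {
        c->add_pixels = add_pixels_simd;
        c->using_simd = 1;
    }
}
```

Calling dsp_init(&c, 0, 0) reproduces the situation in 1.0rc1 as Tim describes it (the SIMD path stays unused), while dsp_init(&c, 0, FORCE_FLAGS | CPU_FLAG_IWMMXT) shows how a force mask could switch it on without working detection.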

Post #39 by Meanie, Mar 14 2007, 08:53 AM
Group: Members; Posts: 2,808; Joined: 21-March 05; From: Sydney, Australia; Member No.: 6,686
QUOTE(tjchick @ Mar 15 2007, 02:29 AM)
Yes, you really do: the code gets compiled but not used, because it is only installed after a test like this:
if (mm_flags & MM_IWMMXT) -> install dsp code
and mm_flags is filled in with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting it to work. I wonder if this is related to the lavdopts somehow? That's why the others only saw a 2% improvement (from compiling with better tuning options), while I see an improvement of 30% or so.
Tim

If you pull the latest source from SVN, you can just use --enable-iwmmxt.

Post #40 by Serge, Mar 14 2007, 09:32 AM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
QUOTE(tjchick @ Mar 14 2007, 08:29 AM)
Yes, you really do: the code gets compiled but not used, because it is only installed after a test like this:
if (mm_flags & MM_IWMMXT) -> install dsp code
and mm_flags is filled in with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting it to work. I wonder if this is related to the lavdopts somehow? That's why the others only saw a 2% improvement (from compiling with better tuning options), while I see an improvement of 30% or so.

Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instruction support for the MPlayer configure script, I could not verify that IWMMXT works as well (for an obvious reason: I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html

Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems enabling iwmmxt, please try to make a clean fix and submit the patch upstream. If you check the first post in this thread, you will see that upstream developers are not very familiar with the ARM platform. Only atty made some improvements to MPlayer at some point in the past, but for whatever reason he is unwilling to help upstream integrate his fixes. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job, and upstream developers are not obliged to fix our problems.

PS. I'm sorry if it was me who created the false impression that IWMMXT is fully supported in MPlayer 1.0rc1.

edit: IWMMXT has some additional registers, so saving and restoring them on context switches presumably has to be supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even when the CPU itself supports it? Anyway, that was just a wild guess; I'm not familiar with XScale at all.
And thanks for actually digging into the code and checking whether iwmmxt really works; the results posted in this thread were suspicious from the very start.
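
On the kernel-support question: Linux/ARM kernels that handle the iWMMXt registers report this through the HWCAP_IWMMXT hardware-capability bit, which also shows up as an "iwmmxt" token on the "Features" line of /proc/cpuinfo. This is Linux-specific rather than portable, and whether a particular Zaurus kernel of this era sets it is an assumption to check on the device. A minimal C sketch of the token check, written to take the file contents as a string so the parsing can be tested off-device; both function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read up to cap-1 bytes of /proc/cpuinfo into buf (NUL-terminated).
 * Returns 1 on success, 0 if the file cannot be opened. */
int read_cpuinfo(char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f)
        return 0;
    size_t n = fread(buf, 1, cap - 1, f);
    buf[n] = '\0';
    fclose(f);
    return 1;
}

/* Return 1 if `feature` appears as a whole word on a "Features" line of
 * /proc/cpuinfo-style text, 0 otherwise. */
int cpuinfo_has_feature(const char *text, const char *feature)
{
    assert(text != NULL && feature != NULL);
    size_t flen = strlen(feature);
    const char *line = text;
    while (line && *line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        if (strncmp(line, "Features", 8) == 0) {
            const char *p = memchr(line, ':', len);
            while (p && p < line + len) {
                /* skip the colon and separators, then measure one token */
                while (p < line + len && (*p == ':' || *p == ' ' || *p == '\t'))
                    p++;
                const char *tok = p;
                while (p < line + len && *p != ' ' && *p != '\t')
                    p++;
                if ((size_t)(p - tok) == flen && strncmp(tok, feature, flen) == 0)
                    return 1;
                if (p == tok)
                    break; /* no more tokens on this line */
            }
        }
        line = end ? end + 1 : NULL;
    }
    return 0;
}
```

Matching whole tokens matters here: a plain strstr() search would also accept longer names that merely contain "iwmmxt".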

Post #41 by tjchick, Mar 15 2007, 01:51 AM
Group: Members; Posts: 14; Joined: 9-May 06; Member No.: 9,821
QUOTE(Serge @ Mar 14 2007, 06:32 PM)
Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instruction support for the MPlayer configure script, I could not verify that IWMMXT works as well (for an obvious reason: I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems enabling iwmmxt, please try to make a clean fix and submit the patch upstream.

I already did this yesterday, before I saw your messages. Yes Meanie, even the latest SVN does not fix matters. I posted a patch to the ffmpeg dev mailing list, got some feedback and posted another patch. I'm awaiting the response.

QUOTE
If you check the first post in this thread, you will see that upstream developers are not very familiar with the ARM platform. Only atty made some improvements to MPlayer at some point in the past, but for whatever reason he is unwilling to help upstream integrate his fixes. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job, and upstream developers are not obliged to fix our problems. PS. I'm sorry if it was me who created the false impression that IWMMXT is fully supported in MPlayer 1.0rc1. edit: IWMMXT has some additional registers, so saving and restoring them on context switches presumably has to be supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even when the CPU itself supports it? Anyway, that was just a wild guess; I'm not familiar with XScale at all. And thanks for actually digging into the code and checking whether iwmmxt really works; the results posted in this thread were suspicious from the very start.

Yes, IWMMXT needs OS support as well as the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try to use iwmmxt if it is compiled in; you need to turn on compile switches to get it.

I also noticed one more thing: the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the ARMv5 version. This provided a small speed increase. So either the version which was in use before was pretty good (be warned: it is easy to spend a lot of time writing ARM assembler which is *worse* than the compiler output), or the system is memory bound as others have suggested. It might be worth looking at joining together more of the reads and writes where possible (the system uses SDRAM, so performance for single-word accesses sucks compared to 2-word accesses etc., when the cache is overstretched).

Here are the new results:
BENCHMARKs: VC: 43.497s
BENCHMARKs: VC: 42.813s
BENCHMARKs: VC: 43.040s
BENCHMARKs: VC: 43.269s
BENCHMARKs: VC: 43.090s
Thanks,
Tim

Post #42 by Serge, Mar 15 2007, 10:52 AM
Group: Members; Posts: 51; Joined: 8-October 06; Member No.: 11,724
QUOTE(tjchick @ Mar 15 2007, 01:51 AM)
Yes, IWMMXT needs OS support as well as the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try to use iwmmxt if it is compiled in; you need to turn on compile switches to get it.

That's probably fine. By the way, you can also try compiling MPlayer with the Intel IPP (Integrated Performance Primitives) library and check whether it improves performance.

QUOTE
I also noticed one more thing: the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the ARMv5 version. This provided a small speed increase.

This should not be a problem, as dct_unquantize_h263_inter is not a performance-critical function. But it is very similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with statistics about decoding this Doom video clip on the Nokia 770: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

QUOTE
So either the version which was in use before was pretty good

It was just not performance critical; I wonder how you even managed to see some improvement.

QUOTE
(be warned: it is easy to spend a lot of time writing ARM assembler which is *worse* than the compiler output),

Actually, I find compiler-generated code for ARM quite poorly optimized. The compiler can't make good use of conditionally executed instructions, can't use DSP instructions, and can't schedule code optimally to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck if you want any visible overall performance improvement. I always prefer to develop some simple performance and correctness tests for the performance-critical functions I'm optimizing, so I can make sure that they really improve performance and do not introduce stability issues. Random assembly hacking is certainly not a productive way of working.

QUOTE
or the system is memory bound as others have suggested.

This particular function runs on fully cached data, so memory access time is not important here. I investigated the mplayer memory access pattern using valgrind (the callgrind tool), which gives more or less precise information about cache misses. The code that depends heavily on memory performance is in the motion compensation functions and partially in the IDCT (cache write misses for the destination buffer).

QUOTE
It might be worth looking at joining together more of the reads and writes where possible (the system uses SDRAM, so performance for single-word accesses sucks compared to 2-word accesses etc., when the cache is overstretched)

Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.

PS. To make sure that video is decoded not only fast but also correctly, you can use the '-vo md5' option. I noticed some really ugly video decoding artefacts when using the standard ARM optimized IDCT (vertical stripes on panning scenes); the ARMv5TE optimized IDCT produces output identical to the C implementation.
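
Serge's advice to pair each optimized routine with a correctness test can be illustrated on the H.263 inter dequantization discussed above. The sketch below is a plain reference function for the H.263 reconstruction rule for non-zero levels (|rec| = qscale * (2*|level| + 1), minus 1 when qscale is even), a second variant using precomputed qmul/qadd constants as a stand-in for an "optimized" version, and a check that compares them on random blocks. The function names are hypothetical, this is not FFmpeg's code, and it does no range clipping of the results.

```c
#include <assert.h>
#include <stdlib.h>

/* Reference: for each non-zero level,
 * |rec| = qscale * (2*|level| + 1), minus 1 when qscale is even. */
void unquant_inter_ref(short *block, int ncoeffs, int qscale)
{
    assert(block != NULL && qscale >= 1 && qscale <= 31);
    for (int i = 0; i < ncoeffs; i++) {
        int level = block[i];
        if (level == 0)
            continue;
        int mag = qscale * (2 * abs(level) + 1) - (qscale % 2 == 0 ? 1 : 0);
        block[i] = (short)(level < 0 ? -mag : mag);
    }
}

/* Variant with precomputed constants: qmul = 2*qscale and
 * qadd = (qscale - 1) | 1 give the same result as qmul*|level| + qadd. */
void unquant_inter_fast(short *block, int ncoeffs, int qscale)
{
    const int qmul = qscale << 1;
    const int qadd = (qscale - 1) | 1;
    for (int i = 0; i < ncoeffs; i++) {
        int level = block[i];
        if (level)
            block[i] = (short)(level < 0 ? level * qmul - qadd
                                         : level * qmul + qadd);
    }
}

/* Run both versions on `trials` random 64-coefficient blocks and return
 * the number of blocks where they disagree (0 means they match). */
int check_unquant(int trials, unsigned seed)
{
    int mismatches = 0;
    srand(seed);
    for (int t = 0; t < trials; t++) {
        short a[64], b[64];
        int q = 1 + rand() % 31;
        for (int i = 0; i < 64; i++) {
            /* roughly three quarters zeros, like real coefficient blocks */
            a[i] = b[i] = (short)(rand() % 4 ? 0 : rand() % 255 - 127);
        }
        unquant_inter_ref(a, 64, q);
        unquant_inter_fast(b, 64, q);
        for (int i = 0; i < 64; i++) {
            if (a[i] != b[i]) {
                mismatches++;
                break;
            }
        }
    }
    return mismatches;
}
```

In practice the "fast" side would be the assembler routine under test, and wrapping the same loop in clock() calls gives a rough timing comparison against the reference.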

Post #43 by tjchick, Mar 15 2007, 12:05 PM
Group: Members; Posts: 14; Joined: 9-May 06; Member No.: 9,821
QUOTE(Serge @ Mar 15 2007, 07:52 PM)
QUOTE(tjchick @ Mar 15 2007, 01:51 AM)
Yes, IWMMXT needs OS support as well as the right processor. Unfortunately I (and others) cannot find a simple, portable method for detecting this. So the only option is to try to use iwmmxt if it is compiled in; you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try compiling MPlayer with the Intel IPP (Integrated Performance Primitives) library and check whether it improves performance.

I think it does, as I know the Cacko mplayer-atty is faster again than "mine", and that uses the IPP stuff for IDCT. I was not really interested in trying it though, due to the license restrictions of IPP.

QUOTE
QUOTE
I also noticed one more thing: the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the ARMv5 version. This provided a small speed increase.
This should not be a problem, as dct_unquantize_h263_inter is not a performance-critical function. But it is very similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with statistics about decoding this Doom video clip on the Nokia 770:

One thing I'm going to do is compare the iwmmxt code against your ARMv5TE code, performance-wise.
Cheers,
Tim

Post #44 by Meanie, Mar 15 2007, 03:39 PM
Group: Members; Posts: 2,808; Joined: 21-March 05; From: Sydney, Australia; Member No.: 6,686
Actually, I think your new build is much faster than atty's in decoding speed.
Here are the benchmark results of running atty's iwmmxt-optimized build of mplayer on a C3000 with pdaXrom:
BENCHMARKs: VC: 40.385s VO: 0.068s A: 0.000s Sys: 0.863s = 41.315s
BENCHMARKs: VC: 47.495s VO: 0.067s A: 0.000s Sys: 0.860s = 48.421s
BENCHMARKs: VC: 45.600s VO: 0.067s A: 0.000s Sys: 0.843s = 46.509s
BENCHMARKs: VC: 45.629s VO: 0.068s A: 0.000s Sys: 0.865s = 46.562s
BENCHMARKs: VC: 45.820s VO: 0.068s A: 0.000s Sys: 0.859s = 46.748s

For comparison, here are the benchmark results of the SVN mplayer code with ARMv5TE enabled and XScale tuning CC flags:
BENCHMARKs: VC: 52.105s VO: 0.026s A: 0.000s Sys: 1.047s = 53.178s
BENCHMARKs: VC: 53.503s VO: 0.027s A: 0.000s Sys: 0.923s = 54.453s
BENCHMARKs: VC: 54.030s VO: 0.027s A: 0.000s Sys: 0.914s = 54.970s
BENCHMARKs: VC: 53.926s VO: 0.027s A: 0.000s Sys: 0.931s = 54.883s
BENCHMARKs: VC: 53.267s VO: 0.034s A: 0.000s Sys: 0.927s = 54.228s
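
In several of these five-run sets one run differs noticeably from the rest (for atty's build, 40.4 s and 47.5 s against roughly 45.6 s for the other three), so the median is a more robust single figure than the mean. A short C sketch computing the medians and their ratio from Meanie's VC numbers; the helper and array names are only illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (1 <= n <= 16); the input array is not modified. */
double median(const double *x, size_t n)
{
    assert(n > 0 && n <= 16);
    double tmp[16];
    memcpy(tmp, x, n * sizeof *x);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    return n % 2 ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
}

/* VC (decode) times in seconds from Meanie's two run sets. */
const double vc_atty[]     = { 40.385, 47.495, 45.600, 45.629, 45.820 };
const double vc_svn_v5te[] = { 52.105, 53.503, 54.030, 53.926, 53.267 };
```

With these inputs the medians are 45.629 s for atty's build and 53.503 s for the SVN ARMv5TE build, so in this test the SVN build needed about 17% more decode time.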

Post #45 by tjchick, Mar 21 2007, 07:10 AM
Group: Members; Posts: 14; Joined: 9-May 06; Member No.: 9,821
On Cacko on a C1000, I see:
VC: 36.186s
VC: 36.927s
VC: 37.662s
VC: 36.932s
VC: 37.016s
and similar figures for Sys. Cacko uses atty's mplayer, which still seems to be the best by quite a margin! At a guess this is due to IPP for IDCT.
Thanks,
Tim