Author Topic: Mplayer Development And Optimization For Arm  (Read 85815 times)

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #45 on: March 21, 2007, 12:26:33 pm »
You can try to override idct by using '-lavdopts idct=<some_number>'  in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.

By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

koen

  • Hero Member
  • *****
  • Posts: 1008
    • View Profile
    • http://dominion.thruhere.net/koen/cms/
Mplayer Development And Optimization For Arm
« Reply #46 on: March 21, 2007, 12:42:45 pm »
Quote
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?

Except that ARM has no immediate assignments and needs aligned data...
Forums are not bugtrackers!!! Smart questions
Ångström release team
iPAQ h2210, iPAQ h5550, iPAQ hx4700, Zaurus SL-C700, Nokia 770, all running some form of GPE
My blog

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #47 on: March 22, 2007, 02:56:02 pm »
Quote
Quote
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
Except that ARM has no immediate assignments
The MMX instruction set does not have immediate assignments either. In any case, that's not a big deal.

Quote
and needs aligned data...
FFmpeg takes special care with alignment: many functions specify a guaranteed alignment for the data they process (some SSE instructions require 16-byte alignment, after all, so ARM is not the strictest in this respect). Input data for IDCT is also 16-byte aligned, for example, which is more than enough for ARM.

Anyway, somebody just needs to give it a try. To encourage you and show that it might work: it looks like atty took the existing MMX implementation of dct_unquantize_h263_intra_mmx and converted it to dct_unquantize_h263_intra_iwmmxt. He probably did not bother with IDCT because he could just use IPP instead, so converting the IDCT from MMX to IWMMXT may also be possible without too much work (everything is relative, of course). I wonder which implementation would be faster? On one hand, IPP is a library developed by professionals at Intel; on the other hand, FFmpeg has proved to be very well optimized, beating many other codecs on x86, and its default IDCT is MMX optimized.
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

koen

  • Hero Member
  • *****
  • Posts: 1008
    • View Profile
    • http://dominion.thruhere.net/koen/cms/
Mplayer Development And Optimization For Arm
« Reply #48 on: March 22, 2007, 05:35:01 pm »
Quote
Quote
Quote
By the way, IWMMXT seems to be very close to MMX (there is even a table of mapping of the instructions in intel manual). FFmpeg has MMX optimized IDCT implementation. So maybe direct conversion of MMX->IWMMXT is not so hard?
Except that ARM has no immediate assignments
The MMX instruction set does not have immediate assignments either. In any case, that's not a big deal.

Quote
and needs aligned data...
FFmpeg takes special care with alignment: many functions specify a guaranteed alignment for the data they process (some SSE instructions require 16-byte alignment, after all, so ARM is not the strictest in this respect). Input data for IDCT is also 16-byte aligned, for example, which is more than enough for ARM.


Right, o-hand ported the fbmmx layer in the xserver to iwmmx but it wasn't faster since you had to align the data by hand. Maybe ffmpeg can gain more.
Forums are not bugtrackers!!! Smart questions
Ångström release team
iPAQ h2210, iPAQ h5550, iPAQ hx4700, Zaurus SL-C700, Nokia 770, all running some form of GPE
My blog

tjchick

  • Newbie
  • *
  • Posts: 14
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #49 on: March 23, 2007, 06:00:21 pm »
Quote
You can try to override idct by using '-lavdopts idct=<some_number>'  in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or maybe atty's build has some other optimizations.

I did try it, and using the non-IPP IDCT produces roughly comparable results. atty's mplayer is still faster by 10% or so, so there are still a few more tweaks I need to sort out, but it was 40% better when using IPP.

Cheers,
Tim

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #50 on: July 14, 2007, 05:16:53 pm »
Hi, I'm working on further optimizing the ARMv5 IDCT for mplayer/ffmpeg. The older implementation from mplayer 1.0rc1 was only optimized for ARM9E cores. Now it should get noticeably faster on long-pipeline cores such as XScale (Sharp Zaurus) and ARM11 (Nokia N800).

Can anybody compile and run the following test on XScale?

> svn checkout https://garage.maemo.org/svn/mplayer/trunk/libavcodec
> cd libavcodec/tests
> make test-idct

You may need to specify the name of your cross-compiler when running make (e.g. 'CC="arm-softfloat-linux-gnueabi-gcc" make test-idct').

After that, please copy the 'test-idct' binary to your device and run it, specifying the CPU clock frequency on the command line (for a 416MHz Zaurus it would be './test-idct --freq=416').

For those who are curious, here are the results from running this test on Nokia 770:
Code:
> ./test-idct --freq=252
Assuming cpu clock frequency 252MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=886.0
simple_idct_put_armv5te  cache=no,  time=1062.2
simple_idct_put_armv5te  cache=yes, time=1032.8
simple_idct_add_armv5te  cache=no,  time=1323.7
simple_idct_add_armv5te  cache=yes, time=1186.2
simple_idct_armv5te_ref  time=1041.8
simple_idct_put_armv5te_ref  cache=no,  time=1257.6
simple_idct_put_armv5te_ref  cache=yes, time=1253.0
simple_idct_add_armv5te_ref  cache=no,  time=1561.9
simple_idct_add_armv5te_ref  cache=yes, time=1445.6
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1423.4
simple_idct_put_armv5te  cache=no,  time=1665.7
simple_idct_put_armv5te  cache=yes, time=1655.3
simple_idct_add_armv5te  cache=no,  time=1934.6
simple_idct_add_armv5te  cache=yes, time=1783.8
simple_idct_armv5te_ref  time=1698.6
simple_idct_put_armv5te_ref  cache=no,  time=1914.0
simple_idct_put_armv5te_ref  cache=yes, time=1911.6
simple_idct_add_armv5te_ref  cache=no,  time=2221.2
simple_idct_add_armv5te_ref  cache=yes, time=2098.9

Results for Nokia N800:
Code:
> ./test-idct --freq=330 --enable-armv6
Assuming cpu clock frequency 330MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=751.3
simple_idct_put_armv5te  cache=no,  time=947.7
simple_idct_put_armv5te  cache=yes, time=866.9
simple_idct_add_armv5te  cache=no,  time=1099.2
simple_idct_add_armv5te  cache=yes, time=937.6
simple_idct_armv5te_ref  time=1084.5
simple_idct_put_armv5te_ref  cache=no,  time=1288.4
simple_idct_put_armv5te_ref  cache=yes, time=1280.5
simple_idct_add_armv5te_ref  cache=no,  time=1538.2
simple_idct_add_armv5te_ref  cache=yes, time=1397.9
simple_idct_armv6  time=762.4
simple_idct_put_armv6  cache=no,  time=1034.9
simple_idct_put_armv6  cache=yes, time=765.4
simple_idct_add_armv6  cache=no,  time=1063.2
simple_idct_add_armv6  cache=yes, time=903.2
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1220.0
simple_idct_put_armv5te  cache=no,  time=1413.3
simple_idct_put_armv5te  cache=yes, time=1355.4
simple_idct_add_armv5te  cache=no,  time=1576.0
simple_idct_add_armv5te  cache=yes, time=1417.2
simple_idct_armv5te_ref  time=1872.0
simple_idct_put_armv5te_ref  cache=no,  time=2079.6
simple_idct_put_armv5te_ref  cache=yes, time=2081.5
simple_idct_add_armv5te_ref  cache=no,  time=2342.7
simple_idct_add_armv5te_ref  cache=yes, time=2190.1
simple_idct_armv6  time=1138.9
simple_idct_put_armv6  cache=no,  time=1426.7
simple_idct_put_armv6  cache=yes, time=1144.8
simple_idct_add_armv6  cache=no,  time=1444.1
simple_idct_add_armv6  cache=yes, time=1281.9

Test results from XScale are needed to check whether my assumptions are correct (I used the ARM9E, ARM11 and XScale manuals as references to write code that works best on all these CPUs, but could only test it on Nokia 770 and N800). Theoretically, results from XScale should be very similar to the results from Nokia N800 (ARM11). Lower numbers are better (the value is the time for running IDCT, in CPU cycles). Functions with the '_ref' suffix belong to the reference ARMv5TE optimized IDCT implementation from mplayer 1.0rc1.
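
For anyone who wants to turn these cycle counts into speedup figures, here is a small sketch. The numbers are copied from the Nokia N800 run above (random coefficients); the helper names `speedup` and `cycles_to_microseconds` are made up for this illustration and are not part of test-idct.

```python
# Cycle counts copied from the Nokia N800 results above (random coefficients).
new_cycles = {"idct": 1220.0, "put_nocache": 1413.3, "add_nocache": 1576.0}
ref_cycles = {"idct": 1872.0, "put_nocache": 2079.6, "add_nocache": 2342.7}


def speedup(ref, new):
    """How many times faster the new code is than the reference."""
    return ref / new


def cycles_to_microseconds(cycles, freq_mhz):
    """Convert a cycle count to microseconds at the given clock (MHz)."""
    return cycles / freq_mhz


speedups = {name: speedup(ref_cycles[name], new_cycles[name]) for name in new_cycles}

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x faster than the _ref version")

# One call of the new simple_idct_armv5te at 330MHz, in microseconds.
print(f"idct: {cycles_to_microseconds(new_cycles['idct'], 330):.2f} us per call")
```

The same two helpers can be fed the Nokia 770 numbers by swapping the dictionaries and using 252 as the frequency.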

If anybody wants to build an optimized mplayer, you need to download this file and replace simple_idct_armv5te.S in your mplayer sources.
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

Capn_Fish

  • Hero Member
  • *****
  • Posts: 2342
    • View Profile
    • http://
Mplayer Development And Optimization For Arm
« Reply #51 on: July 14, 2007, 05:47:00 pm »
I'll see if I can give it a try.

How much is this likely to speed up MPlayer, or is that what you're trying to determine?
SL-C750- pdaXrom beta 1 (mostly unused)
Current distro: Gentoo

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #52 on: July 14, 2007, 06:04:47 pm »
Quote
I'll see if I can give it a try.

How much is this likely to speed up MPlayer, or is that what you're trying to determine?
IDCT usually takes 20-40% of video decoding time. There will be no huge overall speedup, but the improvement should be quite noticeable (IDCT itself becomes up to 1.5x faster on ARM11). The goal is to reduce the performance difference from the mplayer compiled with IPP (see tjchick's earlier post) and possibly beat it.
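
A rough way to see what that means for the whole decoder is Amdahl's law. This is only a back-of-the-envelope sketch: it assumes the 20-40% share and the 1.5x figure quoted above and that nothing else in the decoder changes. The function name `overall_speedup` is just for illustration.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: total speedup when `fraction` of the run time
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


for fraction in (0.2, 0.4):
    gain = overall_speedup(fraction, 1.5)
    saved = 1.0 - 1.0 / gain  # share of decoding time that disappears
    print(f"IDCT share {fraction:.0%}: overall {gain:.3f}x, "
          f"decoding time reduced by {saved:.1%}")
```

So even in the best case this is a modest overall gain, which is why real measurements matter more than the estimate.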

The best results can be achieved by using IWMMX instructions though. But some older cores do not support IWMMX (PXA255 for example) and a tweaked ARMv5TE IDCT would be handy there. Also IWMMX optimized IDCT still needs to be written and this ARMv5TE IDCT can serve as a placeholder until then.
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

Civil

  • Full Member
  • ***
  • Posts: 103
    • View Profile
    • http://
Mplayer Development And Optimization For Arm
« Reply #53 on: July 15, 2007, 09:45:15 am »
pxa270, 416MHz (Zaurus C3100), Gentoo 2007.0, eabi.
Code:
Assuming cpu clock frequency 416MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=751.9
simple_idct_put_armv5te  cache=no,  time=1988.0
simple_idct_put_armv5te  cache=yes, time=860.2
simple_idct_add_armv5te  cache=no,  time=1136.2
simple_idct_add_armv5te  cache=yes, time=923.1
simple_idct_armv5te_ref  time=1131.8
simple_idct_put_armv5te_ref  cache=no,  time=1297.1
simple_idct_put_armv5te_ref  cache=yes, time=1281.0
simple_idct_add_armv5te_ref  cache=no,  time=1625.5
simple_idct_add_armv5te_ref  cache=yes, time=1385.5
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1168.7
simple_idct_put_armv5te  cache=no,  time=2281.7
simple_idct_put_armv5te  cache=yes, time=1277.0
simple_idct_add_armv5te  cache=no,  time=1595.2
simple_idct_add_armv5te  cache=yes, time=1340.3
simple_idct_armv5te_ref  time=1821.7
simple_idct_put_armv5te_ref  cache=no,  time=1988.0
simple_idct_put_armv5te_ref  cache=yes, time=1981.6
simple_idct_add_armv5te_ref  cache=no,  time=2326.5
simple_idct_add_armv5te_ref  cache=yes, time=2084.4
Zaurus C-3100 ( Gentoo 2007.0 eabi, kernel 2.6.21.6)
http://www.zavrik.info - Russian Zaurus Site.

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #54 on: July 15, 2007, 12:40:37 pm »
Quote
pxa270, 416MHz (Zaurus C3100), Gentoo 2007.0, eabi.
...
Thanks for running this test. Almost everything is just as I expected: the XScale pipeline really is very similar to ARM11. The number-crunching part of IDCT is now ~1.5x faster ('simple_idct_armv5te' vs. 'simple_idct_armv5te_ref'). Also, everything is very fast if we don't take memory performance into account and all the memory accesses hit the cache.

But generally we are interested in the performance of 'simple_idct_put_armv5te' and 'simple_idct_add_armv5te' when the results get stored into memory and that memory region is not in the cache. Everything is fine with 'simple_idct_add_armv5te': it really did get a lot faster. But there seems to be an unexpected problem with 'simple_idct_put_armv5te'. Probably the write buffer (temporary storage in the CPU for memory writes that bypass the cache) overflows and the XScale pipeline stalls, resulting in very poor performance. When 'simple_idct_put_armv5te' stores results into a memory region that is in the cache, it works very fast. I'll try to tweak the code a bit and will ask you to rerun this test a bit later.
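
To make the anomaly easier to see, here is a small sketch that computes the extra cycles paid when the destination is not in the cache. The numbers are copied from the PXA270 results quoted above (zero coefficients); the dictionary layout and the name `miss_penalty` are only for illustration.

```python
# Cycle counts copied from the PXA270 (416MHz) results, zero coefficients.
pxa270 = {
    "put":     {"cache_no": 1988.0, "cache_yes": 860.2},
    "add":     {"cache_no": 1136.2, "cache_yes": 923.1},
    "put_ref": {"cache_no": 1297.1, "cache_yes": 1281.0},
}


def miss_penalty(entry):
    """Extra cycles spent when the destination memory is not cached."""
    return entry["cache_no"] - entry["cache_yes"]


penalties = {name: miss_penalty(entry) for name, entry in pxa270.items()}

for name, cycles in penalties.items():
    print(f"{name}: +{cycles:.1f} cycles when uncached")
```

The new 'put' function pays over a thousand extra cycles when uncached, while the reference 'put' and the new 'add' pay far less, which is what points at the store pattern rather than the arithmetic.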

Thanks again for running the test. If we had not checked this code on XScale before submitting it to ffmpeg, performance on XScale would not have been very good (I don't know how it would have affected overall results, as 'simple_idct_add_armv5te' would speed up and 'simple_idct_put_armv5te' would slow down).

Anyway, after the code gets fixed for XScale, I think we can expect an overall video decoding improvement of something like 5-10% on it (depending on the video file).
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #55 on: August 29, 2007, 01:33:02 am »
Sorry for the long delay in answering. Could you try to run this IDCT test on XScale again? I believe the performance regression in 'simple_idct_put_armv5te' should be fixed now.
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

speculatrix

  • Administrator
  • Hero Member
  • *****
  • Posts: 3707
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #56 on: August 30, 2007, 05:50:21 pm »
Any improvement at all is very much welcomed - I hope that these optimisations will make it into Angstrom as soon as proven and stable!
Gemini 4G/Wi-Fi owner, formerly zaurus C3100 and 860 owner; also owner of an HTC Doubleshot, a Zaurus-like phone.

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #57 on: September 02, 2007, 02:18:17 pm »
Quote
Any improvement at all is very much welcomed - I hope that these optimisations will make it into Angstrom as soon as proven and stable!
Well, I'm maintaining the mplayer package for maemo and already have some good stuff which I would like to contribute to ffmpeg. I'm only posting some test code here to ensure that my submissions will not cause any regressions on XScale and will not hurt you. This code has already proven very useful on Nokia internet tablets, and most likely will be good for Zaurus too. But nobody knows for sure, so it is better to test everything (as the test done by Civil proved). Having an XScale device for testing would be useful for fine-tuning the code for better performance and probably even for trying IWMMX optimizations, but I'm not sure I want to spend 400-500 euro on just one more toy. Maybe if somebody could lend me an XScale-powered Linux PDA for a few weekends, everything would be much easier and faster.

By the way, here are the latest synthetic benchmarks of ARMv5TE optimized IDCT (SVN revision 249) on Nokia N800 as its ARM11 cpu is similar to XScale:
Code:
$ ./test-idct --freq=330
Assuming cpu clock frequency 330MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=685.8
simple_idct_put_armv5te  cache=no,  time=780.4
simple_idct_put_armv5te  cache=yes, time=770.0
simple_idct_add_armv5te  cache=no,  time=984.9
simple_idct_add_armv5te  cache=yes, time=853.3
simple_idct_add_pf_pld_armv5te  cache=no,  time=940.9
simple_idct_add_pf_pld_armv5te  cache=yes,  time=863.1
simple_idct_add_pf_ldr_armv5te  cache=no, time=958.3
simple_idct_add_pf_ldr_armv5te  cache=yes, time=862.5
simple_idct_armv5te_ref  time=1088.1
simple_idct_put_armv5te_ref  cache=no,  time=1286.2
simple_idct_put_armv5te_ref  cache=yes, time=1282.9
simple_idct_add_armv5te_ref  cache=no,  time=1518.2
simple_idct_add_armv5te_ref  cache=yes, time=1393.9
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1147.0
simple_idct_put_armv5te  cache=no,  time=1240.9
simple_idct_put_armv5te  cache=yes, time=1233.8
simple_idct_add_armv5te  cache=no,  time=1467.0
simple_idct_add_armv5te  cache=yes, time=1317.2
simple_idct_add_pf_pld_armv5te  cache=no,  time=1403.5
simple_idct_add_pf_pld_armv5te  cache=yes,  time=1366.2
simple_idct_add_pf_ldr_armv5te  cache=no, time=1438.8
simple_idct_add_pf_ldr_armv5te  cache=yes, time=1341.3
simple_idct_armv5te_ref  time=1872.6
simple_idct_put_armv5te_ref  cache=no,  time=2065.1
simple_idct_put_armv5te_ref  cache=yes, time=2064.9
simple_idct_add_armv5te_ref  cache=no,  time=2308.4
simple_idct_add_armv5te_ref  cache=yes, time=2179.2


Also, here is a more realistic test with the matrixbench_normdivx_vbrmp3.avi video clip from http://samples.mplayerhq.hu/benchmark/testsuite1/
Code:
Benchmark with current IDCT:
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 135.127s VO:   0.163s A:   0.000s Sys:   1.387s =  136.677s
BENCHMARKs: VC: 132.337s VO:   0.153s A:   0.000s Sys:   1.382s =  133.872s
BENCHMARKs: VC: 133.986s VO:   0.148s A:   0.000s Sys:   1.351s =  135.485s
BENCHMARKs: VC: 134.576s VO:   0.174s A:   0.000s Sys:   1.351s =  136.102s
BENCHMARKs: VC: 132.979s VO:   0.161s A:   0.000s Sys:   1.387s =  134.527s
BENCHMARKs: VC: 132.987s VO:   0.145s A:   0.000s Sys:   1.408s =  134.539s
BENCHMARKs: VC: 132.945s VO:   0.150s A:   0.000s Sys:   1.394s =  134.489s
BENCHMARKs: VC: 132.248s VO:   0.152s A:   0.000s Sys:   1.353s =  133.753s
BENCHMARKs: VC: 131.673s VO:   0.152s A:   0.000s Sys:   1.366s =  133.191s
BENCHMARKs: VC: 132.138s VO:   0.149s A:   0.000s Sys:   1.370s =  133.656s
BENCHMARKs: VC: 132.536s VO:   0.144s A:   0.000s Sys:   1.364s =  134.044s
BENCHMARKs: VC: 132.332s VO:   0.148s A:   0.000s Sys:   1.329s =  133.810s

Benchmark with the new optimized IDCT (after replacing 'simple_idct_armv5te.S' and recompiling mplayer):
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 122.543s VO:   0.162s A:   0.000s Sys:   1.416s =  124.120s
BENCHMARKs: VC: 120.901s VO:   0.152s A:   0.000s Sys:   1.371s =  122.424s
BENCHMARKs: VC: 122.490s VO:   0.147s A:   0.000s Sys:   1.338s =  123.975s
BENCHMARKs: VC: 124.826s VO:   0.151s A:   0.000s Sys:   1.325s =  126.302s
BENCHMARKs: VC: 123.052s VO:   0.143s A:   0.000s Sys:   1.393s =  124.588s
BENCHMARKs: VC: 121.897s VO:   0.146s A:   0.000s Sys:   1.366s =  123.409s
BENCHMARKs: VC: 122.406s VO:   0.139s A:   0.000s Sys:   1.359s =  123.903s
BENCHMARKs: VC: 123.448s VO:   0.150s A:   0.000s Sys:   1.381s =  124.979s
BENCHMARKs: VC: 119.141s VO:   0.143s A:   0.000s Sys:   1.360s =  120.644s
BENCHMARKs: VC: 120.555s VO:   0.147s A:   0.000s Sys:   1.340s =  122.042s
BENCHMARKs: VC: 120.686s VO:   0.141s A:   0.000s Sys:   1.377s =  122.203s
BENCHMARKs: VC: 120.902s VO:   0.143s A:   0.000s Sys:   1.358s =  122.402s

It really confirms a video decoding speedup in the 5-10% range, as estimated earlier. It will be interesting to see how it works on XScale. It would also be very interesting to compare the performance of this IDCT implementation with the one from IPP, to check which one is faster now and by how much.
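
For reference, the figure can be checked by averaging the VC (video codec) column of the two N800 runs above. This is just a sketch: the VC values are copied from the logs, and the helper `video_codec_times` is a made-up name for pulling that column out of pasted 'BENCHMARKs:' lines.

```python
import re
from statistics import mean


def video_codec_times(log_text):
    """Return the VC seconds from every 'BENCHMARKs:' line in the text."""
    return [float(v) for v in re.findall(r"BENCHMARKs: VC:\s*([0-9.]+)s", log_text)]


# VC values copied from the two N800 benchmark runs above.
before = [135.127, 132.337, 133.986, 134.576, 132.979, 132.987,
          132.945, 132.248, 131.673, 132.138, 132.536, 132.332]
after = [122.543, 120.901, 122.490, 124.826, 123.052, 121.897,
         122.406, 123.448, 119.141, 120.555, 120.686, 120.902]

reduction = 1.0 - mean(after) / mean(before)
print(f"mean VC before: {mean(before):.3f}s, after: {mean(after):.3f}s")
print(f"decoding time reduced by {reduction:.1%}")

# The parser works directly on a pasted log line:
sample = "BENCHMARKs: VC: 135.127s VO:   0.163s A:   0.000s Sys:   1.387s =  136.677s"
print(video_codec_times(sample))
```

The same script can be pointed at results from other devices by pasting their log output into `video_codec_times`.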
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)

XorA

  • Full Member
  • ***
  • Posts: 101
    • View Profile
    • http://
Mplayer Development And Optimization For Arm
« Reply #58 on: September 03, 2007, 06:32:11 am »
A Zaurus C3200 (PXA27x)

Before new idct

mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 209.368s VO:   0.168s A:   0.000s Sys:   3.011s =  212.547s
BENCHMARKs: VC: 213.062s VO:   0.170s A:   0.000s Sys:   3.022s =  216.253s
BENCHMARKs: VC: 214.726s VO:   0.169s A:   0.000s Sys:   3.039s =  217.935s
BENCHMARKs: VC: 214.936s VO:   0.170s A:   0.000s Sys:   2.674s =  217.780s
BENCHMARKs: VC: 215.113s VO:   0.170s A:   0.000s Sys:   3.182s =  218.464s
BENCHMARKs: VC: 215.065s VO:   0.170s A:   0.000s Sys:   2.618s =  217.853s
BENCHMARKs: VC: 215.700s VO:   0.170s A:   0.000s Sys:   2.611s =  218.482s
BENCHMARKs: VC: 215.293s VO:   0.170s A:   0.000s Sys:   2.606s =  218.069s
BENCHMARKs: VC: 215.575s VO:   0.170s A:   0.000s Sys:   2.621s =  218.366s
BENCHMARKs: VC: 215.655s VO:   0.169s A:   0.000s Sys:   2.608s =  218.433s
BENCHMARKs: VC: 215.323s VO:   0.170s A:   0.000s Sys:   2.614s =  218.107s
BENCHMARKs: VC: 215.373s VO:   0.170s A:   0.000s Sys:   2.610s =  218.153s

After new idct

mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 203.236s VO:   0.169s A:   0.000s Sys:   2.651s =  206.056s
BENCHMARKs: VC: 207.844s VO:   0.170s A:   0.000s Sys:   2.641s =  210.654s
BENCHMARKs: VC: 207.917s VO:   0.171s A:   0.000s Sys:   2.633s =  210.722s
BENCHMARKs: VC: 207.760s VO:   0.170s A:   0.000s Sys:   2.634s =  210.564s
BENCHMARKs: VC: 207.879s VO:   0.172s A:   0.000s Sys:   2.617s =  210.668s
BENCHMARKs: VC: 207.367s VO:   0.170s A:   0.000s Sys:   2.635s =  210.172s
BENCHMARKs: VC: 208.025s VO:   0.170s A:   0.000s Sys:   2.629s =  210.824s
BENCHMARKs: VC: 207.421s VO:   0.170s A:   0.000s Sys:   2.623s =  210.213s
BENCHMARKs: VC: 207.879s VO:   0.170s A:   0.000s Sys:   2.618s =  210.667s
BENCHMARKs: VC: 207.960s VO:   0.171s A:   0.000s Sys:   2.635s =  210.765s
BENCHMARKs: VC: 207.909s VO:   0.170s A:   0.000s Sys:   2.628s =  210.707s
BENCHMARKs: VC: 207.877s VO:   0.170s A:   0.000s Sys:   2.627s =  210.675s
--
SL-C860 XorABuild/GPE
Sandisk Connect Plus SD/1GMB CF/512M
BT PCMCIA

Serge

  • Jr. Member
  • **
  • Posts: 51
    • View Profile
Mplayer Development And Optimization For Arm
« Reply #59 on: September 04, 2007, 03:03:28 pm »
OK, thanks, so at least this IDCT optimization is useful on Zaurus too. I'll try to submit it upstream soon, so that we all have it in mplayer 1.0rc2 whenever it gets released.

But video performance on Zaurus looks quite bad according to this benchmark, hence the significantly lower relative effect of the IDCT optimization. The poor performance is partially caused by IWMMXT optimizations not getting enabled in the default mplayer 1.0rc1 sources because of a bug. Also, earlier in this thread we got benchmarks from atty's build of mplayer, and it had much better performance. A large part of this improvement was attributed to the use of IPP. But IPP only provides IDCT acceleration, and IDCT looks to be quite fast already (if a 1.5x IDCT performance improvement results in 7-8 seconds of difference, the whole IDCT probably takes no more than 30 seconds of the total decoding time). Even if IPP magically reduced IDCT overhead to zero, too much time would still be wasted somewhere else. Maybe it is still a good idea to try to find the source of this performance bottleneck and fix it once and for all (submitting all the relevant patches to upstream mplayer/ffmpeg)?
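
The "no more than 30 seconds" estimate follows from simple arithmetic, sketched below. It assumes the whole 7-8 second difference comes from the IDCT and that the IDCT got exactly 1.5x faster; the function name `idct_time_before` is only for illustration.

```python
def idct_time_before(seconds_saved, idct_speedup):
    """Old IDCT time T, given that T - T / idct_speedup == seconds_saved.

    Rearranged: T = seconds_saved * idct_speedup / (idct_speedup - 1).
    """
    return seconds_saved * idct_speedup / (idct_speedup - 1.0)


for saved in (7.0, 8.0):
    old = idct_time_before(saved, 1.5)
    new = old / 1.5
    print(f"saved {saved:.0f}s -> IDCT took about {old:.0f}s before, {new:.0f}s after")
```

Against roughly 215 seconds of total video decoding per loop in the Zaurus benchmark, that puts the old IDCT at about a tenth of the decoding time, so most of the time is being spent elsewhere.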

There was an idea about slow memory causing performance problems. But memory performance (both bandwidth and latency) can be easily benchmarked.

Also, could I/O performance (reading from flash memory or HDD) affect video decoding time so much on Zaurus? If so, putting a video clip on a ramdisk should eliminate this factor.
Siarhei Siamashka (ssvb on #maemo, irc.freenode.net)
currently taking part in porting MPlayer to Nokia 770 and Nokia N800, feel free to join :)