Messages - Serge

1
Linux Applications / Mplayer Development And Optimization For Arm
« on: November 07, 2007, 02:26:29 am »
Hello zap,

Please also try testing atty's build without the '-lavdopts idct=16' option (it forces the armv5te optimized idct from ffmpeg, but atty's build should be able to use a more efficient iwmmxt optimized idct from IPP).

Anyway, as already mentioned in this thread, there is something wrong with mplayer running on Zaurus devices (or devices with an XScale core). For example, even the Nokia 770 with its 252MHz ARM9E cpu appears to be faster than Zaurus when playing this matrix video clip (time for decoding it is ~158 seconds). Intuitively it should be the other way around: Zaurus has a much higher cpu clock frequency and supports iwmmxt SIMD instructions in addition to armv5te.

TCPMP might be an interesting option (for somebody else to try), but I'm satisfied with mplayer/ffmpeg on Nokia 770 and N800 at the moment. Translating mplayer performance on Nokia 770 into 'TCPMP percents', it would be something like 118%, and if we try to estimate how it would theoretically run at 624MHz, that would be ~290%. I know that this approximation is wrong, as memory speed also matters a lot, but it looks like TCPMP and ffmpeg should provide at least comparable performance.
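
The 624MHz figure above is nothing more than a linear extrapolation by clock frequency. A minimal sketch of that arithmetic, assuming performance scales proportionally with the cpu clock (which ignores memory speed, as already admitted); the function name is made up for illustration:

```python
def scale_by_clock(percent_at_base, base_mhz, target_mhz):
    """Linearly extrapolate a relative speed figure to another clock frequency."""
    return percent_at_base * target_mhz / base_mhz

# 118% measured on a 252MHz Nokia 770, extrapolated to a 624MHz clock
estimate = scale_by_clock(118.0, 252.0, 624.0)
print(f"estimated relative speed at 624MHz: {estimate:.0f}%")
```

This should be read as an optimistic upper bound rather than a prediction, since memory does not get faster together with the cpu clock.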

In order to get optimal mplayer performance on Zaurus, somebody just needs to profile it there (doing it with gprof is quite simple), find the performance bottlenecks and try to fix them. I might have a look at what's wrong if I got an XScale device to experiment with (I had plans to buy some Motorola EZX phone, A1200 or E6, but these plans are on hold now).

2
Nokia Tablet / The Nokia N800
« on: September 25, 2007, 08:29:02 am »
Quote
Alas, if only it were that simple. The Z uses an ARM5 and the N800 uses an ARM6, and the binaries are not compatible
A minor correction: N800 uses the ARMv6 instruction set (ARM11 core) and Z uses the ARMv5 + IWMMXT instruction set (XScale core).

Anyway, most packages on N800 are compiled for ARMv5 and the same binaries can be used on Nokia 770 as part of OS2007 Hacker's Edition. Third party binary blobs such as skype and other proprietary payload may be compiled with hardware specific optimizations enabled though.

3
Linux Applications / Mplayer Development And Optimization For Arm
« on: September 04, 2007, 03:03:28 pm »
OK, thanks, so at least this IDCT optimization is useful on Zaurus too. I'll try to submit it upstream soon, so that we will all have it in mplayer 1.0rc2 whenever it gets released.

But video performance on Zaurus looks quite bad according to this benchmark, hence the significantly lower relative effect of the IDCT optimization. The poor performance is partially caused by IWMMXT optimizations not getting enabled in the default mplayer 1.0rc1 sources because of a bug. Also, earlier in this thread we got benchmarks from atty's build of mplayer and it had much better performance. A large part of this improvement was attributed to the use of IPP. But IPP only provides IDCT acceleration, and IDCT looks to be quite fast already (if a 1.5x IDCT performance improvement results in 7-8 seconds of difference, the whole IDCT probably takes no more than 30 seconds of all the decoding time). Even if IPP magically reduced IDCT overhead to zero, there would still be too much time wasted somewhere else. Maybe it is still a good idea to try to find the source of this performance bottleneck and fix it once and for all (submitting all the relevant patches to upstream mplayer/ffmpeg)?
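
The "no more than 30 seconds" estimate above can be checked with a couple of lines. This is only a sketch under the assumption that the 7-8 seconds saved come entirely from IDCT getting 1.5x faster; the function name is made up for illustration:

```python
def idct_time_before(saved_seconds, idct_speedup):
    """Old IDCT time T, solved from T - T / idct_speedup = saved_seconds."""
    return saved_seconds * idct_speedup / (idct_speedup - 1.0)

for saved in (7.0, 8.0):
    before = idct_time_before(saved, 1.5)
    after = before / 1.5
    print(f"saved {saved:.0f}s -> IDCT took ~{before:.0f}s before and ~{after:.0f}s after")
```

So under this assumption the old IDCT accounts for something in the low twenties of seconds, which is consistent with the rough bound given above.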

There was an idea about slow memory causing performance problems. But memory performance (both bandwidth and latency) can be easily benchmarked.

Also, could I/O performance (reading from flash memory or HDD) affect video decoding time so much on Zaurus? In that case, putting a video clip on a ramdisk should eliminate this factor.

4
Linux Applications / Mplayer Development And Optimization For Arm
« on: September 02, 2007, 02:18:17 pm »
Quote
Any improvement at all is very much welcomed - I hope that these optimisations will make it into Angstrom as soon as proven and stable!
Well, I'm maintaining the mplayer package for maemo and already have some good stuff which I would like to contribute to ffmpeg. I'm only posting a test code sample here to ensure that my submissions will not cause any regressions on XScale and will not hurt you. This code has already proven very useful on Nokia internet tablets, and most likely will be good for Zaurus too. But nobody knows for sure, so it is better to test everything (as the test done by Civil proved). Having an XScale device for testing would be useful for fine-tuning the code for better performance and probably even trying IWMMXT optimizations, but I'm not sure if I want to spend 400-500 euro on just one more toy. Maybe if somebody could lend me an XScale powered linux PDA for a few weekends, everything would be much easier and faster.

By the way, here are the latest synthetic benchmarks of ARMv5TE optimized IDCT (SVN revision 249) on Nokia N800 as its ARM11 cpu is similar to XScale:
Code:
$ ./test-idct --freq=330
Assuming cpu clock frequency 330MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=685.8
simple_idct_put_armv5te  cache=no,  time=780.4
simple_idct_put_armv5te  cache=yes, time=770.0
simple_idct_add_armv5te  cache=no,  time=984.9
simple_idct_add_armv5te  cache=yes, time=853.3
simple_idct_add_pf_pld_armv5te  cache=no,  time=940.9
simple_idct_add_pf_pld_armv5te  cache=yes,  time=863.1
simple_idct_add_pf_ldr_armv5te  cache=no, time=958.3
simple_idct_add_pf_ldr_armv5te  cache=yes, time=862.5
simple_idct_armv5te_ref  time=1088.1
simple_idct_put_armv5te_ref  cache=no,  time=1286.2
simple_idct_put_armv5te_ref  cache=yes, time=1282.9
simple_idct_add_armv5te_ref  cache=no,  time=1518.2
simple_idct_add_armv5te_ref  cache=yes, time=1393.9
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1147.0
simple_idct_put_armv5te  cache=no,  time=1240.9
simple_idct_put_armv5te  cache=yes, time=1233.8
simple_idct_add_armv5te  cache=no,  time=1467.0
simple_idct_add_armv5te  cache=yes, time=1317.2
simple_idct_add_pf_pld_armv5te  cache=no,  time=1403.5
simple_idct_add_pf_pld_armv5te  cache=yes,  time=1366.2
simple_idct_add_pf_ldr_armv5te  cache=no, time=1438.8
simple_idct_add_pf_ldr_armv5te  cache=yes, time=1341.3
simple_idct_armv5te_ref  time=1872.6
simple_idct_put_armv5te_ref  cache=no,  time=2065.1
simple_idct_put_armv5te_ref  cache=yes, time=2064.9
simple_idct_add_armv5te_ref  cache=no,  time=2308.4
simple_idct_add_armv5te_ref  cache=yes, time=2179.2


Also, here is a more realistic test with the matrixbench_normdivx_vbrmp3.avi video clip from http://samples.mplayerhq.hu/benchmark/testsuite1/
Code:
Benchmark with current IDCT:
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 135.127s VO:   0.163s A:   0.000s Sys:   1.387s =  136.677s
BENCHMARKs: VC: 132.337s VO:   0.153s A:   0.000s Sys:   1.382s =  133.872s
BENCHMARKs: VC: 133.986s VO:   0.148s A:   0.000s Sys:   1.351s =  135.485s
BENCHMARKs: VC: 134.576s VO:   0.174s A:   0.000s Sys:   1.351s =  136.102s
BENCHMARKs: VC: 132.979s VO:   0.161s A:   0.000s Sys:   1.387s =  134.527s
BENCHMARKs: VC: 132.987s VO:   0.145s A:   0.000s Sys:   1.408s =  134.539s
BENCHMARKs: VC: 132.945s VO:   0.150s A:   0.000s Sys:   1.394s =  134.489s
BENCHMARKs: VC: 132.248s VO:   0.152s A:   0.000s Sys:   1.353s =  133.753s
BENCHMARKs: VC: 131.673s VO:   0.152s A:   0.000s Sys:   1.366s =  133.191s
BENCHMARKs: VC: 132.138s VO:   0.149s A:   0.000s Sys:   1.370s =  133.656s
BENCHMARKs: VC: 132.536s VO:   0.144s A:   0.000s Sys:   1.364s =  134.044s
BENCHMARKs: VC: 132.332s VO:   0.148s A:   0.000s Sys:   1.329s =  133.810s

Benchmark with the new optimized IDCT (after replacing 'simple_idct_armv5te.S' and recompiling mplayer):
# mplayer -nosound -vo null -quiet -benchmark -loop 12 -lavdopts idct=16 matrixbench_normdivx_vbrmp3.avi | grep BENCHMARKs
BENCHMARKs: VC: 122.543s VO:   0.162s A:   0.000s Sys:   1.416s =  124.120s
BENCHMARKs: VC: 120.901s VO:   0.152s A:   0.000s Sys:   1.371s =  122.424s
BENCHMARKs: VC: 122.490s VO:   0.147s A:   0.000s Sys:   1.338s =  123.975s
BENCHMARKs: VC: 124.826s VO:   0.151s A:   0.000s Sys:   1.325s =  126.302s
BENCHMARKs: VC: 123.052s VO:   0.143s A:   0.000s Sys:   1.393s =  124.588s
BENCHMARKs: VC: 121.897s VO:   0.146s A:   0.000s Sys:   1.366s =  123.409s
BENCHMARKs: VC: 122.406s VO:   0.139s A:   0.000s Sys:   1.359s =  123.903s
BENCHMARKs: VC: 123.448s VO:   0.150s A:   0.000s Sys:   1.381s =  124.979s
BENCHMARKs: VC: 119.141s VO:   0.143s A:   0.000s Sys:   1.360s =  120.644s
BENCHMARKs: VC: 120.555s VO:   0.147s A:   0.000s Sys:   1.340s =  122.042s
BENCHMARKs: VC: 120.686s VO:   0.141s A:   0.000s Sys:   1.377s =  122.203s
BENCHMARKs: VC: 120.902s VO:   0.143s A:   0.000s Sys:   1.358s =  122.402s

It really confirms a video decoding speedup in the range of 5-10%, as estimated earlier. It will be interesting to see how it works on XScale. It would also be very interesting to compare the performance of this IDCT implementation to the one from IPP, to check which one is faster now and by how much.
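
For anyone who wants to redo the comparison, here is a small sketch that averages the VC (video codec) times of the two runs listed above and reports the relative reduction. The numbers are copied from the listings; the variable names are just for illustration:

```python
from statistics import mean

# VC times in seconds, 12 loops each, copied from the benchmark output above
old_vc = [135.127, 132.337, 133.986, 134.576, 132.979, 132.987,
          132.945, 132.248, 131.673, 132.138, 132.536, 132.332]
new_vc = [122.543, 120.901, 122.490, 124.826, 123.052, 121.897,
          122.406, 123.448, 119.141, 120.555, 120.686, 120.902]

old_mean = mean(old_vc)
new_mean = mean(new_vc)
reduction = (old_mean - new_mean) / old_mean

print(f"current IDCT:   {old_mean:.1f}s on average")
print(f"optimized IDCT: {new_mean:.1f}s on average")
print(f"decoding time reduced by {reduction:.1%}")
```

With these numbers the reduction comes out at roughly 8%, which sits inside the 5-10% range.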

5
Linux Applications / Mplayer Development And Optimization For Arm
« on: August 29, 2007, 01:33:02 am »
I'm sorry for the long delay with an answer. Could you try to run this idct test on XScale again? I believe that the performance regression for 'simple_idct_put_armv5te' should be fixed now.

6
Linux Applications / Mplayer Development And Optimization For Arm
« on: July 15, 2007, 12:40:37 pm »
Quote
pxa270, 416MHz (Zaurus C3100), Gentoo 2007.0, eabi.
...
Thanks for running this test. Almost everything is just as I expected, the XScale pipeline is really very similar to ARM11. The number crunching part of IDCT is now ~1.5x faster ('simple_idct_armv5te' vs. 'simple_idct_armv5te_ref'). Also, everything is very fast if we don't take memory performance into account and all the memory accesses hit the cache.

But generally we are interested in the performance of the functions 'simple_idct_put_armv5te' and 'simple_idct_add_armv5te' when the results get stored into memory and that memory region is not in the cache. Everything is fine with 'simple_idct_add_armv5te' and it really got quite a lot faster. But there seems to be an unexpected problem with 'simple_idct_put_armv5te'. Probably the write buffer (temporary storage in the cpu for memory writes that bypass the cache) overflows and the XScale pipeline stalls, resulting in very bad performance. When 'simple_idct_put_armv5te' stores results into a memory region which is in the cache, it works very fast. I'll try to tweak the code a bit and will ask you to rerun this test a bit later.

Thanks again for running the test. If we had not checked this code on XScale before its submission to ffmpeg, performance on XScale would not have been too good (I don't know how it would have affected overall results, as 'simple_idct_add_armv5te' would speed up and 'simple_idct_put_armv5te' would slow down).

Anyway, after the code gets fixed for XScale, I think we can expect something like a 5-10% overall video decoding improvement on it (depending on the video file).

7
Linux Applications / Mplayer Development And Optimization For Arm
« on: July 14, 2007, 06:04:47 pm »
Quote
I'll see if I can give it a try.

How much is this likely to speed up MPlayer, or is that what you're trying to determine?
IDCT usually takes 20-40% of video decoding time. There will be no huge overall speedup, but the improvement should be quite noticeable (IDCT itself becomes up to 1.5x faster on ARM11). The goal is to reduce the performance difference from the mplayer compiled with IPP (see tjchick's previous post) and possibly beat it.
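
To put a number on "no huge overall speedup": if only the IDCT share of decoding gets faster, the overall gain follows Amdahl's law. A minimal sketch, assuming the 20-40% share and the best-case 1.5x IDCT speedup quoted above (the function name is made up for illustration):

```python
def overall_speedup(idct_fraction, idct_speedup):
    """Amdahl's law: only the IDCT share of the decoding time gets faster."""
    return 1.0 / ((1.0 - idct_fraction) + idct_fraction / idct_speedup)

for fraction in (0.20, 0.40):
    s = overall_speedup(fraction, 1.5)
    print(f"IDCT share {fraction:.0%}: decoding {s:.2f}x faster, "
          f"time reduced by {1.0 - 1.0 / s:.1%}")
```

This gives a reduction of roughly 7% to 13% in decoding time. Since 1.5x is the best case and the variants that store to memory gain less, real numbers are likely to be somewhat lower.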

The best results can be achieved by using IWMMXT instructions though. But some older cores do not support IWMMXT (PXA255 for example) and a tweaked ARMv5TE IDCT would be handy there. Also, an IWMMXT optimized IDCT still needs to be written and this ARMv5TE IDCT can serve as a placeholder until then.

8
Linux Applications / Mplayer Development And Optimization For Arm
« on: July 14, 2007, 05:16:53 pm »
Hi, I'm working on further optimizing the ARMv5 IDCT for mplayer/ffmpeg. The older implementation from mplayer 1.0rc1 was only optimized for ARM9E cores. Now it should get noticeably faster on long pipeline cores such as XScale (Sharp Zaurus) and ARM11 (Nokia N800).

Can anybody compile and run the following test on XScale:

> svn checkout https://garage.maemo.org/svn/mplayer/trunk/libavcodec
> cd libavcodec/tests
> make test-idct

You may need to specify the name of your cross compiler when running make (e.g. 'CC="arm-softfloat-linux-gnueabi-gcc" make test-idct').

After that, please copy the 'test-idct' binary to your device and run it, specifying the cpu clock frequency on the command line (for a 416MHz Zaurus it would be './test-idct --freq=416').

For those who are curious, here are the results from running this test on Nokia 770:
Code:
> ./test-idct --freq=252
Assuming cpu clock frequency 252MHz (ARMv6 disabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=886.0
simple_idct_put_armv5te  cache=no,  time=1062.2
simple_idct_put_armv5te  cache=yes, time=1032.8
simple_idct_add_armv5te  cache=no,  time=1323.7
simple_idct_add_armv5te  cache=yes, time=1186.2
simple_idct_armv5te_ref  time=1041.8
simple_idct_put_armv5te_ref  cache=no,  time=1257.6
simple_idct_put_armv5te_ref  cache=yes, time=1253.0
simple_idct_add_armv5te_ref  cache=no,  time=1561.9
simple_idct_add_armv5te_ref  cache=yes, time=1445.6
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1423.4
simple_idct_put_armv5te  cache=no,  time=1665.7
simple_idct_put_armv5te  cache=yes, time=1655.3
simple_idct_add_armv5te  cache=no,  time=1934.6
simple_idct_add_armv5te  cache=yes, time=1783.8
simple_idct_armv5te_ref  time=1698.6
simple_idct_put_armv5te_ref  cache=no,  time=1914.0
simple_idct_put_armv5te_ref  cache=yes, time=1911.6
simple_idct_add_armv5te_ref  cache=no,  time=2221.2
simple_idct_add_armv5te_ref  cache=yes, time=2098.9

Results for Nokia N800:
Code:
> ./test-idct --freq=330 --enable-armv6
Assuming cpu clock frequency 330MHz (ARMv6 enabled)
Please be patient and wait for the results, test requires quite a lot of time to run...
correctness tests passed
--- benchmarking with zero idct coefficients ---
simple_idct_armv5te  time=751.3
simple_idct_put_armv5te  cache=no,  time=947.7
simple_idct_put_armv5te  cache=yes, time=866.9
simple_idct_add_armv5te  cache=no,  time=1099.2
simple_idct_add_armv5te  cache=yes, time=937.6
simple_idct_armv5te_ref  time=1084.5
simple_idct_put_armv5te_ref  cache=no,  time=1288.4
simple_idct_put_armv5te_ref  cache=yes, time=1280.5
simple_idct_add_armv5te_ref  cache=no,  time=1538.2
simple_idct_add_armv5te_ref  cache=yes, time=1397.9
simple_idct_armv6  time=762.4
simple_idct_put_armv6  cache=no,  time=1034.9
simple_idct_put_armv6  cache=yes, time=765.4
simple_idct_add_armv6  cache=no,  time=1063.2
simple_idct_add_armv6  cache=yes, time=903.2
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1220.0
simple_idct_put_armv5te  cache=no,  time=1413.3
simple_idct_put_armv5te  cache=yes, time=1355.4
simple_idct_add_armv5te  cache=no,  time=1576.0
simple_idct_add_armv5te  cache=yes, time=1417.2
simple_idct_armv5te_ref  time=1872.0
simple_idct_put_armv5te_ref  cache=no,  time=2079.6
simple_idct_put_armv5te_ref  cache=yes, time=2081.5
simple_idct_add_armv5te_ref  cache=no,  time=2342.7
simple_idct_add_armv5te_ref  cache=yes, time=2190.1
simple_idct_armv6  time=1138.9
simple_idct_put_armv6  cache=no,  time=1426.7
simple_idct_put_armv6  cache=yes, time=1144.8
simple_idct_add_armv6  cache=no,  time=1444.1
simple_idct_add_armv6  cache=yes, time=1281.9

Test results from XScale are needed to check if my assumptions are correct (I used the ARM9E, ARM11 and XScale manuals for reference to write code that works best on all these CPUs, but could only test it on Nokia 770 and N800). Theoretically, results from XScale should be very similar to the results from Nokia N800 (ARM11). Lower numbers are better (they are the time for running IDCT, in cpu cycles). Functions with the '_ref' suffix belong to the reference armv5te optimized idct implementation from mplayer 1.0rc1.
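
If you want to compare your own results against the '_ref' functions without doing it by hand, something like the following sketch can help. It is not part of test-idct, just a hypothetical helper that parses lines in the output format shown above; the sample lines are taken from the N800 results (random coefficients):

```python
import re

SAMPLE = """\
--- benchmarking with random idct coefficients ---
simple_idct_armv5te  time=1220.0
simple_idct_put_armv5te  cache=no,  time=1413.3
simple_idct_add_armv5te  cache=no,  time=1576.0
simple_idct_armv5te_ref  time=1872.0
simple_idct_put_armv5te_ref  cache=no,  time=2079.6
simple_idct_add_armv5te_ref  cache=no,  time=2342.7
"""

def parse(output):
    """Map (function name, cache flag or None) -> time in cpu cycles."""
    results = {}
    for line in output.splitlines():
        m = re.match(r"(\w+)\s+(?:cache=(\w+),\s*)?time=([0-9.]+)", line)
        if m:
            results[(m.group(1), m.group(2))] = float(m.group(3))
    return results

def speedups(results):
    """Reference time divided by optimized time, for every function that has a '_ref' twin."""
    ratios = {}
    for (name, cache), cycles in results.items():
        ref = results.get((name + "_ref", cache))
        if ref is not None:
            ratios[(name, cache)] = ref / cycles
    return ratios

for (name, cache), ratio in speedups(parse(SAMPLE)).items():
    label = name if cache is None else f"{name} (cache={cache})"
    print(f"{label}: {ratio:.2f}x faster than reference")
```

Note that it only compares results from a single block of output, so zero and random coefficient sections should be fed in separately.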

If anybody wants to build an optimized mplayer, you need to download this file and replace simple_idct_armv5te.S in your mplayer sources.

9
Linux Applications / Mplayer Development And Optimization For Arm
« on: March 22, 2007, 02:56:02 pm »
Quote
Quote
By the way, IWMMXT seems to be very close to MMX (there is even an instruction mapping table in the Intel manual). FFmpeg has an MMX optimized IDCT implementation. So maybe a direct conversion of MMX->IWMMXT is not so hard?
Except that ARM has no immediate assignments
The MMX instruction set does not have immediate assignments either. In any case, that's not a big deal.

Quote
and needs aligned data...
FFmpeg takes special care of alignment: many functions have a guaranteed alignment specified for the data they are processing (some SSE instructions require 16-byte alignment after all, so ARM is not the strictest in this respect). Input data for IDCT is also 16-byte aligned, for example, which is more than enough for ARM.

Anyway, somebody just needs to give it a try. To encourage you more and prove that it might work: it looks like atty took the existing MMX implementation of dct_unquantize_h263_intra_mmx and converted it to dct_unquantize_h263_intra_iwmmxt. Probably he did not care about IDCT as he could just use IPP instead, so maybe doing a conversion from MMX to IWMMXT for IDCT is also possible with not so much work (everything is relative of course). I wonder which implementation would be faster? On one hand, IPP is a library developed by professionals from Intel; on the other hand, FFmpeg has proved to be very well optimized, beating many other codecs on the x86 platform, and the default IDCT used in it is MMX optimized.

10
Linux Applications / Mplayer Development And Optimization For Arm
« on: March 21, 2007, 12:26:33 pm »
You can try to override idct by using '-lavdopts idct=<some_number>' in atty's build and test it. After getting the numbers we can see if it is really IPP that matters, or whether atty's build has some other optimizations.

By the way, IWMMXT seems to be very close to MMX (there is even an instruction mapping table in the Intel manual). FFmpeg has an MMX optimized IDCT implementation. So maybe a direct conversion of MMX->IWMMXT is not so hard?

11
Linux Applications / Mplayer Development And Optimization For Arm
« on: March 15, 2007, 02:52:36 pm »
Quote
Yes, IWMMX needs OS support, as well as having the right processor. Unfortunately I (and others) can not find a simple, portable method for detecting this. So the only option is to try and use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try to compile MPlayer with the use of Intel IPP (Integrated Performance Primitives) library and check if it helps to improve performance.

Quote
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.
This should not be a problem as dct_unquantize_h263_inter is not a performance critical function. But it is pretty much similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with the statistics about decoding this Doom video clip on Nokia 770: http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

Quote
So either the version which was in use was pretty good
It was just not performance critical, I wonder how you even managed to see some improvement.

Quote
(be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output),
Actually I find compiler generated code for ARM quite poorly optimized. The compiler can't make good use of conditionally executed instructions, can't use DSP instructions, and can't schedule code in an optimal way to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck, otherwise there is no visible overall performance improvement.

I prefer to always develop some simple performance and correctness tests for the performance critical functions I'm optimizing. So I can ensure that they really provide performance improvement and do not introduce stability issues.

Random assembly hacking is not a productive way of working, for sure.

Quote
or the system is memory bound as others have suggested.
This particular function is run on fully cached data, so memory access time is not important here. I investigated mplayer memory access pattern using valgrind (callgrind tool) getting more or less precise information about cache misses.

Code that heavily depends on memory performance is in motion compensation functions and partially idct (cache write misses for destination buffer).

Quote
It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)
Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.

PS. In order to ensure that video is decoded not only fast, but also correctly, you can use the '-vo md5' option. I noticed some really ugly video decoding artefacts when using the standard ARM optimized IDCT (some vertical stripes on panning scenes), while the output of the ARMv5TE optimized IDCT is identical to the C implementation.

12
Linux Applications / Mplayer Development And Optimization For Arm
« on: March 14, 2007, 01:32:04 pm »
Quote
Yes, you really do - the code gets compiled, but not used, as the code is only installed following a test like this:
if( mm_flags & MM_IWMMXT ) -> install dsp code.

It fills in mm_flags with 0! There is some code to override this using avctx->dsp_mask & FF_MM_FORCE, but I did not look too hard at getting this going. I wonder if this is related to the lavdopts somehow?

That's why the others only saw a 2% improvement (compiling with the better tune options), and I see a 30% or so improvement.
Thanks for the detailed explanation, it clarifies the current situation a lot. When I submitted ARMv5TE instructions support for MPlayer configure, I could not verify that IWMMXT works as well (for an obvious reason, I don't have any device that supports IWMMXT): http://lists.mplayerhq.hu/pipermail/mplaye...ber/046537.html

Please check the latest MPlayer SVN just as Meanie suggested, and if it still has problems with enabling iwmmxt, please try to make a clean fix and submit this patch upstream. If you check the first post in this thread, you will see that upstream developers are not very familiar with the ARM platform. Only atty made some improvements for MPlayer at some time in the past, but he is unwilling to help upstream integrate his fixes for whatever reason. So it is up to us (and you as well) to work on improving ARM support in MPlayer (including IWMMXT support). Nobody else can do this job. And upstream developers are not obliged to fix our problems.

PS. I'm sorry if it was me who created a false impression of IWMMXT being fully supported in MPlayer 1.0rc1.

edit: IWMMXT has some additional registers, so saving/restoring them on context switches probably has to be supported by the kernel? Maybe these extra checks in mplayer are there to ensure that it is safe to use iwmmxt even though the cpu itself may support it? Anyway, that was just a wild guess, I'm not familiar with XScale at all.

And thanks for actually digging into the code and checking if iwmmxt really works, the results posted in this thread were suspicious from the very start.

13
Linux Applications / Mplayer Development And Optimization For Arm
« on: March 14, 2007, 12:14:16 pm »
Quote
Hmm. It looks like the mplayer 1.0rc1 code includes iwmmxt stuff, but does not actually use it unless you change the code.
Do you really need to change the code to use iwmmx? Isn't it a simple matter of properly running configure?

Did you try using something similar to what I suggested in this thread before?
CFLAGS="-O4 -mcpu=iwmmxt -fomit-frame-pointer -ffast-math" ./configure
make

14
Zaurus - pdaXrom / New Movie Player Solution - Ffplay
« on: February 24, 2007, 06:14:38 am »
Quote
they are all based on ffmpeg anyway so ffplay, tcpmp, mplayer, etc.. all will have similar performance depending on the compile time optimisations and/or build options as well as the cvs/svn version they are based on...
As far as I know, TCPMP has both its own optimized decoders (better optimized) and ffmpeg (better compatibility).

But I still think that ffmpeg has good potential for further optimizations on ARM devices. Some ARM related optimizations were added recently (they can be found in MPlayer 1.0rc1), some are still in SVN and will be available when the next version of MPlayer gets released. Even more optimizations will be added later.

Right now the 252MHz Nokia 770 is quite good for video playback (using mplayer), it is almost able to play 512x384 videos smoothly. Anything lower than that can be watched more or less successfully. A newer model, the 330MHz Nokia N800, has no problems with 512x384 videos (but it still has very annoying problems with tearing because of improper vsync).

15
Zaurus - pdaXrom / New Movie Player Solution - Ffplay
« on: February 23, 2007, 09:26:16 pm »
Quote
I did a lot of work wrt video playback on the zaurus and the bottleneck is indeed screen drawing. atty's mplayer and my optimised mplayer compile is well capable of decoding videos 640x480 and ~768kbps bitrates but really screws up in drawing.
On the other hand, screen drawing performance is not as critical as decoding. It is possible to skip drawing some frames on heavy video, but you can't skip decoding without getting some very ugly artefacts on screen. Also, screen drawing performance can probably be improved a lot: it takes ~20% of cpu resources or less on Nokia 770 now (with hardware YUV colorspace support and JIT accelerated software scaling). From what I read, Zaurus should have at least comparable capabilities, if not better.

As for decoding performance, my observations show that increasing resolution has a much higher impact on performance than bitrate (low resolution videos can have a bitrate way higher than 1000kbps and play nicely, but 640x480 even with a low bitrate is a challenge). Also, how did you check that mplayer is capable of decoding this video? Just running it with the -benchmark option and verifying that decoding took less time than the video clip length is not enough to guarantee smooth playback. Resource consumption for decoding can vary a lot between different frames; the most complicated scenes are those which contain a lot of panning and motion. So you can get really bad performance on some scenes while cpu usage would be quite low on others. One example of such a video is the Doom clip that was used to test mplayer performance on some devices here: https://www.oesf.org/forums/index.php?showtopic=22280
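
To illustrate why the -benchmark total can be misleading, here is a small sketch with completely made-up per-frame decode times (mplayer does not print them in this form, the numbers are hypothetical): the total decoding time is well below the clip length, yet a burst of heavy frames still misses the per-frame deadline.

```python
def late_frames(decode_ms, fps):
    """Indices of frames whose decode time alone exceeds the per-frame time budget."""
    budget_ms = 1000.0 / fps
    return [i for i, t in enumerate(decode_ms) if t > budget_ms]

# Hypothetical clip: 80 calm frames at 20ms each, then 20 panning frames at 70ms each
decode_ms = [20.0] * 80 + [70.0] * 20
fps = 25.0

total_decode_s = sum(decode_ms) / 1000.0
clip_length_s = len(decode_ms) / fps

print(f"total decode time {total_decode_s:.1f}s vs clip length {clip_length_s:.1f}s")
print(f"frames over the {1000.0 / fps:.0f}ms budget: {len(late_frames(decode_ms, fps))}")
```

Buffering of decoded frames could hide a short burst like this, but a long heavy scene will still stutter even though the average looks fine.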

I don't know if TCPMP is that much better inherently (and if video output is the real bottleneck, TCPMP would crawl too until it gets optimized video output). Probably just optimizing mplayer and ffmpeg can yield comparable results.
