Yes, IWMMXT needs OS support as well as the right processor. Unfortunately, I (and others) cannot find a simple, portable method for detecting this. So the only option is to try to use iwmmxt if it is compiled in - you need to turn on compile switches to get it.
That's probably fine. By the way, you can also try compiling MPlayer against the Intel IPP (Integrated Performance Primitives) library and check whether it improves performance.
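By the way, since there is no portable runtime check, the usual pattern is to select the implementation entirely at build time. A minimal sketch of that pattern, with made-up macro and function names (HAVE_IWMMXT, clear_block_iwmmxt) just for illustration:

```c
#include <stdio.h>

/* Illustration only: the macro and function names here (HAVE_IWMMXT,
 * clear_block_iwmmxt, ...) are made up for this sketch. */
typedef struct {
    void (*clear_block)(short *block);
} MyDSPContext;

static void clear_block_c(short *block)
{
    for (int i = 0; i < 64; i++)
        block[i] = 0;
}

#ifdef HAVE_IWMMXT
/* The real thing would be written with WMMX instructions and is only
 * compiled when the toolchain is given the right -mcpu/-march switches.
 * This stub just keeps the sketch compilable. */
static void clear_block_iwmmxt(short *block)
{
    clear_block_c(block);
}
#endif

static void dsp_init(MyDSPContext *c)
{
    c->clear_block = clear_block_c;        /* portable default */
#ifdef HAVE_IWMMXT
    c->clear_block = clear_block_iwmmxt;   /* all-or-nothing: no runtime check */
#endif
}

int main(void)
{
    short block[64];
    MyDSPContext c;

    dsp_init(&c);
    c.clear_block(block);
    printf("block[0] = %d\n", block[0]);
    return 0;
}
```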
I also noted one more thing - the iwmmxt code does not provide the h263_inter function, so I changed ffmpeg to use the armv5 version. This provided a small speed increase.
This should not be a problem, as dct_unquantize_h263_inter is not a performance-critical function. But it is very similar to dct_unquantize_h263_intra (which consumes a noticeable amount of decoding time, something like ~7%), so implementing it was quite easy. You can see some gprof output with statistics about decoding this Doom video clip on the Nokia 770:
http://lists.mplayerhq.hu/pipermail/ffmpeg...ary/050363.html

So either the version which was in use was pretty good
It was just not performance critical; I wonder how you even managed to see any improvement.
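For reference, the inter and intra variants really are almost the same small loop. A simplified C sketch of the idea (not the exact FFmpeg code; the intra variant mainly differs in how the DC coefficient is treated):

```c
#include <stdint.h>

/* Simplified sketch of H.263-style dequantization: every nonzero
 * coefficient is scaled by 2*qscale and biased away from zero.
 * Not the exact FFmpeg implementation. */
void dct_unquantize_h263_sketch(int16_t *block, int n_coeffs, int qscale)
{
    int qmul = qscale << 1;
    int qadd = (qscale - 1) | 1;

    for (int i = 0; i < n_coeffs; i++) {
        int level = block[i];
        if (level) {
            if (level < 0)
                level = level * qmul - qadd;
            else
                level = level * qmul + qadd;
            block[i] = (int16_t)level;
        }
    }
}
```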
(be warned - it is easy to spend a lot of time writing arm assembler which is *worse* than the compiler output),
Actually I find compiler-generated code for ARM quite poorly optimized. It can't make good use of conditionally executed instructions, can't use the DSP instructions, and doesn't schedule code optimally to avoid pipeline stalls. Of course, it only makes sense to optimize code that is a bottleneck if you want any visible overall performance improvement.
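As a small illustration of the DSP instruction point (assuming an ARMv5TE-capable target; the inline assembly is just a sketch): a saturating add written in portable C becomes a compare-and-branch sequence, while ARMv5TE has a single qadd instruction for it, which the compiler will not emit on its own.

```c
#include <stdio.h>
#include <limits.h>

/* Portable saturating add: the compiler typically turns this into
 * compares and branches. */
static int sat_add_c(int a, int b)
{
    long long s = (long long)a + b;
    if (s > INT_MAX) return INT_MAX;
    if (s < INT_MIN) return INT_MIN;
    return (int)s;
}

#if defined(__arm__)
/* The same operation as a single ARMv5TE DSP instruction (qadd).
 * Needs a target with the DSP extension, e.g. -march=armv5te. */
static int sat_add_qadd(int a, int b)
{
    int r;
    __asm__ ("qadd %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}
#endif

int main(void)
{
    printf("%d\n", sat_add_c(INT_MAX, 100)); /* clamps to INT_MAX */
#if defined(__arm__)
    printf("%d\n", sat_add_qadd(INT_MAX, 100));
#endif
    return 0;
}
```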
I prefer to always develop some simple performance and correctness tests for the performance-critical functions I'm optimizing, so I can ensure that they really provide a performance improvement and do not introduce stability issues.
Random assembly hacking is certainly not a productive way of working.
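A minimal sketch of the kind of harness I mean, using the unquantize loop as the example; the "optimized" function here is just a stand-in so the code compiles, in practice it would be the assembler routine:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

typedef int16_t DCTELEM;

/* Reference C implementation: the known-good baseline. */
static void unquant_ref(DCTELEM *block, int n, int qmul, int qadd)
{
    for (int i = 0; i < n; i++) {
        int level = block[i];
        if (level)
            block[i] = (DCTELEM)(level < 0 ? level * qmul - qadd
                                           : level * qmul + qadd);
    }
}

/* Candidate under test: in real use this would be the hand-written
 * assembler routine; here it is just a stand-in for the sketch. */
static void unquant_opt(DCTELEM *block, int n, int qmul, int qadd)
{
    unquant_ref(block, n, qmul, qadd);
}

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    DCTELEM src[64], a[64], b[64];

    /* Correctness: many random blocks, results must match bit-exactly. */
    for (int iter = 0; iter < 10000; iter++) {
        for (int i = 0; i < 64; i++)
            src[i] = (DCTELEM)(rand() % 512 - 256);
        memcpy(a, src, sizeof(a));
        memcpy(b, src, sizeof(b));
        unquant_ref(a, 64, 10, 5);
        unquant_opt(b, 64, 10, 5);
        if (memcmp(a, b, sizeof(a)) != 0) {
            printf("mismatch at iteration %d\n", iter);
            return 1;
        }
    }

    /* Performance: time many calls on the same (cached) block; the
     * memcpy restores the input and adds a small constant overhead. */
    double t = now_sec();
    for (int iter = 0; iter < 1000000; iter++) {
        memcpy(a, src, sizeof(a));
        unquant_opt(a, 64, 10, 5);
    }
    printf("ok, %.3f s for 1M calls\n", now_sec() - t);
    return 0;
}
```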
or the system is memory bound as others have suggested.
This particular function runs on fully cached data, so memory access time is not important here. I investigated MPlayer's memory access patterns using valgrind (the callgrind tool), getting more or less precise information about cache misses.
The code that depends heavily on memory performance is in the motion compensation functions and, partially, the IDCT (cache write misses for the destination buffer).
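To illustrate why motion compensation is memory bound: stripped of the interpolation/averaging variants, the basic operation is just copying a block of pixels between two buffers, so there is almost no computation per byte touched (a simplified sketch, not the actual FFmpeg put_pixels code):

```c
#include <stdint.h>

/* Simplified sketch of an 8-pixel-wide block copy as used in motion
 * compensation: one load and one store per pixel and almost no
 * arithmetic, so speed is dictated by memory/cache behaviour. */
void copy_block8(uint8_t *dst, const uint8_t *src,
                 int dst_stride, int src_stride, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = src[x];
        dst += dst_stride;
        src += src_stride;
    }
}
```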
It might be worth looking at joining together more of the reads and writes if possible (the system uses SDRAM, so the performance for single words sucks compared to 2 words etc, in the case of an overstretched cache)
Yes, paying special attention to accessing memory properly and using prefetch can improve performance quite noticeably.
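For example, the same kind of block copy can join the accesses into 32-bit words and add a prefetch hint for the next line; this is only a sketch and assumes the pointers and strides are 4-byte aligned:

```c
#include <stdint.h>

/* Sketch of the 8-pixel copy with accesses joined into 32-bit words and
 * a prefetch hint for the next source line. Assumes src/dst and both
 * strides are 4-byte aligned; __builtin_prefetch maps to the ARM 'pld'
 * instruction on cores that have it (ARMv5TE and later). */
void copy_block8_words(uint8_t *dst, const uint8_t *src,
                       int dst_stride, int src_stride, int h)
{
    for (int y = 0; y < h; y++) {
        const uint32_t *s = (const uint32_t *)src;
        uint32_t *d = (uint32_t *)dst;

        __builtin_prefetch(src + src_stride);  /* start fetching next line */
        d[0] = s[0];
        d[1] = s[1];

        dst += dst_stride;
        src += src_stride;
    }
}
```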
PS. In order to ensure that video is decoded not only fast but also correctly, you can use the '-vo md5' option. I noticed some really ugly video decoding artefacts when using the standard ARM-optimized IDCT (vertical stripes in panning scenes); the ARMv5TE-optimized IDCT is identical to the C implementation.