The cxxx models can also use iwmmxt instructions. A crude test showed it only gives a ~2% improvement, but there's a lot of room for improvement.
That seems a bit too low; I wonder whether mplayer was configured and compiled correctly. The point is that the motion compensation code in mplayer is currently much better optimized for iwmmxt (all of that work was done by atty). You can see this by looking at the mplayer sources.
Here is the code used for ARM without iwmmxt (libavcodec/armv4l/dsputil_arm.c):
/* c->put_pixels_tab[0][0] = put_pixels16_arm; */ // NG!
c->put_pixels_tab[0][1] = put_pixels16_x2_arm; //OK!
c->put_pixels_tab[0][2] = put_pixels16_y2_arm; //OK!
/* c->put_pixels_tab[0][3] = put_pixels16_xy2_arm; /\* NG *\/ */
/* c->put_no_rnd_pixels_tab[0][0] = put_pixels16_arm; */
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_arm; // OK
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_arm; //NG */
c->put_pixels_tab[1][0] = put_pixels8_arm; //OK
c->put_pixels_tab[1][1] = put_pixels8_x2_arm; //OK
/* c->put_pixels_tab[1][2] = put_pixels8_y2_arm; //NG */
/* c->put_pixels_tab[1][3] = put_pixels8_xy2_arm; //NG */
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_arm;//OK
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_arm; //OK
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_arm; //OK
/* c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_arm;//NG */
Compare it with the following (libavcodec/armv4l/dsputil_iwmmxt.c):
c->put_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_pixels_tab[0][1] = put_pixels16_x2_iwmmxt;
c->put_pixels_tab[0][2] = put_pixels16_y2_iwmmxt;
c->put_pixels_tab[0][3] = put_pixels16_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[0][0] = put_pixels16_iwmmxt;
c->put_no_rnd_pixels_tab[0][1] = put_no_rnd_pixels16_x2_iwmmxt;
c->put_no_rnd_pixels_tab[0][2] = put_no_rnd_pixels16_y2_iwmmxt;
c->put_no_rnd_pixels_tab[0][3] = put_no_rnd_pixels16_xy2_iwmmxt;
c->put_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_pixels_tab[1][1] = put_pixels8_x2_iwmmxt;
c->put_pixels_tab[1][2] = put_pixels8_y2_iwmmxt;
c->put_pixels_tab[1][3] = put_pixels8_xy2_iwmmxt;
c->put_no_rnd_pixels_tab[1][0] = put_pixels8_iwmmxt;
c->put_no_rnd_pixels_tab[1][1] = put_no_rnd_pixels8_x2_iwmmxt;
c->put_no_rnd_pixels_tab[1][2] = put_no_rnd_pixels8_y2_iwmmxt;
c->put_no_rnd_pixels_tab[1][3] = put_no_rnd_pixels8_xy2_iwmmxt;
c->avg_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_pixels_tab[0][1] = avg_pixels16_x2_iwmmxt;
c->avg_pixels_tab[0][2] = avg_pixels16_y2_iwmmxt;
c->avg_pixels_tab[0][3] = avg_pixels16_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][0] = avg_pixels16_iwmmxt;
c->avg_no_rnd_pixels_tab[0][1] = avg_no_rnd_pixels16_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][2] = avg_no_rnd_pixels16_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[0][3] = avg_no_rnd_pixels16_xy2_iwmmxt;
c->avg_pixels_tab[1][0] = avg_pixels8_iwmmxt;
c->avg_pixels_tab[1][1] = avg_pixels8_x2_iwmmxt;
c->avg_pixels_tab[1][2] = avg_pixels8_y2_iwmmxt;
c->avg_pixels_tab[1][3] = avg_pixels8_xy2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][0] = avg_no_rnd_pixels8_iwmmxt;
c->avg_no_rnd_pixels_tab[1][1] = avg_no_rnd_pixels8_x2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][2] = avg_no_rnd_pixels8_y2_iwmmxt;
c->avg_no_rnd_pixels_tab[1][3] = avg_no_rnd_pixels8_xy2_iwmmxt;
As you can see, machines that support iwmmxt have all the motion compensation functions implemented in hand-optimized assembly, while the plain ARM code has several of them commented out (the ones marked NG). It is strange that this only results in about 2% improvement.
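For anyone not familiar with these tables, here is a rough plain C sketch of what one of the entries computes. In ffmpeg's dsputil the first index selects the block width (0 for 16 pixels, 1 for 8) and the second selects the half-pel position (0 none, 1 horizontal, 2 vertical, 3 both). The name ref_put_pixels8_x2 and the rnd parameter are made up for illustration; this is not the ffmpeg code, only the averaging such a function is expected to perform.

```c
#include <stdint.h>

/* Hypothetical reference helper, NOT taken from ffmpeg.
 * Horizontal half-pel interpolation for an 8-pixel-wide block:
 * each output pixel is the average of a source pixel and its right
 * neighbour. rnd = 1 mimics the rounding variant (put_pixels),
 * rnd = 0 the non-rounding variant (put_no_rnd_pixels). */
static void ref_put_pixels8_x2(uint8_t *dst, const uint8_t *src,
                               int stride, int h, int rnd)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = (uint8_t)((src[x] + src[x + 1] + rnd) >> 1);
        dst += stride;  /* next row of the destination */
        src += stride;  /* next row of the source */
    }
}
```

Note that each row reads 9 source bytes to produce 8 output bytes, and the source pointer is generally not word aligned, which is one reason these functions are awkward to write efficiently for plain ARM.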
The c7x0 models would benefit from people helping the libw100 project.
I see, but I can't provide any help here, as the only hardware I have is a Nokia 770. More people interested in improving mplayer performance on different ARM devices are welcome here.
I can only do assembly optimizations for ffmpeg using the armv5te instruction set (including the fast single-cycle multiply dsp instructions).
As for current progress, I have modified valgrind (the callgrind part) to make it simulate read-allocate cache behaviour (arm926 uses such a cache), and now have some information about the parts of the code that cause many cache misses and do a lot of work with memory.
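To show what read-allocate means in practice, here is a toy model, not the actual valgrind patch. It assumes a direct-mapped cache with 32-byte lines and 512 lines, which is a simplification chosen for brevity and not the real arm926 geometry; all names (toy_cache, toy_cache_access) are made up. The only point is that a read miss loads a line into the cache, while a write miss goes straight to memory and leaves the cache unchanged.

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u   /* bytes per cache line (assumed for this toy) */
#define NUM_LINES 512u  /* direct-mapped, 16 KiB in total (assumed) */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned long read_misses;
    unsigned long write_misses;
} toy_cache;

static void toy_cache_init(toy_cache *c)
{
    memset(c, 0, sizeof *c);
}

/* Simulates one access. Returns 1 on a hit, 0 on a miss. */
static int toy_cache_access(toy_cache *c, uint32_t addr, int is_write)
{
    uint32_t line = addr / LINE_SIZE;
    uint32_t idx  = line % NUM_LINES;
    uint32_t tag  = line / NUM_LINES;

    if (c->valid[idx] && c->tag[idx] == tag)
        return 1;

    if (is_write) {
        /* Read-allocate policy: a write miss does NOT fill a line. */
        c->write_misses++;
    } else {
        /* A read miss allocates the line. */
        c->read_misses++;
        c->valid[idx] = 1;
        c->tag[idx]   = tag;
    }
    return 0;
}
```

With this policy, code that writes to a buffer it has not read recently (such as a decoded frame) misses on every write, which a simulator assuming write-allocate would not report.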
Things that may need optimization and could provide some improvement are:
- idct
- motion compensation (for non iwmmxt devices)
- the dct_unquantize_h263_intra function (according to callgrind statistics it accounts for almost 7% of the instructions executed for this Doom video fragment, and it contains lots of multiplications which can be accelerated using dsp instructions); further evidence that it is worth optimizing is that the x86 code also contains an mmx version of this function
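To illustrate the last item, the loop below is a simplified sketch of H.263-style intra dequantization, written from the standard formula and not copied from ffmpeg. The function name, the signature and the dc_scale and last_index parameters are assumptions for illustration, and special cases such as advanced intra coding are ignored.

```c
#include <stdint.h>

/* Simplified sketch, NOT the ffmpeg function.
 * block      : quantized coefficients, block[0] is the DC term
 * qscale     : quantizer scale
 * dc_scale   : multiplier for the DC coefficient (assumed parameter)
 * last_index : index of the last coefficient to process (assumed) */
static void sketch_unquantize_h263_intra(int16_t *block, int qscale,
                                         int dc_scale, int last_index)
{
    const int qmul = qscale << 1;        /* 2 * qscale */
    const int qadd = (qscale - 1) | 1;   /* qscale if odd, qscale - 1 if even */

    block[0] = (int16_t)(block[0] * dc_scale);

    for (int i = 1; i <= last_index; i++) {
        int level = block[i];
        if (level == 0)
            continue;                    /* zero coefficients stay zero */
        /* One multiply plus one add or subtract per nonzero coefficient. */
        level = level < 0 ? level * qmul - qadd
                          : level * qmul + qadd;
        block[i] = (int16_t)level;
    }
}
```

The multiply followed by an add is the kind of pattern that the armv5te 16-bit multiply-accumulate instructions (smlabb, for example) are meant for, so the inner loop looks like a reasonable candidate for hand-written assembly.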
I can also prepare some small test programs for synthetic benchmarking of all these parts of the code (idct, motion compensation, unquantize), so that it will be easier to see whether an optimization has any effect. It is hard to notice a substantial effect from any single one of these optimizations when just monitoring the full video decoding time, but they are cumulative, and all together they can provide quite a visible improvement. I have already done something like this when I tried to optimize the idct code (not a very successful attempt, because it focused on code that was not the real bottleneck: row processing in idct generally takes much less time than column processing):
http://lists.mplayerhq.hu/pipermail/ffmpeg...ber/045837.html
Would anyone want to try running these benchmarks, or take some more active part in optimizing mplayer/ffmpeg?
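A synthetic benchmark of the kind mentioned above can be as simple as calling one function many times and measuring the CPU time. The following is only a minimal sketch using the standard C clock() function; bench_cpu_seconds and count_calls are hypothetical names, and count_calls is a placeholder that would be replaced by a wrapper around the real idct, motion compensation or unquantize routine.

```c
#include <time.h>

/* Type of the routine under test; ctx carries its input data. */
typedef void (*bench_fn)(void *ctx);

/* Calls fn(ctx) 'iterations' times and returns the CPU time used,
 * in seconds, as measured by clock(). */
static double bench_cpu_seconds(bench_fn fn, void *ctx, long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        fn(ctx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload: just counts how often it was called. */
static void count_calls(void *ctx)
{
    ++*(long *)ctx;
}
```

Running the same input through the old and the new version of a function and comparing the two times makes a small gain visible even when it is lost in the noise of the full decoding time. The resolution of clock() is coarse on some systems, so the iteration count should be large enough for a run to take at least a few seconds.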
PS. By the way, is it possible to watch that Doom video clip without (many) dropped frames on a non-overclocked Zaurus?