The problem with PDAs that have GPUs, as you said, is that games which are not optimised for them, and which treat them like a plain frame buffer instead of using the built-in acceleration, do suffer from slower speeds. However, you can't match betaplayer's (the core media player) speed on the Axim X50v with its GPU if you only have the frame buffer.
I would love a C3100 with a 2700G chip (with video out) and an OpenGL-accelerated X server/client.
While you're at it, I don't suppose you want to add auto-vectorisation to gcc for iwmmx? It would probably be something for the end of your list, but it would also be a great thing for the community.
I plan to get a PXA270-optimised feed up soon, with all packages built with optimisation flags and, where applicable, hand-tuned ones.
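To show the kind of loop I mean, here is a rough sketch in plain C. The function name brighten_row is made up for illustration and is not from any real codebase. Every iteration is independent and works on bytes, so a vectorising compiler could in principle handle eight pixels at a time in one 64-bit iwmmx register using a packed saturating add.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: brighten a row of 8-bit pixels by adding `delta`,
 * clamping at 255. No iteration depends on another, which is what makes
 * the loop a candidate for auto-vectorisation. `restrict` promises the
 * compiler that dst and src do not overlap. */
void brighten_row(uint8_t *restrict dst, const uint8_t *restrict src,
                  size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)src[i] + delta;      /* widen so it cannot wrap */
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);  /* saturate at 255 */
    }
}
```

Without vectorisation the compiler emits one load, add, compare and store per pixel, which is where the hand-tuned versions currently win.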
Looking forward to the accelerated frame buffer. For Linux games that use OpenGL, though, you might need to look into iwmmx'ing them as well to get an even larger speed boost. You might also want to look at DirectFB (http://www.directfb.org/index.php), as they have a framework to accelerate anything that displays to the frame buffer. They even have an X server (http://www.directfb.org/index.php?path=Development%2FProjects%2FXDirectFB) that takes advantage of this, and if you port the MMX code to iwmmx you would see a performance boost for general stuff like window managers.
As for ARM asm, try http://www.heyrick.co.uk/assembler/index.html, which has good explanations. Once that is done, look at the Intel website and download the PDF on the XScale architecture, the one on iwmmx, and the one that tells you how to port MMX to iwmmx, as reusing MMX-optimised code will save you time and work. Remember when reading that link that the iwmmx stuff is a coprocessor, as is the performance monitoring and tuning stuff (useful for making code run faster).
One thing you might not have thought of yet: have you overclocked? I can ramp mine up to 624 MHz with no problems. For games, however, you might see more of a performance boost from raising the RAM clock speed as well as the bus speed, because of the copy operations to and from the frame buffer.
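A quick back-of-envelope sketch of why the bus matters. The function name and the numbers (640x480, 16 bits per pixel, 30 fps) are only assumptions for illustration, not measurements from any device.

```c
#include <stdint.h>

/* Hypothetical helper: bytes per second written when one full frame is
 * copied to the frame buffer `fps` times a second. Assumes bits_per_pixel
 * is a multiple of 8. */
uint64_t fb_copy_bytes_per_sec(uint32_t width, uint32_t height,
                               uint32_t bits_per_pixel, uint32_t fps)
{
    uint64_t bytes_per_frame =
        (uint64_t)width * height * (bits_per_pixel / 8u);
    return bytes_per_frame * fps;
}
```

With those assumed numbers, fb_copy_bytes_per_sec(640, 480, 16, 30) comes to 18,432,000 bytes a second of writes, and reading the source buffer adds about the same again. None of that gets faster when only the CPU core clock goes up.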
I think your list is good, but I would add:
optimise the Mesa OpenGL implementation
find an OpenGL ES implementation (it does exist and is iwmmx optimised)
OpenGL ES is for embedded systems and can run without floating point operations, using fixed point instead. It is both a superset and a subset of OpenGL: a subset because some functions and the floating point requirement were removed, and a superset because of additional functions for things like movie encoding acceleration.
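For anyone who hasn't done maths without floating point before, here is a minimal sketch of 16.16 fixed point, the layout OpenGL ES 1.x uses for its GLfixed type. The names fixed_t, FX_ONE, fx_from_int, fx_mul and fx_div are my own for illustration and are not part of any GL API.

```c
#include <stdint.h>

/* 16.16 fixed point: upper 16 bits are the integer part, lower 16 bits
 * the fraction. All names here are hypothetical, not GL calls. */
typedef int32_t fixed_t;
#define FX_ONE ((fixed_t)65536)   /* 1.0 in 16.16 */

/* Convert a small integer to fixed point (multiply rather than shift,
 * so negative inputs are well defined). */
static fixed_t fx_from_int(int x) { return (fixed_t)x * FX_ONE; }

/* Multiply: do the product in 64 bits, then drop the extra 16 fraction
 * bits. Right-shifting a negative value is implementation-defined in C;
 * this assumes the usual arithmetic shift, as gcc on ARM does. */
static fixed_t fx_mul(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * b) >> 16);
}

/* Divide: pre-scale the numerator by 2^16 so the quotient keeps its
 * fraction bits. The caller must make sure b is not zero. */
static fixed_t fx_div(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * 65536) / b);
}
```

So 3 times 0.5 is fx_mul(fx_from_int(3), FX_ONE / 2), which gives 98304, i.e. 1.5 in 16.16. Everything stays in integer registers, which is the whole point on a chip with no FPU.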
And if you could optimise the Blowfish and AES ciphers (there are MMX versions already) I would be grateful. If you get icc working with the kernel, drop us a line, as I would want to do that as well.
BTW, I had an X30 High as my last PDA. That was one great PDA, until the X50 came out.