I am currently working on new dumps for Wikipedia (check out http://www.crispy-cow.de/wikimedia/). I hope Rafal ("rafm") and I can work together on improving things.
I have been looking at the quality of some of the dumps in BEDIC format and have seen some of the \n {} type artifacts, blank articles etc. I have always felt a little powerless to help, given that I really don't have the prerequisite Perl skills to understand the scripts intimately. I am thinking about producing something in C++ capable of doing this, with extensible markup translation parsers, for this project. Initially I have written an extending ring-buffer module capable of reading articles into memory for processing, using the minimum amount of RAM while still accommodating some of the larger articles.
I have tested this ring buffer technique by letting it read the current dumps, which include archive articles approximately 3Mb in size (giving ~1500 x 512-byte buffers in the ring).
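For illustration only, here is a minimal sketch of what an extending ring of 512-byte buffers might look like. The class name `ExtendingRing`, its methods and the growth policy (add one block when the ring is full) are all assumptions made for this example, not the actual module described above.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch: a ring of fixed-size blocks that only grows
// when the article currently being read no longer fits.
class ExtendingRing {
public:
    static constexpr std::size_t kBlockSize = 512;

    explicit ExtendingRing(std::size_t initialBlocks = 4) {
        const std::size_t n = std::max<std::size_t>(initialBlocks, 1);
        for (std::size_t i = 0; i < n; ++i)
            blocks_.push_back(std::make_unique<Block>());
    }

    // Append bytes of the current article, growing the ring if it is full.
    void append(const char* data, std::size_t len) {
        assert(data != nullptr || len == 0);
        while (len > 0) {
            const std::size_t logicalBlock = size_ / kBlockSize;
            const std::size_t offset = size_ % kBlockSize;
            if (logicalBlock == blocks_.size()) grow();
            Block& b = *blocks_[(head_ + logicalBlock) % blocks_.size()];
            const std::size_t n = std::min(len, kBlockSize - offset);
            std::copy(data, data + n, b.begin() + offset);
            data += n;
            len -= n;
            size_ += n;
        }
    }

    // Copy the current article out of the ring as one contiguous string.
    std::string str() const {
        std::string out;
        out.reserve(size_);
        for (std::size_t done = 0; done < size_; done += kBlockSize) {
            const Block& b = *blocks_[(head_ + done / kBlockSize) % blocks_.size()];
            const std::size_t left = size_ - done;
            out.append(b.data(), left < kBlockSize ? left : kBlockSize);
        }
        return out;
    }

    // Finish with the current article; the blocks stay allocated for reuse
    // and the next article starts at the block after the old one ended.
    void clear() {
        const std::size_t usedBlocks = (size_ + kBlockSize - 1) / kBlockSize;
        head_ = (head_ + usedBlocks) % blocks_.size();
        size_ = 0;
    }

    std::size_t size() const { return size_; }
    std::size_t blockCount() const { return blocks_.size(); }

private:
    using Block = std::array<char, kBlockSize>;

    // Insert a fresh block just before the head, which is the logical end
    // of the ring, so the bytes already stored keep their order.
    void grow() {
        blocks_.insert(blocks_.begin() + static_cast<std::ptrdiff_t>(head_),
                       std::make_unique<Block>());
        ++head_;
    }

    std::vector<std::unique_ptr<Block>> blocks_;
    std::size_t head_ = 0;  // physical index of the article's first block
    std::size_t size_ = 0;  // bytes stored for the current article
};
```

In this sketch a typical caller would `append()` each chunk read from the dump, process the result of `str()`, then `clear()` before the next article. Memory only grows when an article is larger than anything seen so far.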
My next step is to work on the markup translation, so I am going to need a complete understanding of the markup used in the Wikipedia articles (this should be fairly easy, as I expect it uses the documented Wiki tags directly) and of the ZBedic markup tags.
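As a rough idea of how an extensible translation step could be structured, the sketch below uses a table of rules that can be added to at run time. It is only an assumption about the design: the names `PairRule` and `MarkupTranslator` are made up, and the output tags in the usage note are placeholders, not real ZBedic tags. The only wiki markup it relies on is the documented `'''bold'''` and `''italic''` quoting.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical rule: a wiki delimiter that toggles between an opening
// and a closing output tag each time it is seen.
struct PairRule {
    std::string wikiDelim;  // e.g. "'''" for bold in wiki markup
    std::string open;       // placeholder output tag, NOT a real ZBedic tag
    std::string close;      // placeholder output tag, NOT a real ZBedic tag
};

class MarkupTranslator {
public:
    // Rules are tried in the order they were added, so longer delimiters
    // (''' before '') must be registered first.
    void addRule(PairRule rule) { rules_.push_back(std::move(rule)); }

    std::string translate(const std::string& in) const {
        std::string out;
        std::vector<bool> isOpen(rules_.size(), false);
        std::size_t i = 0;
        while (i < in.size()) {
            bool matched = false;
            for (std::size_t r = 0; r < rules_.size(); ++r) {
                const std::string& d = rules_[r].wikiDelim;
                if (!d.empty() && in.compare(i, d.size(), d) == 0) {
                    out += isOpen[r] ? rules_[r].close : rules_[r].open;
                    isOpen[r] = !isOpen[r];
                    i += d.size();
                    matched = true;
                    break;
                }
            }
            if (!matched) out += in[i++];  // ordinary text passes through
        }
        return out;
    }

private:
    std::vector<PairRule> rules_;
};
```

With rules `{"'''", "<b>", "</b>"}` and `{"''", "<i>", "</i>"}` registered in that order, the input `'''Perl''' and ''C++''` becomes `<b>Perl</b> and <i>C++</i>`. Swapping in the real target tags would then only mean changing the rule table, which is the point of keeping the rules as data.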
I know that the libbedic/doc directory describes the database format and the tags used in the markup, but I wanted first of all to check whether this is up to date or whether I should be pulling the markup renderer for ZBedic apart to determine new tags. Or, if anyone has a more up-to-date list of markup tags, could they possibly share it please?
- Andy