OESF Portables Forum
Everything Else => General Support and Discussion => Zaurus General Forums => Archived Forums => Software => Topic started by: iamasmith on December 12, 2004, 03:11:53 pm
-
Hi, I have a couple of versions of wiki2bedic.pl and neither of them converts current Wikipedia databases.
I think the latest prebuilt Wikipedia for zbedic is from around 10th July and there's been a lot of activity since.
Anyone got a working version of wiki2bedic.pl for the current database format?
- Andy
-
More specifically, if I run the wiki2bedic script on a current database it produces a bedic.dic file which gives an 'Integrity Failure' when opened on the desktop using bedic (yep, I want to move it to the Z, but I guess it should also work using desktop bedic).
Also there doesn't seem to be an index property at the beginning of the file with the version that I have.
Anyone able to point me at a working version?
-
Hi,
you are on the right track.
You need this:
http://www.freedict.de/download/wiki2bedic.pl
Then you need libbedic, from here:
http://sourceforge.net/project/showfiles.php?group_id=51673&package_id=56566
Then you have to replace dictionary.cpp with this:
http://www.freedict.de/download/dictionary.cpp
(I've heard this may not work with the newest version of libbedic, but I once had success with version 0.9.1.)
Then:
1) make
2) make xerox
After that you will get a binary named xerox. Then:
xerox -d wikipedia.dic wikipedianew.dic
After that you should pack wikipedianew.dic with dictzip.
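For reference, the whole run looks roughly like this as a shell session (the archive name, paths and .dic file names below are only examples/guesses - adjust them to your setup):
# fetch the converter script and the replacement dictionary.cpp
wget http://www.freedict.de/download/wiki2bedic.pl
wget http://www.freedict.de/download/dictionary.cpp
# unpack libbedic 0.9.1 (archive name is a guess) and drop in the patched file,
# replacing the original dictionary.cpp wherever it lives in the tree
tar xzf libbedic-0.9.1.tar.gz
cp dictionary.cpp libbedic-0.9.1/
cd libbedic-0.9.1
make
make xerox
cd ..
# run the converter on the Wikipedia SQL dump (see the script itself for the
# paths it expects), then sort/index the result with xerox
perl wiki2bedic.pl
./libbedic-0.9.1/xerox -d wikipedia.dic wikipedianew.dic   # the script may name its output bedic.dic
# finally compress for zbedic
dictzip wikipedianew.dic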
Please tell me if it worked. I have to do this again for the German Wikipedia but I deleted my environment, because another guy promised to provide the German community with up-to-date Wikipedias, and the folks out there are now running out of Wikipedia updates.
Cheers,
Sam
-
Then you have to replace dictionary.cpp with this:
http://www.freedict.de/download/dictionary.cpp
(I've heard this may not work with the newest version of libbedic, but I once had success with version 0.9.1.)
Could you briefly explain the changes in dictionary.cpp, so that I can add them to the latest version of zbedic? Thanks.
-
Hi,
I am not the author of dictionary.cpp.
It comes from Horst at freedict.de; the only thing I did was translate his posts for the German community and add a few missing details from my own experience.
I think Horst is a nice guy and you can send him an email. If he doesn't respond, I can also write to him in German, since zbedic development seems important to me.
About the file itself: I think Horst made/changed it to be able to convert that Wikipedia format for zbedic.
Hope this helps,
Cheers,
Sam
-
OK, I see xerox builds the index too when it sorts the dictionary.
Tried the versions you suggested, which resulted in some success. However, some articles don't seem to have correctly processed format strings and have \n embedded in the text (the two literal characters, not processed), and if you start to type 'zaurus' in bedic the thing just hangs.
I've now reverted to the unpatched libbedic and a modified version of wiki2bedic.pl, so I'll post an update with the results.
- Andy
-
No, the older version of xerox fails with a segmentation fault before completion.
The newer version of xerox does complete; however, searching for anything in the range X, Y or Z makes bedic crash. Is there an upper limit on the index size, I wonder? The cur (en) version of the Wikipedia database now has 425325 entries...
-
OK, the unpatched version of xerox just runs and runs, consuming memory. I set up my system with 3 GB of swap and 1 GB of RAM, and it eventually failed with a segmentation fault before the swap was depleted.
The patched version seems to run quite nicely without consuming all that memory; however, I think the wiki2bedic.pl script is not doing all that it could. It's producing some blank articles, and other articles have pretty poor formatting (I think there are now extra markup characters in the articles, and \n appears quite a lot in the text).
I'm not really a Perl programmer, but I will take a look to see if there's a sensible way of adding the extra markup (bullet lists particularly seem to fail).
Lookups using qbedic still hang when typing in the article name, and a few articles seem to lack trimming on the article name (they have a leading space).
So, again if anyone is maintaining this script and has a later version I could try then that would be good.
- Andy
-
Hi,
you are on the right track.
You need this:
http://www.freedict.de/download/wiki2bedic.pl
...
I know this is a quite old issue, but maybe somebody can help.
I would like to build a bedic file from the latest Wikipedia dumps. I downloaded the latest dumps from http://download.wikimedia.org/ and wiki2bedic.pl from http://www.freedict.de/download/wiki2bedic.pl, but on running wiki2bedic.pl I got either:
'Cannot opendir /usr/src/packages/zaurus/wikip/1/wiki/de/ : No such file or directory'
or the script ran forever.
Is there a newer version of wiki2bedic.pl? Do I need any other software? Am I using the right Wikipedia dump files?
Thanks.
-
I know this is a quite old issue, but maybe somebody can help.
...
'Cannot opendir /usr/src/packages/zaurus/wikip/1/wiki/de/ : No such file or directory'
...
You have to either change the script or create the mentioned directory before you run the script.
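For example, using the path from the error message above (adjust it if you edit the script instead):
mkdir -p /usr/src/packages/zaurus/wikip/1/wiki/de/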
-
Hi,
I've used the wiki2bedic.pl script (it took 10 hours on the English Wikipedia) and I get a bedic.dic file of 1.3 GB
When I do xerox -d bedic.dic bedic2.dic it works, no errors...
But the resulting file is only about 28 MB.
I'm using libbedic version 0.9.1 with the dictionary.cpp patch.
Version 0.9.4 doesn't compile with the dictionary.cpp patch, and unpatched it produces the same 28 MB file as before...
Any suggestions?
Cheers
Yannick
-
I've used the wiki2bedic.pl script (it took 10 hours on the English Wikipedia) and I get a bedic.dic file of 1.3 GB.
...
But the resulting file is only about 28 MB.
0.9.4 already contains the patch from dictionary.cpp. 28 MB seems too good a result for the compression. Could you send me your wiki2bedic.pl script and the URL you downloaded the Wikipedia dump from, so I can take a look at what goes wrong in xerox.
-
Hi,
thanks for your help :-) Yeah, 28 megs sounds like a revolutionary compression. Maybe we should patent it? arg, no, forget it
Here is where I downloaded the dump:
http://download.wikimedia.org/archives/en/20050209_cur_table.sql.bz2
The script I'm using is the following:
http://elelome.files5.free.fr/wiki2bedic.pl
Thanks so much
Cheers
Yannick
PS: in case it's needed, my e-mail is yannickd AT gmail DOT com or (for gmail haters)
yannick.dutertre AT enst-bretagne DOT fr
-
thanks for your help :-) Yeah, 28 megs sounds like a revolutionary compression.
I have the same problem using the wiki2bedic.pl from freedict; I end up with a file that is waaaay too small...
-
While I'm at it, is zbedic able to display pictures (meaning, do I have to bother with the images and the LaTeX things)?
Thanks
Yannick
-
While I'm at it, is zbedic able to display pictures (meaning, do I have to bother with the images and the LaTeX things)?
A zbedic dictionary file cannot contain any images, but since it displays HTML text, Wikipedia articles can theoretically refer to external image files. It may work if you use only absolute paths to images. I have never tried it, so there is no guarantee it works. And of course you would need huge storage space.
So far I haven't found the time to check why xerox fails with the Wikipedia file, but it is on my todo list.
-
I've used the wiki2bedic.pl script (it took 10 hours on the English Wikipedia) and I get a bedic.dic file of 1.3 GB.
...
But the resulting file is only about 28 MB.
OK, I found the problem. There is an entry in Wikipedia that is longer than 500000 bytes, which is the limit set by the wiki2bedic.pl script. If this limit is exceeded, xerox fails without printing any error :-( (I am currently working on a new version of xerox, which should be more informative about errors).
To fix the problem, just change the line in wiki2bedic.pl from:
print PAGE "max-entry-length=500000\n";
to
print PAGE "max-entry-length=1024000\n";
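(Or as a one-liner, assuming GNU sed with in-place editing:)
sed -i 's/max-entry-length=500000/max-entry-length=1024000/' wiki2bedic.pl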
Have fun!
-
Could someone host the English Wikipedia .dic file somewhere? The one on zbedic's site is a bit outdated.
tovarish
-
Thanks so much rafm, I'm converting the English file of February the ninth. I hope it will work.
I've noticed something though: apparently the new dumps (from 2005/03/09) are not compatible with wiki2bedic.pl, whether bunzipped or not, which would imply a slight change in the SQL format Wikimedia uses.
Cheers
Yannick
-
Thanks so much rafm, I'm converting the English file of February the ninth. I hope it will work.
Thanks from me too, I'm busy converting the exact same file :-) It's on to the xerox stage now... thanks rafm
-
Could someone host the English Wikipedia .dic file somewhere? The one on zbedic's site is a bit outdated.
tovarish
If someone makes it I'll put it up on my site, it's a .mac site so no worries on bandwidth!!
-
If someone makes it I'll put it up on my site, it's a .mac site so no worries on bandwidth!!
Yes, I would really appreciate it; I don't have the resources (disk space and RAM) to convert it myself.
tovarish
-
I've made a wikipedia.dic.dz from the February 9th 2005 dump - I needed about 5 GB of space: 2 GB for the original SQL dump, 1.3 GB for the wiki2bedic .dic and another 1.3 GB for the xeroxed wikipedia.dic version, then a final 0.5 GB for the compressed version.
The file came to 1.3 GB after the xerox process; dictzip wikipedia.dic gave me a 412 MB file.
This loads into zbedic and passes the integrity check. I noticed some textual problems, but mainly the program seems to lock any time I search for something past N in the alphabet. I have a 32 MB swap file activated; I'll experiment some more when I get back home later and see if it is actually usable.
If it works well, I don't mind uploading it somewhere - I can do it later this week from university on a very fast (hopefully) connection. I'll keep the thread posted if anyone is interested.
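(In case anyone wants to try the same and has no swap yet, a swap file on a card can be set up roughly like this - the 32 MB size and the CF mount point are just examples, use whatever fits your card:)
dd if=/dev/zero of=/mnt/cf/swapfile bs=1024 count=32768
mkswap /mnt/cf/swapfile
swapon /mnt/cf/swapfile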
-
Do you have to download both the old and the current dumps, or will just the current do? Also, what are the steps for converting? I haven't really been able to find a how-to.
-
I've made a wikipedia.dic.dz from the February 9th 2005 dump ... If it works well, I don't mind uploading it somewhere - I can do it later this week from university on a very fast (hopefully) connection.
Did this ever make it anywhere? I'd love a current Wikipedia - especially with all the free HD space on the C3000.
-
Sooner or later I would like to put Wikipedia files for English and other languages on the zbedic web page.
I will also check why zbedic locks up past the letter "N".
-
I'm not sure it's letter-related... I noticed this too with an earlier version of Wikipedia... it may have something to do with size, or I may be wrong.
This is why I ended up running Wikipedia as a set of static web pages rather than as a dictionary, but it would be nice to get it sorted.
-
I'm currently running a June 2004 Wikipedia without issue. It's addictive - I wasted a couple of hours reading random entries after I installed it.
This one seems too old to have a Zaurus entry, though. =(
-
I'm currently running a June 2004 Wikipedia without issue. ...
What's the current status of Wikipedia for the Zaurus? How can I set it up as a noob?
All the reading around the Internet didn't help me. :-(
-
What's the current status of Wikipedia for the Zaurus? How can I set it up as a noob? ...
Update: I found and fixed the problem in zbedic with the latest Wikipedia dump (the file size caused an overflow in some arithmetic operations). The fix will be included in the upcoming 0.9.5 release. The latest Wikipedia for zbedic will probably (if quota allows) be available at the zbedic home page.
-
nice
-
ZBEDic 0.9.5, with the lock-on-huge-files bug fixed, is available at the zbedic home page (http://bedic.sf.net/). A newer English Wikipedia file can be downloaded from the SourceForge project page as well.
Have fun!
-
I'm glad you've been updating this app, it's one of the highlights of having my Z. For those wanting the newer wiki, be aware that it's grown a little... from 191 MB to a whopping 412 MB.
-
For those wanting the newer wiki, be aware that it's grown a little... from 191 MB to a whopping 412 MB.
That's what gigabyte+ flash cards and C3000's were made for.
-
I've been downloading it for the past 4 hours!! 2 left. Good thing I have a 1 GB SD and a 1 GB CF card!
-
I've been downloading it for the past 4 hours!! 2 left. ...
Most of the North American SourceForge mirrors don't seem to have the file yet. I had to grab it from somewhere in Europe at ~40kb/s.
I haven't installed it yet, as I'm currently mucking about with everything, so I don't have a stable platform to bother copying a 420 MB file to. Definitely looking forward to it though!
-
Most of the North American SourceForge mirrors don't seem to have the file yet. I had to grab it from somewhere in Europe at ~40kb/s. ...
The one from Paris, France is not bad - keeping steady for me at ~150kb/s right now.
-
I cannot use the Wikipedia dictionary. It shows status: 'entry too long' and the checkbox is grayed out.
tovarish
-
Same here. Crap, I already deleted the old one....
-
Same here. Crap, I already deleted the old one....
You probably have to install the newer zbedic too.
-
Nope, I had the new version already installed. Tried redownloading the new Wikipedia, twice. Same result: file too big. I'm redownloading the old one now.
-
anyone got the new wikipedia to work?
-
anyone got the new wikipedia to work?
It might be that I uploaded a broken file to SF. I will check it today. Meanwhile I have hidden the file so people don't waste bandwidth and time. Sorry.
The latest Wikipedia requires ZBEDic 0.9.5.
-
OK, the file with the English Wikipedia at sf.net should be correct now. The previous one was actually damaged.
I uploaded the corrected file, downloaded and checked it. Everything seemed to work.
File: en-wikipedia_0.9.5_20050209.dic.dz
Size: 432311254 bytes
MD5SUM: 3216ee0a94009621522f42d5a5e6b48e
You need zbedic 0.9.5 to use that wikipedia file!
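(To check your download - plain md5sum, nothing zbedic-specific:)
md5sum en-wikipedia_0.9.5_20050209.dic.dz
# should print: 3216ee0a94009621522f42d5a5e6b48e  en-wikipedia_0.9.5_20050209.dic.dz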
-
OK, the file with the English Wikipedia at sf.net should be correct now. The previous one was actually damaged.
...
You need zbedic 0.9.5 to use that wikipedia file!
I'll give it a go today. Hopefully the wireless is up to it - I'll be downloading straight to the Z.
EDIT: Nope. Not a chance. I'll be downloading this one at home.
-
I'll give it a go today. ... EDIT: Nope. Not a chance. I'll be downloading this one at home.
Just let me know if it works for you.
-
It worked for me, but a lot of the text has "\n"s in it.
It's nice though to have it on the Z.
tovarish
-
The fixed version works great. Quick question though: which set of fonts has all the cool extras like the pi symbol and stuff like that?
-
Thanks for the efforts! To me the Wikipedia dump itself is quite a killer factor for getting a Z.
Got the same issue as posted by others: quite a number of links and texts become either \n or \n*. And I also see differences between the entries on the website and the dump.
Please keep it up!! Looking forward to seeing an improved version!
-
Quick question though: which set of fonts has all the cool extras like the pi symbol and stuff like that?
Math symbols probably won't work if they are shown in Wikipedia as images.
-
There are quite a few blank articles (try CoventGarden), and still quite a few embedded line feeds that haven't been interpreted in the Wikipedia content.
Every time I look at these scripts I think it's an uphill struggle, because I don't know Perl well enough... I might have a go at something in C++.
-
Another blank article is A-10ThunderboltII. It has some on-screen corruption under the title as well.
-
Does anyone have a version of wiki2bedic.pl that would work on the latest SQL dumps? My version (marked in the comments as 0.9 (7.1.2004)) only goes into an infinite loop when run on de.wikipedia or pl.wikipedia.
It may be a good idea to put wiki2bedic.pl under the CVS of the bedic SourceForge project. I would also add some links from the zbedic home page to that file.
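For anyone with developer access, adding the script would look roughly like this (the SourceForge CVS host and the 'bedic' module name are assumptions - substitute your own login):
export CVS_RSH=ssh
cvs -d:ext:username@cvs.sourceforge.net:/cvsroot/bedic checkout bedic
cp wiki2bedic.pl bedic/
cd bedic
cvs add wiki2bedic.pl
cvs commit -m "add wiki2bedic.pl converter script" wiki2bedic.pl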
-
I have a version that is (somewhat) working. At least it doesn't go into an infinite loop. The content of the entries is not perfect -- I see '\n', {sa}, etc. -- but I don't have the time (or the free space on my laptop) to fix it.
-
I am currently working on new dumps for Wikipedia (check out http://www.crispy-cow.de/wikimedia/). I hope Rafal ("rafm") and I can work together on improving things.
-
Math symbols probably won't work if they are shown in Wikipedia as images.
They were there with the last version; maybe it's changed.
-
I have a version that is (somewhat) working. At least it doesn't go into an infinite loop. ...
Could you put your version of the script into the CVS of the bedic SF project?
Thanks.
-
I am currently working on new dumps for Wikipedia (check out http://www.crispy-cow.de/wikimedia/). Hope Rafal ("rafm") and I can work together on improving things.
I have been looking at the quality of some of the dumps in bedic format, have seen some of the \n and {} type artifacts, blank articles, etc., and have always felt a little powerless to help, given that I really don't have the prerequisite Perl skills necessary to intimately understand the scripts. I am thinking about producing something in C++ capable of doing this, with extensible markup-translation parsers, for this project. Initially I have written an extending ring-buffer module capable of reading articles into memory for processing, using the minimum amount of RAM while accommodating some of the larger articles.
I have tested this ring-buffer technique by letting it read the current dumps, which include archived articles approximately 3 MB in size (giving ~1500 x 512-byte buffers in the ring).
My next step is to work on the markup translation, and therefore I am going to need a complete understanding of the markup used in the Wikipedia articles (this should be fairly easy - I expect it uses the documented wiki tags directly) and of the zbedic markup tags.
I know that the libbedic/doc directory describes the database format and the tags used in the markup, but I wanted first of all to check whether this is up to date or whether I should be pulling the markup renderer in zbedic apart to determine new tags... or if anyone has a more up-to-date list of markup tags, could they possibly share it please?
- Andy
-
There are new wikis in German and English made by Christian Geyer on his
homepage: http://www.crispy-cow.de/wikimedia/
Uli
-
It doesn't look like he's got the English Wikipedia up there yet. Just the German one.
-
It doesn't look like he's got the English Wikipedia up there yet. Just the German one.
I'm currently working on this. I will announce it when I am finished.
-
I know that the libbedic/doc directory describes the database format and the tags used in the markup, but I wanted first of all to check whether this is up to date... or if anyone has a more up-to-date list of markup tags, could they possibly share it please?
libbedic/doc/bedic-format.txt has an up-to-date specification of all the tags (everything that is available in zbedic 0.9.5). However, I am currently working on an extended set of tags. Send me your email address as a PM, so I can send you a draft of the new specification.
Additionally, zbedic can display most HTML tags (everything QTextBrowser can show), so converting Wikipedia articles to zbedic format should not be that difficult.
To consolidate the work on Wikipedia -> zbedic converters, I would strongly suggest developing them under the CVS of the bedic SF project (or perhaps as a new SF project). There seem to be quite a few people interested in developing such software, but there is not much coordination and the sources are not published, so most of the effort is unfortunately lost. If anyone is interested in developing such a converter under the bedic CVS (sources should be under the GPL), please send me a PM.