Author Topic: Building An Epwing Wikipedia  (Read 13778 times)

spartan

« Reply #15 on: April 12, 2006, 06:32:38 pm »
Yes, an EPWING version is very doable. If I use Mr. Löffler's markup-parser scripts, the issue will be the Perl code that builds the actual EPWING dictionary. It would not be difficult to transform the 4.8 GB English Wikipedia XML into a document in this markup.

I should mention that I have built a .NET 1.1-compatible library for manipulating EPWING files, based on FreePWING (it will only run on Windows because of how the Perl interpreter is packaged).
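
For readers following along, here is a minimal sketch of the XML-reading half of such a conversion, written in Python rather than the Perl/.NET code described in this post. It streams (title, text) pairs out of a MediaWiki export so that a multi-gigabyte dump never has to fit in memory. The names `iter_articles` and `local_name`, the sample document, and its namespace URL are illustrative assumptions; writing the parser's input markup is left out because that format is not described in this thread.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's subtree


# Made-up two-article sample; the namespace URL is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Zaurus</title>
    <revision><text>The Sharp Zaurus is a PDA.</text></revision>
  </page>
  <page>
    <title>EPWING</title>
    <revision><text>EPWING is a dictionary format.</text></revision>
  </page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

For a real dump, `iter_articles` can be handed an open file object instead of the in-memory sample.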
« Last Edit: April 12, 2006, 07:18:59 pm by spartan »
C3000 with Tetsu v18d Special Kernel and Sharp 1.11JP ROM
1GB Lexmark SD, 2GB Mini SD, Socket Revision H Bluetooth, Ambicom Wi-Fi

spartan

« Reply #16 on: April 12, 2006, 08:27:23 pm »
Here is the problem:

I've made the "eword", "head", "text", "textref", "texttag", and "word" files. What do I do to turn them into a "honmon" and "catalogs"? The Google translation of the FreePWING documentation isn't much good.

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing.html)

I'm under the impression that I use "fpwmake" with a specially crafted Makefile to produce a "honmon" and "catalogs".

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing-02.html#Makefile)

I add "include fpwutils.mk" into the Makefile and perform...
Code: [Select]
% perl /usr/local/libexec/freepwing/fpwsort
% perl /usr/local/libexec/freepwing/fpwindex
% perl /usr/local/libexec/freepwing/fpwcontrol
% perl /usr/local/libexec/freepwing/fpwlink
...in the directory with the "eword" et cetera files.

I end up with...
ctrl      ctrlref   eidx0     eidxref0
esort     eword     head      idx0
idxref0   sort      text      textref
texttag   word
...but running "fpwmake" yields...
Code: [Select]
test -d work || /usr/local/libexec/freepwing/mkdirhier work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwhalfchar \
   -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwfullchar \
   -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwparser \
   -workdir work
Can't open perl script "/usr/local/libexec/freepwing/fpwparser": No such file or directory
make: *** [work/parse.dep] Error 2
Since I used Mr. Löffler's markup parser, I don't think I need to run fpwparser.
I'm running this inside Cygwin and did a normal "./configure && make && make install" of the FreePWING utilities. Is there something I'm missing?

icruise

« Reply #17 on: April 13, 2006, 03:45:14 pm »
I read Japanese, but unfortunately my knowledge of this kind of thing is pretty limited. Is there a specific sentence or sentences in the Google translation that you would like translated into real English?

spartan

« Reply #18 on: April 13, 2006, 11:33:20 pm »
Problem solved: a Makefile for a Loeffler-markup processed EPWING dictionary should read...
Code: [Select]
FPWPARSER = null.pl

include fpwutils.mk
...where null.pl is an empty file.

Then, create a catalogs.txt with the following...
Code: [Select]
[Catalog]
FileName   = catalogs
Type       = EPWING1
Books      = 1

[Book]
Title      = "Wikipedia-English"
BookType   = 6001
Directory  = "WIKI"
...replacing the title and directory as you see fit. The title must be EUC-JP encoded, so the above text would produce an error. Leaving the title field empty seems to work fine.
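
As a convenience, the three files described in this post could be generated with a short script. This is only a sketch under stated assumptions: the function name `write_build_files` and the build directory name are made up, and writing an empty title as `Title = ""` is a guess at what "leaving the title empty" looks like in the file. The Makefile and catalogs.txt contents are copied from the post.

```python
import tempfile
from pathlib import Path

# Makefile from the post: point FPWPARSER at an empty script, then pull in
# FreePWING's standard rules.
MAKEFILE = "FPWPARSER = null.pl\n\ninclude fpwutils.mk\n"

# catalogs.txt from the post, with title and directory left as parameters.
CATALOGS_TEMPLATE = """[Catalog]
FileName   = catalogs
Type       = EPWING1
Books      = 1

[Book]
Title      = "{title}"
BookType   = 6001
Directory  = "{directory}"
"""


def write_build_files(build_dir, directory="WIKI", title=""):
    """Create null.pl, Makefile and catalogs.txt in build_dir (hypothetical helper)."""
    build_dir = Path(build_dir)
    build_dir.mkdir(parents=True, exist_ok=True)
    (build_dir / "null.pl").write_text("")  # intentionally empty
    (build_dir / "Makefile").write_text(MAKEFILE)
    catalogs = CATALOGS_TEMPLATE.format(title=title, directory=directory)
    # The post says the title must be EUC-JP encoded, so write the file that way.
    (build_dir / "catalogs.txt").write_bytes(catalogs.encode("euc_jp"))
    return build_dir


with tempfile.TemporaryDirectory() as tmp:
    out = write_build_files(Path(tmp) / "wiki-build")
    created = sorted(p.name for p in out.iterdir())
    null_size = (out / "null.pl").stat().st_size
```

Running "fpwmake" in that directory would still require the "eword", "head", "text" and related files produced earlier in the thread.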
« Last Edit: April 13, 2006, 11:35:32 pm by spartan »

icruise

« Reply #19 on: April 14, 2006, 04:19:34 am »
That's good news. Does that mean that you're close to success? How big do you think the resulting files will be? Obviously, it would be preferable if it were under 4 GB, so it could fit on the microdrive of the older Zaurus models, or on a 4 GB SD card. I think most people want to avoid using the CF card slot for memory.

By the way, do you know if this same process can be done for the Japanese language version of the Wikipedia?

spartan

« Reply #20 on: April 14, 2006, 10:52:40 am »
I could make an ugly Wikipedia now, but it would be better to have a nicely formatted Wikipedia. I've confirmed with a test dictionary that the EPWING system works.

This process will work for the Japanese, and for that matter any, Wikipedia. It should work with all the other Wikis with the code I have now and could be easily extended to support any XML document.

For a size estimate, the bz2-compressed text-only English Wikipedia is about 1 GB.

icruise

« Reply #21 on: April 15, 2006, 09:26:06 am »
How long do you think it'll take to come up with a well-formatted version (and what exactly does "nicely formatted" mean)? Will there be hyperlinks within the text, or will those be left out? Also, given the size of the files, am I right in assuming that this will be a "do it yourself" kind of project (i.e., you write the scripts, and the people who want it download the raw Wikipedia data and then run them to create the EPWING dictionary version)? If so, will it require a Linux computer? I only have access to Windows and Mac boxes.

spartan

« Reply #22 on: April 15, 2006, 12:50:08 pm »
I have already generated an unformatted Wikipedia (in English and, as a test, in Japanese), which uses only the FreePWING library's text encoder. The hyperlinks will not be active, since I can't follow the documentation on how to make inter-dictionary and internet hyperlinks. Everything else works.

The issue with distributing the program is that I packaged up Löffler's parser and the FreePWING libraries with a commercial program into a .NET library. I don't think this package can be legally distributed, because I don't have the source to the packaging of the Perl runtime inside this package. The programs I wrote require Windows, this library, and Cygwin (a Unix-like environment for Windows). I'm going to distribute the finished Wikipedia over BitTorrent. Eventually, somebody with bandwidth could host it.

Currently, I'm stuck on an encoding bug. The FreePWING parser wants ASCII text, but the Wikipedia dump is UTF-8-encoded Unicode. That means I can format all day, but the parser sees an accented character as two bytes, at least one of them invalid, instead of one character.
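
To illustrate the problem described above, here is a small Python sketch (not the code used for this project). It shows that an accented letter takes two bytes in UTF-8, and demonstrates one common workaround: decomposing the text and dropping everything that is not ASCII. The helper name `to_ascii` is made up for this example.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented letters, then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


word = "Löffler"
utf8_bytes = word.encode("utf-8")
# The word has 7 characters, but "ö" occupies two bytes in UTF-8,
# so a byte-oriented ASCII parser is handed 8 units.
char_count = len(word)
byte_count = len(utf8_bytes)
ascii_word = to_ascii(word)
```

This approach is lossy. Letters with no decomposition (for example "ø" or "ß") disappear entirely, and so would Japanese text, so it only suits the English edition.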

spartan

« Reply #23 on: April 19, 2006, 08:56:53 pm »
Unfortunately, the FreePWING Perl library breaks at about 250 MB worth of articles. I'll have a version reworked for the simplified bedic format and I'll try it with Xerox.

icruise

« Reply #24 on: April 20, 2006, 07:37:36 am »
Is it possible to break up the Wikipedia by letter to make each file smaller?
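
A rough sketch of what splitting by letter could look like, in Python. It assumes the articles are already available as (title, text) pairs; the function names and sample titles are made up for illustration.

```python
from collections import defaultdict


def bucket_key(title):
    """Return 'A'..'Z' for titles starting with an ASCII letter, else '#'."""
    first = title[:1]
    return first.upper() if first.isascii() and first.isalpha() else "#"


def split_by_letter(articles):
    """Group (title, text) pairs into one list per initial letter."""
    buckets = defaultdict(list)
    for title, text in articles:
        buckets[bucket_key(title)].append((title, text))
    return dict(buckets)


sample = [
    ("Zaurus", "..."),
    ("EPWING", "..."),
    ("zbedic", "..."),
    ("4GB limit", "..."),
]
buckets = split_by_letter(sample)
```

Letters are unevenly populated, so a single bucket could still be too large on its own.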

rafm

« Reply #25 on: April 21, 2006, 02:19:55 am »
Quote
Thanks. I tried reading the source for Xerox to figure out what it does with a file in 'simplified bedic' format. Unfortunately, I don't really understand C.

I'm trying to write a program that will transform the new Wikipedia XML files into bedic and EPWING dictionaries (preferably EPWING). Since the Wikipedia-to-simplified-bedic conversion produces a file too large for Xerox to handle, I'm just building a C#/VB.NET program to let people build an updated Wikipedia for themselves whenever they please.

It is a bad idea to duplicate the work of xerox or mkbedic. If mkbedic fails with your file in the simplified zbedic format, you can put the file somewhere (FTP/HTTP) so I can download it and check what's wrong.
SL-C1000 w/ Cacko ROM 1.23

icruise

« Reply #26 on: May 02, 2006, 03:15:50 am »
So have you given up on the idea of the EPWING Wikipedia? Even if it has to be split up into a bunch of different subdictionaries, I'm not sure it would make much difference in terms of usability, since programs like Zten can search multiple dictionaries at once.
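
One way to form such subdictionaries is to cap each one by size instead of by letter. The Python sketch below groups (title, text) pairs into batches that stay under a byte budget. The function name is made up, and the measure used here (UTF-8 length of the source text) is only a stand-in, since the thread does not say what exactly FreePWING counts toward its roughly 250 MB limit.

```python
def batch_by_size(articles, max_bytes):
    """Group (title, text) pairs into batches of roughly max_bytes each.

    A single article larger than max_bytes still gets a batch of its own.
    """
    batch, used = [], 0
    for title, text in articles:
        size = len(title.encode("utf-8")) + len(text.encode("utf-8"))
        if batch and used + size > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append((title, text))
        used += size
    if batch:
        yield batch


# Three 10-byte articles with a 25-byte budget: two fit, the third starts a new batch.
sample = [("A", "x" * 9), ("B", "x" * 9), ("C", "x" * 9)]
batches = list(batch_by_size(sample, max_bytes=25))
batch_lengths = [len(b) for b in batches]
```

Each batch would then be built as its own dictionary and searched together in a viewer that supports multiple dictionaries.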

spartan

« Reply #27 on: May 07, 2006, 11:33:26 pm »
Sorry about the belated response, icruise; that is a great idea. The encoding problem was solved, which means there will be no accented characters in the dictionary. I'll have it finished even sooner. I was already refactoring it to work with bedic, so I'll make a Wikipedia in both formats.
« Last Edit: May 07, 2006, 11:35:40 pm by spartan »

icruise

« Reply #28 on: May 25, 2006, 06:34:59 am »
Any news about this?

rolf

« Reply #29 on: July 23, 2006, 05:24:48 pm »
I'd be interested to hear about this as well.  

If other people are interested in joining forces to make this happen, come on over to http://gakusei.sf.net (I want the Japanese Wikipedia as a "Kojien replacement"). I have not yet decided which format is best. Plucker seems to be nice as well. The creator (?) of Plucker seems to have been able to create a very nice Plucker ebook out of Wikipedia, but he has not responded to my mail, so I am afraid the project might be dead.