Full Version: Building An Epwing Wikipedia
spartan
Does anyone have an accurate definition of the EPWING format that is compatible with the C3000's ZDict?

Or, does anyone have a detailed description of what the bedic project's xerox application does and how it does it?
icruise
What do you mean by a "definition"? It's a format used by Japanese CD-ROM dictionaries.
spartan
I mean specification; a description of the file format.
matthis
EPWING works very well with Zten, which is a nice app for the Zaurus. I use it with Kojien without problems.
icruise
QUOTE(spartan @ Apr 9 2006, 09:09 PM)
I mean specification; a description of the file format.
*


I'm still not sure if I follow you. Some dictionary programs read text files that are written in a particular format (such as "word // definition") but that's not what EPWING is, just so we're clear. It's basically only used for CD-ROMs that are commercially sold in Japan, although there are a few programs that can convert (with varying degrees of success) between formats like System Soft and EPWING. But in any case, it's not the kind of thing that would allow you to easily make your own dictionary files. What is it that you want to do?
kurochka
QUOTE(spartan @ Apr 9 2006, 02:53 PM)
Or, does anyone have a detailed description of what the bedic project's xerox application does and how it does it?
*


A search would have turned up these links, which give a detailed description of the bedic format and the nuts and bolts of making bedic dictionaries:

http://www.oesf.org/forums/index.php?showtopic=16160&st=0
http://cvs.sourceforge.net/viewcvs.py/*che...mat.txt?rev=1.5
http://cvs.sourceforge.net/viewcvs.py/bedic/libbedic/doc/
http://bedic.sourceforge.net/index.html
spartan
QUOTE(kurochka @ Apr 10 2006, 04:59 PM)
QUOTE(spartan @ Apr 9 2006, 02:53 PM)
Or, does anyone have a detailed description of what the bedic project's xerox application does and how it does it?
*


A search would have turned up these links, which give a detailed description of the bedic format and the nuts and bolts of making bedic dictionaries:

http://www.oesf.org/forums/index.php?showtopic=16160&st=0
http://cvs.sourceforge.net/viewcvs.py/*che...mat.txt?rev=1.5
http://cvs.sourceforge.net/viewcvs.py/bedic/libbedic/doc/
http://bedic.sourceforge.net/index.html
*



Thanks. I tried reading the source for Xerox to figure out what it does with a file in 'simplified bedic' format. Unfortunately, I don't really understand C.

I'm trying to write a program that will transform the new Wikipedia XML files into bedic and EPWING dictionaries (preferably EPWING). Since the Wikipedia-to-simplified-bedic conversion produces a file too large for Xerox to handle, I'm just building a C#/vb.net program to let people build an updated Wikipedia for themselves whenever they please. In order to do that, I need to know how Xerox constructs the index and calculates the remaining fields.
It would be better to use the EPWING format for the Zaurus considering that I can put pictures and hyperlinks into it. I couldn't find anything under the libeb project that actually documents the construction of an EPWING file, so I'm wondering if anyone knows where I can find the specifications of the format.

Thanks again
icruise
First of all, let me say that I would be very, very interested if you could get the Wikipedia converted to EPWING format. But my hunch is that the specification is only available to commercial dictionary makers.
matthis
FreePWING lets you make EPWING-compatible files.
rokugo
QUOTE(spartan @ Apr 10 2006, 06:53 AM)
Does anyone have an accurate definition of the EPWING format that is compatible with the C3000's ZDict?
*


You could have gotten your answer by googling for "epwing format". It's in the very first hit (the creator of that website, Hannes Löffler, happens to be a member of this forum).
icruise
QUOTE(rokugo @ Apr 11 2006, 01:51 AM)
QUOTE(spartan @ Apr 10 2006, 06:53 AM)
Does anyone have an accurate definition of the EPWING format that is compatible with the C3000's ZDict?
*


You could have gotten your answer by googling for "epwing format". It's in the very first hit (the creator of that website, Hannes Löffler, happens to be a member of this forum).
*


Maybe things have changed since you performed your search, but it's not in the first hit now, or even on the first page. But here it is in any case:

http://www.hloeffler.info/epwing/

Does this mean that making an EPWING version of the Wikipedia is doable? I would love something like that. Right now I have an older version of the Wikipedia for ZBEDic, but I don't like that program very much and all of my other dictionaries are in EPWING format. I also use an EPWING-compatible dictionary program on my Mac in my work as a translator, and being able to add Wikipedia to that would be a great resource. Fingers crossed...
rokugo
QUOTE(icruise @ Apr 11 2006, 05:17 PM)
Maybe things have changed since you performed your search, but it's not in the first hit now, or even on the first page. But here it is in any case:

http://www.hloeffler.info/epwing/


What are you talking about? It's right there in the first hit of the first page:

http://www.google.com/search?client=opera&...=utf-8&oe=utf-8

How much simpler can it get?
icruise
Well...

First I tried both "epwing format" (with quotes) and epwing format (without quotes) on google.co.jp (which is the site I usually search from). Neither way results in his site being first. I then tried google.com and typed in "epwing format" with the quotes (as you wrote in your message above). This also doesn't result in the site being first. Finally, I found that going to google.com and typing epwing format without quotes does make it the first result. The point is that deriding someone for not noticing the first search result is not such a good idea, because the search results vary widely depending on the portal you use and whether you use quotes or not.
rokugo
For you to have implied that my info was wrong just because you couldn't get the same results isn't such a good idea either. Who else uses google.co.jp here? And you don't need to do phrase searching (i.e. using quotation marks) unless the search results get too general.
Anyway I think my reply to the OP was clear and simple enough so I'll not belabour the point.
icruise
QUOTE(rokugo @ Apr 12 2006, 01:49 AM)
For you to have implied that my info was wrong just because you couldn't get the same results isn't such a good idea either.
*

At the time, I thought it was wrong, because I didn't realize that using a different portal would give different results. But please, let's get back to the topic at hand.

Does an EPWING version of the Wikipedia look doable?
spartan
Yes, an EPWING version is very doable. If I use Mr. Löffler's markup-parser scripts, the issue will be the Perl code that builds the actual EPWING dictionary. It would not be difficult to transform the 4.8 GB English Wikipedia XML into a document in this markup.

I should mention that I have built a .NET 1.1-compatible library for manipulating EPWING files based on FreePWING (it will only run on Windows because of how the Perl interpreter is packaged).
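To illustrate the kind of pass over the XML dump mentioned above, here is a minimal Python sketch that streams pages one at a time instead of loading the whole file. It is not the C#/VB.NET program from this thread. The function names, the sample data, and the namespace URL in the sample are assumptions made for illustration; the code strips whatever namespace the dump declares.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that a namespaced dump puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wikitext) pairs one page at a time from a dump file object."""
    title, text = "", ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the finished page's children to keep memory use low


# Tiny stand-in for the real dump (hypothetical content and namespace URL).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Zaurus</title><revision><text>A Sharp PDA.</text></revision></page>
  <page><title>EPWING</title><revision><text>A dictionary format.</text></revision></page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
for title, text in articles:
    print(title, "->", text)
```

For a real dump you would pass an open file (or a bz2 stream) instead of the in-memory sample, and write each article out in the target markup as it is yielded.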
spartan
Here is the problem:

I've made the "eword", "head", "text", "textref", "texttag", and "word" files. What do I do to turn them into a "honmon" and "catalogs"? The Google translation of the FreePWING documentation isn't much good.

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing.html)

I'm under the impression that I use "fpwmake" with a specially crafted Makefile to produce a "honmon" and "catalogs".

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing-02.html#Makefile)

I add "include fpwutils.mk" into the Makefile and perform...
CODE
% perl /usr/local/libexec/freepwing/fpwsort
% perl /usr/local/libexec/freepwing/fpwindex
% perl /usr/local/libexec/freepwing/fpwcontrol
% perl /usr/local/libexec/freepwing/fpwlink

...in the directory with the "eword" et cetera files.

I end up with...
ctrl      eword    text
ctrlref   head     textref
eidx0     idx0     texttag
eidxref0  idxref0  word
esort     sort
...but running "fpwmake" yields...
CODE
test -d work || /usr/local/libexec/freepwing/mkdirhier work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwhalfchar \
  -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwfullchar \
  -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwparser \
  -workdir work
Can't open perl script "/usr/local/libexec/freepwing/fpwparser": No such file or directory
make: *** [work/parse.dep] Error 2

Since I used Mr. Löffler's markup parser, I don't think I need to run fpwparser.
I'm running this inside of Cygwin and performed a normal "./configure && make && make install" procedure on the FreePWING utilities. Is there something I'm missing?
icruise
I read Japanese, but unfortunately my knowledge of this kind of thing is pretty limited. :( Is there a specific sentence or sentences in the Google translation that you would like translated into real English?
spartan
Problem solved: a Makefile for a Loeffler-markup processed EPWING dictionary should read...
CODE
FPWPARSER = null.pl

include fpwutils.mk

...where null.pl is an empty file.

Then, create a catalogs.txt with the following...
CODE
[Catalog]
FileName   = catalogs
Type       = EPWING1
Books      = 1

[Book]
Title      = "Wikipedia-English"
BookType   = 6001
Directory  = "WIKI"

...replacing the title and directory as you see fit. The title must be EUC-JP encoded, so the above text would produce an error. Leaving the title empty seems to work fine.
icruise
That's good news. Does that mean that you're close to success? How big do you think the resulting files will be? Obviously, it would be preferable if it were under 4GB, so it could fit on the microdrive of the older Zaurus models, or on a 4GB SD card. I think most people want to avoid using the CF card slot for memory.

By the way, do you know if this same process can be done for the Japanese language version of the Wikipedia?
spartan
QUOTE(icruise @ Apr 14 2006, 08:19 AM)
That's good news. Does that mean that you're close to success? How big do you think the resulting files will be? Obviously, it would be preferable if it were under 4GB, so it could fit on the microdrive of the older Zaurus models, or on a 4GB SD card. I think most people want to avoid using the CF card slot for memory.

By the way, do you know if this same process can be done for the Japanese language version of the Wikipedia?
*


I could make an ugly Wikipedia now, but it would be better to have a nicely formatted Wikipedia. I've confirmed with a test dictionary that the EPWING system works.

This process will work for the Japanese, and for that matter any, Wikipedia. It should work with all the other Wikis with the code I have now and could be easily extended to support any XML document.

For a size estimate, the bz2-compressed text-only English Wikipedia is about 1 GB.
icruise
How long do you think it'll take to come up with a well-formatted version (and what exactly does nicely formatted mean)? Will there be hyperlinks within the text, or will that be left out? Also, given the size of the files, am I right in assuming that this will be a "do it yourself" kind of project (i.e. you write the scripts and the people who want it download the raw Wikipedia data and then run them to create the EPWING dictionary version)? If so, will it require a Linux computer? I only have access to Windows and Mac boxes.
spartan
QUOTE(icruise @ Apr 15 2006, 01:26 PM)
How long do you think it'll take to come up with a well-formatted version (and what exactly does nicely formatted mean)? Will there be hyperlinks within the text, or will that be left out? Also, given the size of the files, am I right in assuming that this will be a "do it yourself" kind of project (i.e. you write the scripts and the people who want it download the raw Wikipedia data and then run them to create the EPWING dictionary version)? If so, will it require a Linux computer? I only have access to Windows and Mac boxes.
*


I have already generated an unformatted Wikipedia (in English and, for a test, in Japanese), which uses only the FreePWING library's text encoder. The hyperlinks will not be active, since I can't understand the documentation on how to make inter-dictionary and internet hyperlinks. Everything else works.

The issue with distributing the program is that I packaged up Loeffler's parser and the FreePWING libraries with a commercial program into a .NET library. I don't think this package can be legally distributed, because I don't have the source to the packaging of the Perl runtime inside this package. The programs I wrote require Windows, this library, and Cygwin (a Unix-like environment on Windows). I'm going to BitTorrent the Wikipedia. Eventually, somebody with bandwidth could host it.

Currently, I'm stuck on an encoding bug. The FreePWING parser wants ASCII text, but the Wikipedia dump is Unicode text encoded as UTF-8. That means I can format all day, but an accented character is seen by the parser as two characters, one of them invalid, instead of one character.
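The symptom described above can be reproduced in a few lines of Python. This is only a sketch: it shows why one accented character turns into two bytes in UTF-8, and one possible workaround (folding accents down to plain ASCII). The helper name fold_to_ascii is hypothetical, and this is not necessarily the fix that was used later in this thread.

```python
import unicodedata


def fold_to_ascii(text):
    """Drop accents: decompose characters, then discard anything non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


word = "L\u00f6ffler"  # "Löffler", written with an escape so the source stays ASCII
utf8_bytes = word.encode("utf-8")

# U+00F6 becomes the two bytes C3 B6 in UTF-8, so a parser that treats every
# byte as one character sees two "characters" where the text has one.
char_count = len(word)
byte_count = len(utf8_bytes)
folded = fold_to_ascii(word)

print(char_count, byte_count, folded)
```

Folding loses information (characters with no ASCII base letter are dropped entirely), so it only suits text where plain-ASCII output is acceptable.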
spartan
Unfortunately, the FreePWING Perl library breaks at about 250 MB worth of articles. I'll have a version reworked for the simplified bedic format and I'll try it with Xerox.
icruise
QUOTE(spartan @ Apr 19 2006, 07:56 PM)
Unfortunately, the FreePWING Perl library breaks at about 250 MB worth of articles. I'll have a version reworked for the simplified bedic format and I'll try it with Xerox.
*

Is it possible to break up the Wikipedia by letter to make each file smaller?
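A rough Python sketch of that idea, assuming the articles are available as (title, text) pairs: group them by the first letter of the title so that each group can be built as its own, smaller dictionary. The function names and the "other" bucket for titles that do not start with A-Z are assumptions for illustration, not part of FreePWING.

```python
from collections import defaultdict


def bucket_key(title):
    """Return 'A'..'Z' for titles starting with an ASCII letter, else 'other'."""
    first = title.strip()[:1]
    if first.isascii() and first.isalpha():
        return first.upper()
    return "other"


def split_by_letter(articles):
    """Group (title, text) pairs into one list per starting letter."""
    buckets = defaultdict(list)
    for title, text in articles:
        buckets[bucket_key(title)].append((title, text))
    return dict(buckets)


# Hypothetical sample data standing in for real articles.
sample = [
    ("Zaurus", "..."),
    ("EPWING", "..."),
    ("epwing format", "..."),
    ("2006", "..."),
]
buckets = split_by_letter(sample)
for key in sorted(buckets):
    print(key, [title for title, _ in buckets[key]])
```

With the full dump you would not keep the buckets in memory; each article would be appended to a per-letter file as it is read. Whether 26 or so pieces each stay under the size at which the library breaks is something only a test run can tell.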
rafm
QUOTE(spartan @ Apr 10 2006, 09:38 PM)
Thanks-I tried reading the source for Xerox to figure out what it does with a file in 'simplified bedic' format. Unfortunately, I don't really understand C.

I'm trying to write a program that will transform the new Wikipedia XML files into bedic and EPWING dictionaries (preferably EPWING). Since the Wikipedia-to-simplified-bedic conversion produces a file too large for Xerox to handle, I'm just building a C#/vb.net program to let people build an updated Wikipedia for themselves whenever they please.
*


It is a bad idea to duplicate the work of xerox or mkbedic. If mkbedic fails with your file in the simplified zbedic format, you can put this file somewhere (FTP/HTTP) so I can download it and check what's wrong.
icruise
So have you given up on the idea of the EPWING wikipedia? Even if it has to be split up into a bunch of different subdictionaries, I'm not sure it would make much difference in terms of usability, since programs like Zten can search multiple dictionaries at once.
spartan
Sorry about the belated response icruise; that is a great idea. The encoding problem was solved, which means there will be no accented characters in the dictionary. I'll have it finished even sooner. I was already refactoring it to work with bedic, so I'll make a Wikipedia in both formats.
icruise
Any news about this?
rolf
I'd be interested to hear about this as well.

If other people are interested in joining forces to make this happen, come on over to http://gakusei.sf.net (I want the Japanese Wikipedia as a "kojien-replacement"). I have not yet decided which format is best. Plucker seems to be nice as well. The creator (?) of Plucker seems to have been able to create a very nice Plucker ebook out of Wikipedia, but he has not responded to my mail, so I am afraid the project might be dead.