> Building An Epwing Wikipedia, (Epwing Format)
spartan
post Apr 12 2006, 02:32 PM
Post #16





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



Yes, an EPWING version is very doable. If I use Mr. Löffler's markup-parser scripts, the issue will be the Perl code that builds the actual EPWING dictionary. It would not be difficult to transform the 4.8 GB English Wikipedia XML into a document in this markup.

I should mention that I have built a .NET 1.1-compatible library for manipulating EPWING files based on FreePWING (it will only run on Windows because of how the Perl interpreter is packaged).
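As a rough illustration of the XML-to-markup step mentioned above, the following Python sketch streams a MediaWiki-style export and yields (title, text) pairs without loading the whole dump into memory. The element names `page`, `title`, and `text` follow the usual Wikipedia export layout; the function name `iter_articles` and the sample data are assumptions made for this example, and this is not the converter described in the post.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Drop the finished page's children so a multi-GB dump
            # does not accumulate in memory.
            elem.clear()


# Tiny stand-in for the real dump (hypothetical sample data).
SAMPLE = b"""<mediawiki>
  <page><title>Zaurus</title><revision><text>A PDA line.</text></revision></page>
  <page><title>EPWING</title><revision><text>A dictionary format.</text></revision></page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

Each yielded pair could then be written out in whatever markup the downstream parser expects; that output step is left out here because its format is not spelled out in the thread.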
spartan
post Apr 12 2006, 04:27 PM
Post #17





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



Here is the problem:

I've made the "eword", "head", "text", "textref", "texttag", and "word" files. What do I do to turn them into a "honmon" and "catalogs"? The Google translation of the FreePWING documentation isn't much good.

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing.html)

I'm under the impression that I use "fpwmake" with a specially crafted Makefile to produce a "honmon" and "catalogs".

(http://www.sra.co.jp/people/m-kasahr/freepwing/doc/freepwing-02.html#Makefile)

I add "include fpwutils.mk" into the Makefile and perform...
CODE
% perl /usr/local/libexec/freepwing/fpwsort
% perl /usr/local/libexec/freepwing/fpwindex
% perl /usr/local/libexec/freepwing/fpwcontrol
% perl /usr/local/libexec/freepwing/fpwlink

...in the directory with the "eword" et cetera files.

I end up with...
ctrl, ctrlref, eidx0, eidxref0, esort, eword, head, idx0, idxref0, sort, text, textref, texttag, word
...but running "fpwmake" yields...
CODE
test -d work || /usr/local/libexec/freepwing/mkdirhier work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwhalfchar \
  -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwfullchar \
  -workdir work
/usr/local/libexec/freepwing/perl.sh   /usr/local/libexec/freepwing/fpwparser \
  -workdir work
Can't open perl script "/usr/local/libexec/freepwing/fpwparser": No such file or directory
make: *** [work/parse.dep] Error 2

Since I used Mr. Löffler's markup parser, I don't think I need to run fpwparser.
I'm running this inside of Cygwin and performed a normal "./configure && make && make install" procedure on the FreePWING utilities. Is there something I'm missing?
icruise
post Apr 13 2006, 11:45 AM
Post #18





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



I read Japanese, but unfortunately my knowledge of this kind of thing is pretty limited. Is there a specific sentence or sentences in the Google translation that you would like translated into real English?
spartan
post Apr 13 2006, 07:33 PM
Post #19





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



Problem solved: a Makefile for a Loeffler-markup processed EPWING dictionary should read...
CODE
FPWPARSER = null.pl

include fpwutils.mk

...where null.pl is an empty file.

Then, create a catalogs.txt with the following...
CODE
[Catalog]
FileName   = catalogs
Type       = EPWING1
Books      = 1

[Book]
Title      = "Wikipedia-English"
BookType   = 6001
Directory  = "WIKI"

...replacing the title and directory as you see fit. The title must be EUC-JP encoded, so the above text would produce an error. Leaving the title empty seems to work fine.
icruise
post Apr 14 2006, 12:19 AM
Post #20





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



That's good news. Does that mean that you're close to success? How big do you think the resulting files will be? Obviously, it would be preferable if it were under 4 GB, so it could fit on the microdrive of the older Zaurus models, or on a 4 GB SD card. I think most people want to avoid using the CF card slot for memory.

By the way, do you know if this same process can be done for the Japanese language version of the Wikipedia?
spartan
post Apr 14 2006, 06:52 AM
Post #21





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



QUOTE(icruise @ Apr 14 2006, 08:19 AM)
That's good news. Does that mean that you're close to success? How big do you think the resulting files will be? Obviously, it would be preferable if it were under 4 GB, so it could fit on the microdrive of the older Zaurus models, or on a 4 GB SD card. I think most people want to avoid using the CF card slot for memory.

By the way, do you know if this same process can be done for the Japanese language version of the Wikipedia?
*


I could make an ugly Wikipedia now, but it would be better to have a nicely formatted Wikipedia. I've confirmed with a test dictionary that the EPWING system works.

This process will work for the Japanese, and for that matter any, Wikipedia. It should work with all the other Wikis with the code I have now and could be easily extended to support any XML document.

For a size estimate, the bz2-compressed text-only English Wikipedia is about 1 GB.
icruise
post Apr 15 2006, 05:26 AM
Post #22





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



How long do you think it'll take to come up with a well-formatted version (and what exactly does "nicely formatted" mean)? Will there be hyperlinks within the text, or will those be left out? Also, given the size of the files, am I right in assuming that this will be a "do it yourself" kind of project (i.e., you write the scripts, and the people who want it download the raw Wikipedia data and then run them to create the EPWING dictionary version)? If so, will it require a Linux computer? I only have access to Windows and Mac boxes.
spartan
post Apr 15 2006, 08:50 AM
Post #23





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



QUOTE(icruise @ Apr 15 2006, 01:26 PM)
How long do you think it'll take to come up with a well-formatted version (and what exactly does "nicely formatted" mean)? Will there be hyperlinks within the text, or will those be left out? Also, given the size of the files, am I right in assuming that this will be a "do it yourself" kind of project (i.e., you write the scripts, and the people who want it download the raw Wikipedia data and then run them to create the EPWING dictionary version)? If so, will it require a Linux computer? I only have access to Windows and Mac boxes.
*


I have already generated an unformatted Wikipedia (in English and, as a test, in Japanese), which uses only the FreePWING library's text encoder. The hyperlinks will not be active, since I can't follow the documentation that explains how to make inter-dictionary and internet hyperlinks. Everything else works.

The issue with distributing the program is that I packaged up Loeffler's parser and the FreePWING libraries with a commercial program into a .Net library. I don't think this package can be legally distributed, because I don't have the source to the packaging of the Perl runtime inside this package. The programs I wrote require Windows, this library, and Cygwin (Linux on Windows). I'm going to BitTorrent the Wikipedia. Eventually, somebody with bandwidth could host it.

Currently, I'm stuck on an encoding bug. The FreePWING parser wants ASCII text, but the Wikipedia dump is Unicode text encoded as UTF-8. That means I can format all day, but an accented character is seen by the parser as two characters, one of them invalid, instead of one character.
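To make the symptom concrete, the Python sketch below shows that a single accented character occupies two bytes in UTF-8, which is what a byte-oriented ASCII parser would trip over. It then shows one possible workaround, assuming that dropping the accents is acceptable: decompose each character and discard everything that is not plain ASCII. The helper name `to_ascii` is hypothetical, and this is only a sketch, not necessarily the fix used for this project.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented letters, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


word = "L\u00f6ffler"  # "Löffler" with a precomposed o-umlaut
utf8_bytes = word.encode("utf-8")

# The one character U+00F6 becomes two bytes in UTF-8, so the byte
# count is one larger than the character count.
print(len(word), len(utf8_bytes))

# After decomposition the umlaut is a separate combining mark,
# which the ASCII encoder silently drops.
print(to_ascii(word))
```

Note that characters with no ASCII base letter (Japanese text, for instance) are removed entirely by this approach, so it would only suit the English dump.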
spartan
post Apr 19 2006, 04:56 PM
Post #24





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



Unfortunately, the FreePWING Perl library breaks at about 250 MB worth of articles. I'll have a version reworked for the simplified bedic format and I'll try it with Xerox.
icruise
post Apr 20 2006, 03:37 AM
Post #25





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



QUOTE(spartan @ Apr 19 2006, 07:56 PM)
Unfortunately, the FreePWING Perl library breaks at about 250 MB worth of articles. I'll have a version reworked for the simplified bedic format and I'll try it with Xerox.
*

Is it possible to break up the Wikipedia by letter to make each file smaller?
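The splitting idea above could look roughly like the following Python sketch, which groups articles by the first letter of their title so that each group could be built as a separate, smaller dictionary. The function names and sample titles are made up for illustration, and whether per-letter pieces would stay under the size at which FreePWING breaks is not established in this thread.

```python
from collections import defaultdict


def bucket_key(title):
    """Return 'A'..'Z' for titles starting with an ASCII letter, else 'other'."""
    first = title[:1].upper()
    if len(first) == 1 and "A" <= first <= "Z":
        return first
    return "other"


def split_by_letter(articles):
    """Group (title, text) pairs into one bucket per initial letter."""
    buckets = defaultdict(list)
    for title, text in articles:
        buckets[bucket_key(title)].append((title, text))
    return dict(buckets)


# Hypothetical sample data standing in for the real article stream.
sample = [
    ("Zaurus", "..."),
    ("EPWING", "..."),
    ("zbedic", "..."),
    ("4GB card", "..."),
]
buckets = split_by_letter(sample)
```

For a dump of this size the buckets would need to be written to separate files as the articles stream past rather than held in memory as they are here.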
rafm
post Apr 20 2006, 10:19 PM
Post #26





Group: Members
Posts: 145
Joined: 13-November 04
Member No.: 5,449



QUOTE(spartan @ Apr 10 2006, 09:38 PM)
Thanks-I tried reading the source for Xerox to figure out what it does with a file in 'simplified bedic' format. Unfortunately, I don't really understand C.

I'm trying to write a program that will transform the new Wikipedia XML files into bedic and EPWING dictionaries (preferably EPWING). Since the Wikipedia-to-simplified-bedic conversion produces a file too large for Xerox to handle, I'm just building a C#/vb.net program to let people build an updated Wikipedia for themselves whenever they please.
*


It is a bad idea to duplicate the work of xerox or mkbedic. If mkbedic fails with your file in the simplified zbedic format, you can put the file somewhere (FTP/HTTP) so I can download it and check what's wrong.
icruise
post May 1 2006, 11:15 PM
Post #27





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



So have you given up on the idea of the EPWING Wikipedia? Even if it has to be split up into a bunch of different subdictionaries, I'm not sure it would make much difference in terms of usability, since programs like Zten can search multiple dictionaries at once.
spartan
post May 7 2006, 07:33 PM
Post #28





Group: Members
Posts: 82
Joined: 17-November 04
Member No.: 5,501



Sorry about the belated response, icruise; that is a great idea. The encoding problem was solved, which means there will be no accented characters in the dictionary. I'll have it finished even sooner. I was already refactoring it to work with bedic, so I'll make a Wikipedia in both formats.
icruise
post May 25 2006, 02:34 AM
Post #29





Group: Members
Posts: 292
Joined: 24-June 05
Member No.: 7,447



Any news about this?
rolf
post Jul 23 2006, 01:24 PM
Post #30





Group: Members
Posts: 108
Joined: 5-October 04
Member No.: 4,884



I'd be interested to hear about this as well.

If other people are interested in joining forces to make this happen, come on over to http://gakusei.sf.net (I want the Japanese Wikipedia as a "Kojien replacement"). I have not yet decided which format is best. Plucker seems to be nice as well. The creator (?) of Plucker seems to have been able to create a very nice Plucker ebook out of Wikipedia, but he has not responded to my mail, so I am afraid the project might be dead.