Author Topic: Creating My Own Dictionary Files (Read 5689 times)

burao · « **on:** December 11, 2004, 02:55:01 am »

Hello All,
Over the past year or so, I have been creating my own Japanese/English dictionary entries using portabase. I have also loaded assorted zten and zbedic dictionaries.

What I would like, that I have not been able to accomplish so far, is to be able to create my own dictionary files for either zten or zbedic. I have seen Perl scripts that are supposed to work for freepwing format, but I can't figure out how to use them.
(The instructions are all in Japanese, and I am also surrounded by Windows XP machines.)

Another option, possibly even more interesting, would be a new interface to portabase.
I think it would be great to mimic the interface of zten or zbedic on a set of portabase files. (In other words, have a search box that displays matches from a portabase file and allows you to change to a different portabase file quickly.)

Finally, I am really interested in having wikipedia info on my machine as well, but I would love to be able to edit/pare down the size of the data if there is a way to do so.

Thanks,
Burao

---------------------------------------------
Zaurus 860, 700, Sharp ROM with XQT (Waiting on new QTopia)

g333 · « **Reply #1 on:** December 14, 2004, 02:26:46 pm »

I'm interested in the same thing. If we work together maybe we can figure it out. Two heads are better than one (2 hito no zunou wa 1 hito ni maseru)

burao · « **Reply #2 on:** December 14, 2004, 08:28:08 pm »

I have spoken with Jeremy Bowman who created Portabase and he is interested in the idea of a search bar.

The problem is he needs more help.
If anyone has programming skill, translation ability, testing time, etc.,
maybe you can send him a message.
His email address and more info are on the portabase website http://portabase.sourceforge.net/contribute.html

I still want to figure out how to create files for zten and zbedic.
If possible I want to do it on my beloved Microsoft OSed machine.
Any help would be appreciated.

g333 · « **Reply #3 on:** December 14, 2004, 08:51:18 pm »

Rename .dz to .gz and unpack it with an unzip program (something like Aladdin Expander).

Then you get a really ugly text file.
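
If you would rather script this than rename files by hand, here is a minimal Python sketch. It relies on dictzip output being a valid gzip stream (dictzip is documented as gzip-compatible), so Python's standard gzip module should read it directly. The function name `read_dz` is made up for illustration.

```python
import gzip

def read_dz(path):
    """Return the uncompressed bytes of a dictzip (.dz) file."""
    # A .dz file is a gzip stream with some extra header fields,
    # so there is no need to rename it to .gz first.
    with gzip.open(path, "rb") as f:
        return f.read()
```

The result is the same "ugly text file" as a bytes object; write it to disk with `open(name, "wb")` if you want to look at it in an editor.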

I haven't really checked the file against this help text, but maybe it's the same format:

The bedic dictionary file has two sections -- header and entries. The
dictionary file ends with a '\000' character.

I. Header section

The header contains the dictionary properties. All data in the header is
encoded using UTF-8. The end of the header is marked with \0 character.

1. Properties

A property definition has the following format:

   name '=' value LF

where LF is the line feed character (\012). Both the name and the value can
contain NUL, LF and ESC characters, which are encoded as follows:

   \0000   \0033 \0060   (ESC '0')
   \0012   \0033 \0156   (ESC 'n')
   \0033   \0033 \0145   (ESC 'e')

where ESC is 033. The name cannot contain '=' (075) character.
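
The escaping table above can be sketched in Python like this. It is only an illustration of the three substitutions; the function names `escape` and `unescape` are mine, not something from libbedic.

```python
ESC = b"\x1b"  # octal 033

def escape(raw: bytes) -> bytes:
    """Encode NUL, LF and ESC as two-byte ESC sequences."""
    # ESC is replaced first, so the ESC bytes introduced by the
    # other two replacements are not escaped a second time.
    return (raw.replace(ESC, ESC + b"e")
               .replace(b"\x00", ESC + b"0")
               .replace(b"\n", ESC + b"n"))

def unescape(data: bytes) -> bytes:
    """Reverse escape(); unknown ESC sequences are left untouched."""
    table = {ord("0"): 0x00, ord("n"): 0x0A, ord("e"): 0x1B}
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x1B and i + 1 < len(data) and data[i + 1] in table:
            out.append(table[data[i + 1]])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Decoding walks the bytes left to right so that an escaped ESC followed by a literal '0' or 'n' is not misread as a second escape sequence.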

These are the currently defined properties:

   - id         (required)
    The name of the dictionary

   - max-entry-length   (default 8192)
    Maximum length of a database entry

   - max-word-length   (default 50)
    Maximum length of a word

   - index
    Dictionary index. The format of the index is described below

   - compression-method   (default none)
    Compression method. Allowed values are 'none' and 'shcm'.

   - shcm-tree      (required if compression-method is shcm)
    Data used by the shcm compression algorithm.

   - search-ignore-chars   (default '-.')
    The characters defined in this property are ignored when a search
    is performed. For example if the user typed 'b-all' the word
    'ball' will be found.

   - commentXX
    Comments. Currently used for license information

All other properties are ignored.
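
Putting section I.1 together, reading the header of an uncompressed file might look like the sketch below. It assumes the whole file is already in memory as bytes; `parse_header` and `DEFAULTS` are names I made up, and the default values are the ones listed above. Values are left as bytes, since I am not sure every property (shcm-tree for example) is readable text.

```python
import re

# Defaults taken from the property list in the help text.
DEFAULTS = {
    "max-entry-length": b"8192",
    "max-word-length": b"50",
    "compression-method": b"none",
    "search-ignore-chars": b"-.",
}

_UNESC = {b"0": b"\x00", b"n": b"\n", b"e": b"\x1b"}

def unescape(data):
    # ESC '0' -> NUL, ESC 'n' -> LF, ESC 'e' -> ESC
    return re.sub(rb"\x1b([0ne])", lambda m: _UNESC[m.group(1)], data)

def parse_header(file_bytes):
    """Return the header properties as {name: value-bytes}."""
    # The header ends at the first NUL byte; real NULs and LFs inside
    # names and values are escaped, so they cannot appear here.
    header, _, _ = file_bytes.partition(b"\x00")
    props = dict(DEFAULTS)
    for line in header.split(b"\n"):
        name, sep, value = line.partition(b"=")
        if not sep:
            continue  # blank line or no '=': not a property
        props[unescape(name).decode("utf-8")] = unescape(value)
    return props
```

For example, `parse_header(data)["id"]` gives the dictionary name, and any property that is missing from the file comes back with its default.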

2. Index property

If the index property exists, it speeds up dictionary lookup. The index
contains a collection of (word, offset) pairs. The words are sorted
(ignoring the character case and the characters defined in
search-ignore-chars). The offset is relative to the beginning of the entries
section of the file. The pairs are separated by a \000 character; the word
and the offset in a pair are separated by a \012 character (note that both
characters are encoded the same way as in all property values, as
\033 \0060 and \033 \0156).
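
As a rough illustration of the index layout just described, the sketch below unescapes an index value and splits it into pairs. The help text does not say how the offset number itself is written, so the offset is returned as raw bytes rather than guessed at. `parse_index` is a made-up name.

```python
import re

_UNESC = {b"0": b"\x00", b"n": b"\n", b"e": b"\x1b"}

def unescape(data):
    # ESC '0' -> NUL, ESC 'n' -> LF, ESC 'e' -> ESC
    return re.sub(rb"\x1b([0ne])", lambda m: _UNESC[m.group(1)], data)

def parse_index(index_value):
    """Split an escaped index property value into (word, offset) pairs."""
    pairs = []
    # After unescaping, NUL separates pairs and LF separates
    # the word from its offset.
    for item in unescape(index_value).split(b"\x00"):
        if not item:
            continue
        word, _, offset = item.partition(b"\n")
        pairs.append((word.decode("utf-8"), offset))  # offset left as bytes
    return pairs
```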

II. Entries section

The entries section contains a collection of dictionary entries. The entries
have variable size. Each entry has two parts - word and meaning. The word
value should be unique. The entries are sorted by the word value (ignoring
the character case and the characters defined in search-ignore-chars).

1. Entries section format

   <entry0> '\0' <entry1> ... <entryN> '\0'

The entries in the database are separated by \0 character.

2. Entries format

   <word> '\012' <meaning>

The word and meaning values are separated by a \012 (LF) character.
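
For an uncompressed file (compression-method none, and already unpacked from .dz), sections II.1 and II.2 come down to two splits. The sketch below assumes the text is UTF-8 and that the file is held in memory as bytes; `split_entries` is a made-up name.

```python
def split_entries(file_bytes):
    """Return [(word, meaning), ...] from an uncompressed bedic file."""
    # Everything after the first NUL (end of header) is the entries section.
    _header, _, body = file_bytes.partition(b"\x00")
    entries = []
    for chunk in body.split(b"\x00"):
        if not chunk:
            continue  # the trailing NUL leaves one empty chunk
        # Only the first LF separates the word from the meaning.
        word, _, meaning = chunk.partition(b"\n")
        entries.append((word.decode("utf-8"), meaning.decode("utf-8")))
    return entries
```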

3. Meaning format

The word meaning is not free-text format. It has a structure defined by
a number of tags. The tags have the following format:

   '{' tag-name '}'   start tag
   '{' '/' tag-name '}'   end tag
   '{' tag-name '/' '}'   empty tag

A word has one or more _senses_. If the word has different homonyms or
meanings that are different parts of speech, they should be defined as
different senses.

For every sense, _part-of-speech_ and _pronunciation_ may be defined.

In a sense, one or more _sub-senses_ are defined. The sub-sense tags contain
a translation/definition of the word. They may contain pointers to other
words in the dictionary (_see-also_ tag), examples of usage (_example_ tag)
or a _headword_ tag.

Tags:

   sense      {s}
   sub-sense   {ss}
   part-of-speech   {ps}
   pronunciation   {pr}
   see-also   {sa}
   example      {ex}
   headword   {hw/}

Example:

The meaning of the word test1

   {s}{ps}n{/ps}{pr}test{/pr}{ss}sub-sense one{/ss}{ss}sub-sense two{/ss}{/s}
   {s}{ps}v{/ps}{pr}test{/pr}{ss}see also {sa}hop{/sa}{/ss}{/s}

will be shown as

   test1 n /test/

   1. sub-sense one
   2. sub-sense two

   -------------------------------

   test1 v /test/

   see also _hop_
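
If your own word list lives in a spreadsheet or a portabase export, the meaning markup above could be generated rather than typed. This is a small sketch under my own assumptions: each sense is a dict with optional "ps" and "pr" keys and a "subs" list, and the senses are joined with nothing in between (the help text shows them on separate lines, but I can't tell whether that line break is part of the format).

```python
def build_meaning(senses):
    """Build the {s}...{/s} markup for one word from a list of senses."""
    parts = []
    for sense in senses:
        text = "{s}"
        if sense.get("ps"):                      # part-of-speech
            text += "{ps}" + sense["ps"] + "{/ps}"
        if sense.get("pr"):                      # pronunciation
            text += "{pr}" + sense["pr"] + "{/pr}"
        for sub in sense.get("subs", []):        # sub-senses
            text += "{ss}" + sub + "{/ss}"
        text += "{/s}"
        parts.append(text)
    return "".join(parts)

# The test1 example from the help text, expressed as data:
test1 = build_meaning([
    {"ps": "n", "pr": "test", "subs": ["sub-sense one", "sub-sense two"]},
    {"ps": "v", "pr": "test", "subs": ["see also {sa}hop{/sa}"]},
])
```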

III. SHCM Compression

bedic supports simple Huffman code based compression. The compression method
is based on shcodec program by Alexander Simakov <xander@online.ru>.

To enable fast lookups, every word and meaning are compressed separately.
The special characters in the compressed data '\0', '\n', and '\033' are
encoded the same way as in the property values.

Version: 0.8
Author: Latchesar Ionkov <lucho@ionkov.net>
Last Modified: 06/11/2002

IV. Step by step guide for building new dictionaries

1. Prepare a plain-text dictionary file in the format described in
sections I and II. The only properties you should care about are 'id' and
perhaps 'search-ignore-chars'.

2. If necessary, convert from the character encoding of the dictionary file
to utf8. The 'konwert' program can do the work in most cases.

3. To create index and fill missing properties, run xerox:

xerox -d raw_data.dic plde-0.9.0.dic

You can find xerox in libbedic/tools. The -d option disables
shcm compression, which is less efficient than dictzip.

4. Run the dictzip tool (www.dict.org) on the dictionary to compress it:

dictzip plde-0.9.0.dic

It should replace plde-0.9.0.dic with a much smaller
plde-0.9.0.dic.dz. The dictionary file is ready to be used with
zbedic.
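
Step 1 of the guide could be scripted roughly as below. This is a sketch I have not run against xerox: it just lays the bytes out as sections I and II describe (properties, a NUL, then NUL-terminated entries sorted while ignoring case and the search-ignore-chars). The function names are made up, and the sort is plain Python lowercasing, which may not match bedic's own ordering for non-ASCII words.

```python
def sort_key(word, ignore_chars="-."):
    # Ignore case and the search-ignore-chars when ordering words.
    return "".join(c for c in word if c not in ignore_chars).lower()

def build_raw_dictionary(dict_id, entries, ignore_chars="-."):
    """entries: {word: meaning markup}. Returns the raw file as bytes."""
    header = f"id={dict_id}\nsearch-ignore-chars={ignore_chars}\n"
    out = header.encode("utf-8") + b"\x00"       # NUL ends the header
    ordered = sorted(entries.items(),
                     key=lambda kv: sort_key(kv[0], ignore_chars))
    for word, meaning in ordered:
        # <word> LF <meaning>, each entry terminated by NUL
        out += f"{word}\n{meaning}".encode("utf-8") + b"\x00"
    return out

raw = build_raw_dictionary("demo", {
    "b-all": "{s}{ss}round thing{/ss}{/s}",
    "Apple": "{s}{ss}fruit{/ss}{/s}",
})
```

Writing `raw` to raw_data.dic with `open("raw_data.dic", "wb")` would give the input for the xerox command in step 3, followed by dictzip in step 4. Since the output is encoded as UTF-8 here, step 2 is covered as long as the source text was decoded correctly when it was read in.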

burao · « **Reply #4 on:** December 14, 2004, 10:01:23 pm »

Wow,
Thanks for the instructions. I will give it a shot later today.

Any luck on Freepwing format for zten?

g333 · « **Reply #5 on:** December 18, 2004, 02:00:16 am »

I just got the zten dictionary and it is great but I want to be able to edit EPWING format dictionaries now. One problem zten has is it can't read the dictionaries on the HDD of the Zaurus SL-C3000. All the ones off the SD card work.

Can anyone tell me how to add an entry or fix the parts I don't understand in the dictionary? I want to make an English example dictionary with Japanese translations in EPWING format.

g333 · « **Reply #6 on:** December 18, 2004, 09:20:21 pm »

Is there a way to install the same program a few times? I want to make icons for Zten that are set up with my preferred dictionary settings.

One would be English.
The next, Japanese to English.
Then, English to Japanese... and so on

halx · « **Reply #7 on:** December 18, 2004, 09:45:56 pm »

Quote

I just got the zten dictionary and it is great but I want to be able to edit EPWING format dictionaries now. One problem zten has is it can't read the dictionaries on the HDD of the Zaurus SL-C3000. All the ones off the SD card work.

Can anyone tell me how to add an entry or fix the parts I don't understand in the dictionary? I want to make an English example dictionary with Japanese translations in EPWING format.

On the C3000 you should already have zdic. No need to install zten as zdic is an improved version of zten.

You have some misconception about the EPWING format (see my answer in other thread). EPWING viewers are just that: viewers. They are not vocabulary list organizers/annotation tools etc. For that purpose you will have to look into other solutions (most likely not EPWING based).

halx · « **Reply #8 on:** December 18, 2004, 09:49:33 pm »

Quote

Is there a way to install the same program a few times? I want to make icons for Zten that are set up with my preferred dictionary settings.

One would be English.
The next, Japanese to English.
Then, English to Japanese... and so on

A somewhat strange concept for organizing dics.

Actually zdic should offer you some options to organize. If not, have a look at ztenv.

g333 · « **Reply #9 on:** December 19, 2004, 03:02:58 am »

zten can switch between dicts quicker, but it doesn't have a bookmark yet. I live in Japan and not being able to select a dictionary quickly bugs me when I'm trying to talk to someone.

halx · « **Reply #10 on:** December 19, 2004, 07:02:25 am »

Quote

zten can switch between dicts quicker, but it doesn't have a bookmark yet. I live in Japan and not being able to select a dictionary quickly bugs me when I'm trying to talk to someone.

I just don't see why you need to switch between dicts. I simply group mine in dictionary groups.

g333 · « **Reply #11 on:** December 19, 2004, 09:19:27 pm »

If I want to use a txt dictionary or other type what would you recommend?

I like EDICT in EPWING format but I can't edit it.
None of the programs have a jump button. EDICT has everything written after the kanji, just not enough entries.

I can open the EDICT from Jim's site and read it perfectly on my Japanese machine. I've almost finished reversing it too (in Excel, English to Japanese).

If you know how to write EPWING files on Windows XP please tell me. Or even a way to use the dictionary info I have on my SL-C3000.

BTW I've only had the Zaurus for 6 days now but I've learnt so much.

halx · « **Reply #12 on:** December 19, 2004, 11:58:18 pm »

Quote

If I want to use a txt dictionary or other type what would you recommend?

I like EDICT in EPWING format but I can't edit it.
None of the programs have a jump button. EDICT has everything written after the kanji, just not enough entries.

I can open the EDICT from Jim's site and read it perfectly on my Japanese machine. I've almost finished reversing it too (in Excel, English to Japanese).

If you know how to write EPWING files on Windows XP please tell me. Or even a way to use the dictionary info I have on my SL-C3000.

BTW I've only had the Zaurus for 6 days now but I've learnt so much.

I use several commercial dictionaries so I don't need/want to modify existing entries.

If you really want to edit EDICT it would probably be nice to contribute to this project, e.g. via the online form on Jim Breen's site.

What is a "jump button"? If you are referring to the jump features in modern denshi jisho, it is usually called cut-and-paste on other platforms.

"Write" EDICT files: You mean create an EPWING file by conversion from existing data? The program is called ebstudio.

BTW, I guess you already tried to search

g333 · « **Reply #13 on:** December 21, 2004, 10:17:31 am »

Could you please explain how I make a dictionary using EBstudio? I have lots of txt based dictionaries and want to start using them.

g333 · « **Reply #14 on:** December 24, 2004, 11:46:35 am »

I found the dictzip program on http://www.ifis.uni-luebeck.de/~duc/Dict/install.html

"A version for Windows is also available. (dictzip is a command-line tool, please use it from a command windows!) "
