Rename .dz to .gz and decompress it with an unzip program (something like Aladdin Expander).
Then you get a really ugly text file.
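If renaming and unzipping by hand is inconvenient, the same step can probably be done in a few lines of Python. This is only a sketch: it assumes the .dz file was produced by dictzip, whose output is meant to be readable as ordinary gzip data, and the function name read_dz is made up for this example.

```python
import gzip


def read_dz(path):
    """Return the decompressed bytes of a dictzip (.dz) file.

    Assumption: the file is gzip-compatible, so no renaming is needed.
    """
    with gzip.open(path, "rb") as fh:
        return fh.read()
```

The result is the raw dictionary file as bytes, not text, because it contains \0 separators.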
I haven't really checked this help text, but maybe it's the same
The bedic dictionary file has two sections -- header and entries. The
dictionary file ends with a '\000' character.
I. Header section
The header contains the dictionary properties. All data in the header is
encoded using UTF-8. The end of the header is marked with a \0 character.
1. Properties
A property definition has the following format:
name '=' value LF
where LF is the line feed character (\012). Both the name and the value can
contain \000 and LF characters, which are encoded as follows:
\000 -> \033 \060 (ESC '0')
\012 -> \033 \156 (ESC 'n')
\033 -> \033 \145 (ESC 'e')
where ESC is \033. The name cannot contain the '=' (\075) character.
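The escaping rules can be sketched in Python as follows. The function names escape and unescape are chosen for this example and are not taken from libbedic. The order of the replacements matters, as the comments explain.

```python
def escape(data: bytes) -> bytes:
    """Encode \\000, LF and ESC as two-byte ESC sequences."""
    # ESC must be handled first, otherwise the ESC bytes introduced
    # by the other two replacements would be escaped a second time.
    return (data.replace(b"\x1b", b"\x1b" + b"e")
                .replace(b"\x00", b"\x1b" + b"0")
                .replace(b"\n", b"\x1b" + b"n"))


def unescape(data: bytes) -> bytes:
    """Reverse escape(). ESC 'e' is decoded last so that the ESC bytes
    it produces are not mistaken for the start of another sequence."""
    return (data.replace(b"\x1b" + b"0", b"\x00")
                .replace(b"\x1b" + b"n", b"\n")
                .replace(b"\x1b" + b"e", b"\x1b"))
```

After escaping, a raw \000 or LF byte never appears inside a name or value, which is what lets these two bytes serve as terminators.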
These are the currently defined properties:
- id (required)
The name of the dictionary
- max-entry-length (default 8192)
Maximum length of a database entry
- max-word-length (default 50)
Maximum length of a word
- index
Dictionary index. The format of the index is described below
- compression-method (default none)
Compression method. Allowed values are 'none' and 'shcm'.
- shcm-tree (required if compression-method is shcm)
Data used by the shcm compression algorithm.
- search-ignore-chars (default '-.')
The characters defined in this property are ignored when a search
is performed. For example if the user typed 'b-all' the word
'ball' will be found.
- commentXX
Comments. Currently used for license information
All other properties are ignored.
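A minimal sketch of reading the header, assuming the whole file is available as a bytes object. The names parse_header and DEFAULTS are invented for this example. Values are left as bytes so that the caller decides how to decode them. The defaults are the ones listed above.

```python
DEFAULTS = {
    "max-entry-length": b"8192",
    "max-word-length": b"50",
    "compression-method": b"none",
    "search-ignore-chars": b"-.",
}


def unescape(data: bytes) -> bytes:
    # ESC 'e' is decoded last so new ESC bytes are not reinterpreted.
    return (data.replace(b"\x1b" + b"0", b"\x00")
                .replace(b"\x1b" + b"n", b"\n")
                .replace(b"\x1b" + b"e", b"\x1b"))


def parse_header(data: bytes):
    """Return (properties, offset of the entries section)."""
    end = data.index(b"\x00")          # first raw \0 ends the header
    props = dict(DEFAULTS)
    for line in data[:end].split(b"\n"):
        name, sep, value = line.partition(b"=")
        if not sep:
            continue                   # empty or malformed line
        props[unescape(name).decode("utf-8")] = unescape(value)
    if "id" not in props:
        raise ValueError("required property 'id' is missing")
    return props, end + 1
```

Splitting on the first '=' is safe because the name cannot contain that character, while the value may.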
2. Index property
If the index property exists, it speeds up dictionary lookup. The index
contains a collection of (word, offset) pairs. The words are sorted (ignoring
the character case and the characters defined in search-ignore-chars). The
offset is relative to the beginning of the entries section of the file. The
pairs are separated by a \000 character; the word and the offset in a pair are
separated by a \012 character. Note that both characters are encoded the same
way as in all other property values, as \033 \060 and \033 \156.
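The index layout might be decoded along these lines. This sketch takes the index value exactly as stored in the header, that is, still escaped. The text above does not say how the offset itself is written, so it is returned as raw bytes rather than converted to a number. The name parse_index is hypothetical.

```python
def unescape(data: bytes) -> bytes:
    # ESC 'e' is decoded last so new ESC bytes are not reinterpreted.
    return (data.replace(b"\x1b" + b"0", b"\x00")
                .replace(b"\x1b" + b"n", b"\n")
                .replace(b"\x1b" + b"e", b"\x1b"))


def parse_index(value: bytes):
    """Split an escaped index property into (word, raw_offset) pairs."""
    pairs = []
    for chunk in unescape(value).split(b"\x00"):
        if not chunk:
            continue                       # tolerate a trailing separator
        word, _, offset = chunk.partition(b"\n")
        pairs.append((word.decode("utf-8"), offset))
    return pairs
```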
II. Entries section
The entries section contains a collection of dictionary entries. The entries
have variable size. Each entry has two parts: word and meaning. The word
value should be unique. The entries are sorted by the word value (ignoring
the character case and the characters defined in search-ignore-chars).
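The sort order can be approximated with a key function like the one below. It is a sketch only: the exact comparison bedic uses (for example, how non-ASCII letters are ordered) is not described here, and sort_key is a name made up for this example.

```python
def sort_key(word: str, ignore_chars: str = "-.") -> str:
    """Drop the ignored characters, then fold case."""
    return "".join(ch for ch in word if ch not in ignore_chars).lower()


words = ["Cat", "b-all", "apple", "B.C."]
print(sorted(words, key=sort_key))
# prints ['apple', 'b-all', 'B.C.', 'Cat']
```

With the default search-ignore-chars, 'b-all' and 'ball' produce the same key, which is why a search for one finds the other.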
1. Entries section format
<entry0> '\0' <entry1> ... <entryN> '\0'
The entries in the database are separated by a \0 character; the last entry is followed by one as well.
2. Entries format
<word> '\012' <meaning>
The word and meaning values are separated by a \012 (LF) character.
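For an uncompressed dictionary, walking the entries section could look like this. The sketch assumes the section is held in memory as bytes and that words and meanings are UTF-8 text. The name iter_entries is invented for this example. It also reports each entry's offset from the start of the section, which is the kind of offset the index refers to.

```python
def iter_entries(section: bytes):
    """Yield (offset, word, meaning) for each entry in the section."""
    offset = 0
    for chunk in section.split(b"\x00"):
        if chunk:
            # Only the first LF separates word and meaning.
            word, _, meaning = chunk.partition(b"\n")
            yield offset, word.decode("utf-8"), meaning.decode("utf-8")
        offset += len(chunk) + 1       # +1 for the \0 separator


section = b"ball\nround\x00cat\nanimal\x00"
for entry in iter_entries(section):
    print(entry)
# prints (0, 'ball', 'round') and then (11, 'cat', 'animal')
```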
3. Meaning format
The word meaning is not in free-text format. Its structure is defined by
a number of tags. The tags have the following format:
'{' tag-name '}' start tag
'{' '/' tag-name '}' end tag
'{' tag-name '/' '}' empty tag
A word has one or more _senses_. If the word has different homonyms or
meanings that are different parts of speech, they should be defined as
different senses.
For every sense, _part-of-speech_ and _pronunciation_ may be defined.
In a sense, one or more _sub-senses_ are defined. The sub-sense tags contain
a translation/definition of the word. They may contain pointers to other
words in the dictionary (_see-also_ tag), examples of usage (_example_ tag),
or a _headword_ tag.
Tags:
sense {s}
sub-sense {ss}
part-of-speech {ps}
pronunciation {pr}
see-also {sa}
example {ex}
headword {hw/}
Example:
The meaning of the word test1
{s}{ps}n{/ps}{pr}test{/pr}{ss}sub-sense one{/ss}{ss}sub-sense two{/ss}{/s}
{s}{ps}v{/ps}{pr}test{/pr}{ss}see also {sa}hop{/sa}{/ss}{/s}
will be shown as
test1 n /test/
1. sub-sense one
2. sub-sense two
-------------------------------
test1 v /test/
see also _hop_
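A rough rendering of the example above can be written with regular expressions. The layout rules here (numbering only when a sense has more than one sub-sense, a dashed line between senses, underscores around see-also targets) are inferred from this one example and are not a description of how zbedic actually draws entries. The {ex} and {hw/} tags are left untouched, and the name render is hypothetical.

```python
import re


def first(tag, text):
    """Return the content of the first {tag}...{/tag} pair, or None."""
    m = re.search(r"\{%s\}(.*?)\{/%s\}" % (tag, tag), text, re.S)
    return m.group(1) if m else None


def render(word, meaning):
    blocks = []
    for sense in re.findall(r"\{s\}(.*?)\{/s\}", meaning, re.S):
        head = word
        ps, pr = first("ps", sense), first("pr", sense)
        if ps:
            head += " " + ps
        if pr:
            head += " /" + pr + "/"
        subs = re.findall(r"\{ss\}(.*?)\{/ss\}", sense, re.S)
        # Show see-also targets between underscores.
        subs = [re.sub(r"\{sa\}(.*?)\{/sa\}", r"_\1_", s) for s in subs]
        if len(subs) > 1:
            subs = ["%d. %s" % (i, s) for i, s in enumerate(subs, 1)]
        blocks.append("\n".join([head] + subs))
    return ("\n" + "-" * 31 + "\n").join(blocks)
```

Calling render("test1", meaning) with the two-sense meaning from the example gives the six lines shown above.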
III. SHCM Compression
bedic supports simple Huffman-code-based compression. The compression method
is based on the shcodec program by Alexander Simakov <xander@online.ru>.
To enable fast lookups, every word and meaning is compressed separately.
The special characters in the compressed data ('\0', '\n', and '\033') are
encoded the same way as in the property values.
Version: 0.8
Author: Latchesar Ionkov <lucho@ionkov.net>
Last Modified: 06/11/2002
IV. Step by step guide for building new dictionaries
1. Prepare a plain-text dictionary file in the format described in
sections I and II. The only properties you need to care about are 'id' and
perhaps 'search-ignore-chars'.
2. If necessary, convert from the character encoding of the dictionary file
to UTF-8. The 'konwert' program can do the work in most cases.
3. To create the index and fill in missing properties, run xerox:
xerox -d raw_data.dic plde-0.9.0.dic
You can find xerox in libbedic/tools. The -d option disables shcm
compression, which is less efficient than dictzip.
4. Run the dictzip tool (www.dict.org) on the dictionary to compress it:
dictzip plde-0.9.0.dic
It should replace plde-0.9.0.dic with a much smaller
plde-0.9.0.dic.dz. The dictionary file is then ready to be used with
zbedic.
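Step 1 can also be scripted. The sketch below builds the bytes of a raw dictionary file from a Python dict, writing only the 'id' and 'search-ignore-chars' properties and leaving the rest to xerox. Several things here are assumptions rather than documented behaviour: that xerox expects the entries already sorted, that uncompressed words and meanings are stored without escaping (so the sketch rejects text containing \0, or LF in a word), and the function name build_raw_dictionary itself.

```python
def build_raw_dictionary(name, entries, ignore_chars="-."):
    """Return header + entries as bytes, ready to be written to disk."""

    def esc(text):
        # Header escaping; ESC is handled first to avoid double escaping.
        data = text.encode("utf-8")
        return (data.replace(b"\x1b", b"\x1b" + b"e")
                    .replace(b"\x00", b"\x1b" + b"0")
                    .replace(b"\n", b"\x1b" + b"n"))

    def key(word):
        return "".join(c for c in word if c not in ignore_chars).lower()

    out = [b"id=", esc(name), b"\n",
           b"search-ignore-chars=", esc(ignore_chars), b"\n",
           b"\x00"]                          # end of header
    for word in sorted(entries, key=key):
        meaning = entries[word]
        if "\x00" in word or "\n" in word or "\x00" in meaning:
            raise ValueError("unsupported character in entry %r" % word)
        out += [word.encode("utf-8"), b"\n",
                meaning.encode("utf-8"), b"\x00"]
    return b"".join(out)
```

The returned bytes would be written to a file such as raw_data.dic in binary mode and then passed to xerox and dictzip as described in steps 3 and 4.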