> Various Questions About Zbedic, its format, etc.
kurochka
post Nov 17 2005, 11:59 AM
Post #1








I have spent the last couple of days reading all the documents about ZBedic format (the old and the new simplified), man page for mkbedic and the example dictionary file. Here is what I did:

1. I compiled mkbedic.
2. I ran mkbedic on the example dic file (mkbedic example.dic dictionary.dic).
3. I installed the new dictionary on the Zaurus, in the directory where my other dictionaries are.
4. I used the "search for dictionaries" function of ZBedic (alternatively, I wrote the path to dictionary.dic into the zbedic conf file).

Nothing happened. For some reason, ZBedic could not recognize the resulting file. I know I didn't compress the new dic, but I understand that this is not necessary. Then, just in case, I processed dictionary.dic (the file resulting from mkbedic) with xerox (although I understand that mkbedic is a replacement for xerox). This did not work either.

Could somebody walk me through the process and explain what I did wrong? Please either use the example.dic or a very simple dic file, like:

id= Dictionary

Word
{s}{ss}meaning{/ss}{/s}

Give me examples, please.

Thanks in advance
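A minimal sketch of generating such a test file, assuming the layout used in this thread: an `id=` header line, a blank line, then each entry as a keyword line followed by its tagged definition, with a blank line between entries. The closing tags are written as {/ss} and {/s}, as in the later examples in this thread. The function name `render_dic` and the sample entry are illustrative only; the format document remains the authority on the exact rules.

```python
def render_dic(dict_id, entries):
    """Return dictionary source text in the layout sketched in this thread.

    entries is a list of (keyword, senses); senses is a list of senses,
    and each sense is a list of subsense strings.
    """
    lines = ["id=" + dict_id, ""]  # assumed header: id line, then a blank line
    for keyword, senses in entries:
        # Wrap each subsense in {ss}...{/ss} and each sense in {s}...{/s}.
        body = "".join(
            "{s}" + "".join("{ss}" + sub + "{/ss}" for sub in sense) + "{/s}"
            for sense in senses
        )
        lines.append(keyword)
        lines.append(body)
        lines.append("")  # blank line between entries
    return "\n".join(lines)


sample = render_dic("Dictionary", [("Word", [["meaning"]])])
print(sample)
```

The printed text can be saved as a .dic source file and passed to mkbedic as in step 2 above.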
rafm
post Nov 17 2005, 02:46 PM
Post #2








QUOTE(kurochka @ Nov 17 2005, 08:59 PM)
Nothing happened.  For some reason, ZBedic could not recognize the resulting file.  I know I didn't compress the new dic but I understand that it is not necessary.
*


I am afraid that it may be necessary to compress the dictionary. At some point I removed the ".dic" extension from the MIME types, since some people had complained that zbedic was finding too many system files with the ".dic" extension. Now it recognizes only ".dic.dz" files. It may still read ".dic" files if you add ".dic" to the MIME types, but I wouldn't bet on that.
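The two steps (compile, then compress) can be sketched as follows. The mkbedic invocation is the one quoted in the first post. The compressor is an assumption: the thread does not name it, and this sketch assumes dictzip, which by default replaces its input file with one carrying an added ".dz" suffix. The helper name `build_commands` is illustrative only.

```python
def build_commands(source, compiled):
    """Return the command lines for compiling and then compressing a dictionary."""
    return [
        ["mkbedic", source, compiled],  # as quoted in the first post
        ["dictzip", compiled],          # assumption: produces compiled + ".dz"
    ]


commands = build_commands("example.dic", "dictionary.dic")
for cmd in commands:
    print(" ".join(cmd))

# To actually run them (requires both tools to be installed):
#   import subprocess
#   for cmd in commands:
#       subprocess.run(cmd, check=True)
```

Under that assumption, the result would be dictionary.dic.dz, which is the extension zbedic looks for.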
kurochka
post Nov 26 2005, 07:38 PM
Post #3








I think I am making progress (but very slowly). Here is another problem that I am facing.

I need to make a ZBedic dictionary (this is still a test dictionary to figure out the inner workings). I've prepared the text .dic file (size 3.7 MB). The mkbedic command runs without any errors, but the resulting dictionary is exactly 0 bytes.

I've re-read the man page for mkbedic and it says that the command cannot process "very large files." When I make the .dic files smaller (just a couple of pages), then everything works and zbedic can access the dictionary.

My intention is to make a large dictionary (tens of thousands of entries, about 40 MB or more in txt format) once I have figured out how everything works. So, I need to come up with a solution.

Does this mean I have to use xerox? If so, it will be tough for me. It took me a while to figure out the simplified format. I am not sure if I can handle the original format. Can I use a text editor to prepare the original format dictionary file? If so, how do I enter a 0 byte, etc.?

Can it be that the problem is not the size but something else? But there is no error. What is meant by the "very large files"?
kurochka
post Nov 26 2005, 09:06 PM
Post #4








Ok, one more question for those in the know.

I need to put pronunciation (or transcription) for entry words in IPA (International Phonetic Alphabet) http://en.wikipedia.org/wiki/IPA and http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm

There is a tag for pronunciation {pr} {/pr}. However, I think that none of the unicode fonts for Zs have IPA in them, right? I can't even find a font for Windows with IPA.

Does somebody know whether there is a font for the Z that would work on VGA screens and include IPA? I haven't tested. Maybe unifont for Zbedic includes IPA. Does anybody know for sure?
rafm
post Nov 27 2005, 01:51 AM
Post #5








QUOTE(kurochka @ Nov 27 2005, 04:38 AM)
I think I am making progress (but very slow).  Here is another problem that I am facing.

I need to make a ZBedic dictionary (this is still a test dictionary to figure out the inner workings).  I've prepared the text .dic file (size 3.7 MB).  Mkbedic command runs without any errors but the resulting dictionary is exactly 0 bytes. 

I've re-read the man page for mkbedic and it says that the command cannot process "very large files." When I make the .dic files smaller (just a couple of pages), then everything works and zbedic can access the dictionary.

Can it be that the problem is not the size but something else?  But there is no error.  What is meant by the "very large files"?
*


I should have been more precise: "very large files" means >2GB. So a dictionary of a few megabytes should work fine.

If mkbedic does not show any error and you still get a 0-byte file, this may be a bug. If possible, please send the file to rafm at users.sourceforge.net, so I can check what goes wrong.
kurochka
post Nov 29 2005, 08:58 AM
Post #6








QUOTE(rafm @ Nov 27 2005, 01:51 AM)
I should have been more precise: very large files means >2GB. So a few megabyte dictionary should work fine.

If mkbedic does not show any error and still you get 0 bytes file, this can be a bug. If possible, please send me this file to rafm at users.sourceforge.net, so I can check what goes wrong.
*


Thanks, rafm. That is good news. None of my dictionaries will exceed 2GB (I guess it only matters for Wikipedia and other such big projects).

The original problem may not actually be connected to mkbedic. The files were 0 at first. Then, after a while I looked at them again and they were of normal size. It's weird but now everything works - I just have to wait if the size is 0 and then it fixes itself.

I will keep this thread going by asking other questions and making suggestions and comments:

1. Emphasis Tag. I've noticed that the {em} tag is barely visible on the screen. I understand it makes letters bold, but somehow they are almost indistinguishable from normal letters. I wonder if {em} could instead make the tagged text a different color to make it stand out. I am using this tag for showing accent in the words. BTW, if the full text search is implemented, will the {de} tags within the word break the search? Are there any other useful tags that can work for showing word accent (stress)?

2. Ways to Show Word Accent/Stress. As indicated earlier, I use {em} tags to show word stress in the words in the translation portion. This seems somewhat awkward now but it works for the translation portion. How can I show word accent for words for the keywords without ruining the search mechanism? One solution that might work (I am not sure though) -- I could probably use the Unicode stress symbol (I think there is a special symbol in Unicode for word accent) and put it in the ignore char list. Will this work? For example, if the keyword is "a'rmy" (showing stress on A) will the search for "army" locate the keyword if I use the above described approach?

3. Category Tag. The format description document indicates that each sense or subsense can have zero or one category-tagged text. I have noticed that this is not enforced by zbedic, and the sense or subsense can have zero, one or more than one category-tagged text. This is great news, as words/meanings can belong to multiple categories (e.g., medicine and chemistry at the same time). So, please leave this as it is. I think the official format definition should also be amended to allow zero or any number of categories.

4. Part of Speech Tag. The format description states that each sense can have zero or one part-of-speech tagged text {ps}. This is enforced by zbedic. If a sense has more than one {ps}, then only the first one is shown; the others disappear. This makes sense (pun intended). However, I just want to note that when converting from other dictionary formats it is too burdensome (it probably has to be done manually) to convert the entries that have multiple part-of-speech tags in one sense. The solution for me was to just use {de} instead of {ps}, because the sense and subsense can have any number of {de}'s. I have noticed that the Mueller English-Russian dictionary uses the {ps} tag to put a Roman numeral for each sense (e.g., "a I" then a line and "a II" in the translation window). I think I will also use it for this purpose.

5. Strict Order of Opening/Closing Tags. Some dictionary formats (e.g., DSL for Lingvo) allow any order of tags as long as the closing tag follows the corresponding opening tag (e.g., [ex][cl] any word[/ex][/cl] in DSL). Zbedic enforces the order of closing tags depending on the order of opening tags: an outer opening tag must be matched by an outer closing tag, and an inner opening tag by an inner closing tag (e.g., {ex}{de}Text{/de}{/ex} and not {ex}{de}Text{/ex}{/de}). This is just a note for others (some of my entries did not work because of this). I think it makes sense to enforce the order.

6. Pronunciation Tag. I know that a lot of dictionaries use IPA (International Phonetic Alphabet) for pronunciation/transcription of words. So far, I have not seen a font that supports IPA for the Zaurus. Therefore, when converting a dictionary I just deleted the pronunciation portion, which is a shame. Maslovsky, do you know any fonts that have IPA and Cyrillic in them?

7. Use of Senses and Subsenses. When converting dictionaries, I went the easier route of keeping the hardcoded separation of senses and subsenses (no tags, just text "1." "2." and "1)" and "2)" ). I just put the whole thing into one sense and subsense. The better way is to replace it with the Zbedic separation into {s} and {ss}. But it works anyway. Does anybody see a problem with this approach?

8. Conversion Process. Since I do not know any programming language, I just used find and replace (including regular expressions) to convert the dictionaries in other formats into Zbedic format. I know that there are some scripts available, but they are specific to the format from which the conversion takes place (Wikipedia, Mueller). I would appreciate it if people would share their scripts with the community here or at the Zbedic SF site. Maybe I could adapt those for my use.

9. Making the Source Files for Dictionaries Available. I know that dic.dz can be opened and modified, but I think it would be more accessible for those who want to learn the format and/or modify the text of the dictionary files if regular text .dic files were made available on the SF site (with the new mkbedic the source files are pure text).

10. New Line Break Tags. Looking at the example.dic I see that there are new tags available for line break {br/}. I guess this would be the only tag that does not have/need a corresponding second tag.

11. Just A Sense Without Subsenses? Don't know why I have not tried it yet, but I wonder if there can be an entry with just a simple sense (e.g., {s}meaning{/s}) without subsenses? There are lots of words that require a simple one- or two-word translation, and the {ss} tag seems redundant.
rafm
post Nov 29 2005, 01:51 PM
Post #7








QUOTE(kurochka @ Nov 29 2005, 05:58 PM)
1. Emphasis Tag. I've noticed that {em} tag is hardly seen on the screen.  I understand it makes letters bold but somehow they are almost indistinguishable from normal letters.  I wonder if {em} could instead make the tagged text a different color to make it stand out.  I am using this tag for showing accent in the words.  BTW, if the full text search is implemented, will the {de} tags within the word break the search?  Any other useful tags that can work for showing word accent (stress)?


I know that there is a problem with the SL5500: its HTML widget does not handle colors well. I will look into why there is not much difference on the SL-C series.

QUOTE
2. Ways to Show Word Accent/Stress.  As indicated earlier, I use {em} tags to show word stress in the words in the translation portion.  This seems somewhat awkward now but it works for the translation portion.  How can I show word accent for words for the keywords without ruining the search mechanism?  One solution that might work (I am not sure though) --  I could probably use the Unicode stress symbol (I think there is a special symbol in Unicode for word accent) and put it in the ignore char list.  Will this work?  For example, if the keyword is "a'rmy" (showing stress on A) will the search for "army" locate the keyword if I use the above described approach?


This should work. But wouldn't it be better to include stress in the pronunciation?
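A minimal sketch of the idea behind an ignore-character list: characters on the list are dropped before keywords are compared, so a stress mark in the stored keyword does not affect lookup. This is not zbedic's actual code; the function name `search_key` and the contents of `IGNORE` (an apostrophe and the Unicode combining acute accent U+0301) are assumptions for illustration.

```python
def search_key(keyword, ignore_chars):
    """Drop ignored characters and lowercase the rest, for comparison only."""
    return "".join(ch for ch in keyword if ch not in ignore_chars).lower()


# Hypothetical ignore list: apostrophe and combining acute accent.
IGNORE = {"'", "\u0301"}

stored = "a'rmy"   # keyword as written in the dictionary, with a stress mark
typed = "army"     # what the user enters in the search box
print(search_key(stored, IGNORE) == search_key(typed, IGNORE))
```

The stored keyword keeps its stress mark for display; only the comparison key is stripped.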

QUOTE
3. Category Tag. The format description document indicates that each sense or subsense can have zero or one category-tagged text. I have noticed that this is not enforced by zbedic, and the sense or subsense can have zero, one or more than one category-tagged text. This is great news, as words/meanings can belong to multiple categories (e.g., medicine and chemistry at the same time). So, please leave this as it is. I think the official format definition should also be amended to allow zero or any number of categories.


OK. I will update the specification.

QUOTE
4. Part of Speech Tag. The format description states that each sense can have zero or one part-of-speech tagged text {ps}. This is enforced by zbedic. If a sense has more than one {ps}, then only the first one is shown; the others disappear. This makes sense (pun intended). However, I just want to note that when converting from other dictionary formats it is too burdensome (it probably has to be done manually) to convert the entries that have multiple part-of-speech tags in one sense. The solution for me was to just use {de} instead of {ps}, because the sense and subsense can have any number of {de}'s. I have noticed that the Mueller English-Russian dictionary uses the {ps} tag to put a Roman numeral for each sense (e.g., "a I" then a line and "a II" in the translation window). I think I will also use it for this purpose.


This is specific to zbedic - each entry should be unique, otherwise search does not work. Currently it would be too much work to change it.

You should take a look at: http://www.freedict.org/en/ They store dictionaries in XML and have scripts to convert from XML to multiple dictionary formats, including bedic. The scripts can handle merging multiple parts of speech into one entry with multiple "senses". You could contribute your dictionary to this project.

QUOTE
5. Strict Order of Opening/Closing Tags. Some dictionary formats (e.g., DSL for Lingvo) allow any order of tags as long as the closing tag follows the corresponding opening tag (e.g., [ex][cl] any word[/ex][/cl] in DSL). Zbedic enforces the order of closing tags depending on the order of opening tags: an outer opening tag must be matched by an outer closing tag, and an inner opening tag by an inner closing tag (e.g., {ex}{de}Text{/de}{/ex} and not {ex}{de}Text{/ex}{/de}). This is just a note for others (some of my entries did not work because of this). I think it makes sense to enforce the order.


zbedic has a very simple parser, which may fail if the syntax is wrong. mkbedic should perform syntax check in the future.
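Until mkbedic performs such a check, a nesting check can be run on the source text beforehand. The following is a minimal sketch, not part of mkbedic or zbedic: it assumes tags are written as {name}, {/name} and self-closing {name/} (as with {br/}), and the function name `check_nesting` is illustrative.

```python
import re

# Matches {name}, {/name} and {name/}; name is lowercase letters only.
TAG_RE = re.compile(r"\{(/?)([a-z]+)(/?)\}")


def check_nesting(text):
    """Return a list of problems; an empty list means tags are well nested."""
    problems = []
    stack = []  # names of currently open tags, innermost last
    for match in TAG_RE.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # e.g. {br/}: no partner tag expected
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            innermost = stack[-1] if stack else "none"
            problems.append(
                "unexpected {/%s} (innermost open tag: %s)" % (name, innermost)
            )
    problems.extend("unclosed {%s}" % name for name in stack)
    return problems


print(check_nesting("{ex}{de}Text{/de}{/ex}"))
print(check_nesting("{ex}{de}Text{/ex}{/de}"))
```

Running it over each definition line before calling mkbedic would flag entries like the second one, which closes {ex} while {de} is still open.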

QUOTE
7.  Use of Senses and Subsenses.  When converting dictionaries, I went the easier route of keeping the hardcoded separation of senses and subsenses (no tags, just text "1." "2." and "1)" and "2)" ).  I just put the whole thing into one sense and subsense.  The better way is to replace it with the Zbedic separation into {s} and {ss}.  But it works anyway.  Anybody sees a problem with this approach?


There should be no problem, but using "ss" tags is recommended.
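For hardcoded subsense numbers of the form "1)" and "2)", the conversion to {ss} tags can be sketched with a single regular-expression split. This is a rough sketch under the assumption that a digit followed by ")" always marks a new subsense; text such as "(see 3)" would be split incorrectly, so the output needs review. The function name `numbered_to_subsenses` is illustrative.

```python
import re


def numbered_to_subsenses(text):
    """Turn '1) foo 2) bar' into '{s}{ss}foo{/ss}{ss}bar{/ss}{/s}'."""
    # Split on markers like "1)" or "12)" and discard empty pieces.
    parts = [part.strip() for part in re.split(r"\d+\)", text)]
    parts = [part for part in parts if part]
    return "{s}" + "".join("{ss}" + part + "{/ss}" for part in parts) + "{/s}"


print(numbered_to_subsenses("1) accident 2) bad luck"))
```

Text without any markers ends up as a single subsense, which matches the simpler approach described in the quoted question.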

QUOTE
8. Conversion Process. Since I do not know any programming language, I just used find and replace (including regular expressions) to convert the dictionaries in other formats into Zbedic format. I know that there are some scripts available, but they are specific to the format from which the conversion takes place (Wikipedia, Mueller). I would appreciate it if people would share their scripts with the community here or at the Zbedic SF site. Maybe I could adapt those for my use.

9. Making the Source Files for Dictionaries Available. I know that dic.dz can be opened and modified, but I think it would be more accessible for those who want to learn the format and/or modify the text of the dictionary files if regular text .dic files were made available on the SF site (with the new mkbedic the source files are pure text).


freedict has perhaps the right set of tools and can store a "source" version of the dictionary.

QUOTE
10.  New Line Break Tags.  Looking at the example.dic I see that there are new tags available for line break {br/}.  I guess this would be the only tag that does not have/need a corresponding second tag.


Yes.

QUOTE
11. Just A Sense Without Subsenses? Don't know why I have not tried it yet, but I wonder if there can be an entry with just a simple sense (e.g., {s}meaning{/s}) without subsenses? There are lots of words that require a simple one- or two-word translation, and the {ss} tag seems redundant.
*


A single sense without subsenses should work.
kurochka
post Nov 29 2005, 06:31 PM
Post #8








Thanks for your responses. I am not complaining. I think zbedic is one of the greatest programs for Zaurus and you are kind enough to keep it going.
kurochka
post Nov 30 2005, 06:46 PM
Post #9








Here is another quirk that I have noticed.

If {ct}-tagged text appears within an example ({ex}), then that {ct}-text is displayed on the first line next to the keyword (similar to how {ps}-text is displayed), instead of in the body of the example.

If there are several examples and each contains {ct} tagged text, only the last {ct}-text will be displayed next to the keyword, the others will not be displayed at all. The solution is to move the {ct} outside of {ex}.
rafm
post Nov 30 2005, 11:17 PM
Post #10








QUOTE(kurochka @ Dec 1 2005, 03:46 AM)
Here is another quirk that I have noticed.

If {ct}-tagged text appears within an example ({ex}), then that {ct}-text is displayed on the first line next to the keyword (similar to how {ps}-text is displayed), instead of in the body of the example.

If there are several examples and each contains {ct} tagged text, only the last {ct}-text will be displayed next to the keyword, the others will not be displayed at all.  The solution is to move the {ct} outside of {ex}.
*


Category, {ct}, is usually associated with {ss}. It indicates that the particular meaning of a word is used only in, for example, mathematics.
kurochka
post Dec 1 2005, 08:46 AM
Post #11








QUOTE(rafm @ Nov 30 2005, 11:17 PM)
QUOTE(kurochka @ Dec 1 2005, 03:46 AM)
Here is another quirk that I have noticed.

If {ct}-tagged text appears within an example ({ex}), then that {ct}-text is displayed on the first line next to the keyword (similar to how {ps}-text is displayed), instead of in the body of the example.

If there are several examples and each contains {ct} tagged text, only the last {ct}-text will be displayed next to the keyword, the others will not be displayed at all.  The solution is to move the {ct} outside of {ex}.
*


Category, {ct}, is usually associated with {ss}. It indicates that the particular meaning of a word is used only in, for example, mathematics.
*



I understand.

In the dictionary that I am converting, some examples (which are phrases that contain the keyword) belong to a different {ct} category from the {ss} meaning (which may have no specific {ct}). For instance, a phrase given as an example ({ex}) may be used only in politics while the word itself is in general use.

Well, there is no need for modifications to Zbedic. I will just modify the code of the dictionaries. This is something to be mindful of when making dic files.
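Modifying the dictionary source in this way can be sketched as a regular-expression pass that moves every {ct}...{/ct} found inside an {ex}...{/ex} to just before the opening {ex}. This is a rough sketch, assuming examples are not nested inside one another; the function name `hoist_categories` is illustrative, and the sample string is a shortened English stand-in for a real entry.

```python
import re

EX_RE = re.compile(r"\{ex\}(.*?)\{/ex\}", re.DOTALL)
# A category tag plus any whitespace that follows it.
CT_RE = re.compile(r"\{ct\}.*?\{/ct\}\s*", re.DOTALL)


def hoist_categories(definition):
    """Move {ct} tags from inside each {ex} to directly before it."""

    def fix(match):
        body = match.group(1)
        categories = [tag.strip() for tag in CT_RE.findall(body)]
        if not categories:
            return match.group(0)  # nothing to move, leave example as is
        cleaned = CT_RE.sub("", body)
        return " ".join(categories) + " {ex}" + cleaned + "{/ex}"

    return EX_RE.sub(fix, definition)


before = "{ex}homicide by misadventure - {ct}law{/ct} accidental killing{/ex}"
print(hoist_categories(before))
```

The category then sits outside the example, in the body of the subsense, so it is no longer hoisted to the keyword line.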
kurochka
post Dec 3 2005, 07:54 AM
Post #12








QUOTE(rafm @ Nov 30 2005, 11:17 PM)
QUOTE(kurochka @ Dec 1 2005, 03:46 AM)
Here is another quirk that I have noticed.

If {ct}-tagged text appears within an example ({ex}), then that {ct}-text is displayed on the first line next to the keyword (similar to how {ps}-text is displayed), instead of in the body of the example.

If there are several examples and each contains {ct} tagged text, only the last {ct}-text will be displayed next to the keyword, the others will not be displayed at all.  The solution is to move the {ct} outside of {ex}.
*


Category, {ct}, is usually associated with {ss}. It indicates that the particular meaning of a word is used only in, for example, mathematics.
*



I've tested it now and it does not make sense. As indicated earlier in this thread, {ct} does not behave similarly to {ps} anywhere but within an example. As I said earlier, one {ss} can have multiple {ct}. Therefore, I think the behaviour of {ct} within an example is anomalous. Here is an example from my upcoming English-Ukrainian dictionary:

CODE
misadventure 1
{s}{ss}{ps}n{/ps}
нещ{em}а{/em}стя, нещ{em}а{/em}сний в{em}и{/em}падок
{ex}homicide by misadventure - {ct}юр.{/ct} ненавм{em}и{/em}сне вб{em}и{/em}вство{/ex}{/ss}{/s}

misadventure 2
{s}{ss}{ps}n{/ps}
нещ{em}а{/em}стя, нещ{em}а{/em}сний в{em}и{/em}падок
{ct}юр.{/ct} {ex}homicide by misadventure - ненавм{em}и{/em}сне вб{em}и{/em}вство{/ex}{/ss}{/s}


I could not attach a compiled version for some reason. But if you compile it, you will see that in "misadventure 2" {ct} is displayed in the body of the subsense, as I think it should be. In "misadventure 1" {ct} is displayed the same way as {ps} and makes the entry erroneous, because only the example belongs to the legal (юр. means legal) category, while the word by itself is in general use. The only difference between "misadventure 1" and "misadventure 2" is the position of {ct}: in 1 it's within the example; in 2 it's outside the example. Therefore, I advocate modifying the way {ct} within an example is displayed but leaving {ct} outside an example as it is.
kurochka
post Dec 3 2005, 01:35 PM
Post #13








I have converted English-Ukrainian and Ukrainian-English dictionaries to zbedic format from the Lingvo format. Although the Lingvo files for these dictionaries are freely available, I could not trace what sort of license is attached to them. If somebody needs them, send me a pm.
ludo
post Dec 19 2005, 07:27 PM
Post #14








Hello kurochka and rafm

This is a very interesting thread. Though I am far from your level of knowledge and couldn't claim to be able to create a dictionary, I have a few questions too:

-I want to add entries to an existing dictionary: can I do it, and how? I am not asking for a detailed step-by-step answer, just hints to get started. I would like to add entries to the Chinese-English / English-Chinese dictionary. I would also be interested in adding pinyin pronunciation for the Chinese.

-Then I would like to make a mini specialized dictionary, Japanese-English-Japanese, of woodworking and wood-related terms. Where to start?

Thanks for any of your help. I am with you to promote the ZBEDIC.

kurochka, I am making good progress with your C3000!

Ludo from Taiwan
ludo
post Dec 19 2005, 09:03 PM
Post #15








I can already answer some of my own questions.
I am reporting my findings here in case anyone is interested. I'm sure most of you know this already, but there may be other newbies (like me) who are interested.

First of all, the ZBedic home page:
http://bedic.sourceforge.net/index.html

To make a dictionary:
Very good explanations are given here, with step by step at the end of the page:
http://cvs.sourceforge.net/viewcvs.py/bedi...1.5&view=markup

The feature to edit an existing dictionary will be developed in the near future.
See news:
http://bedic.sourceforge.net/index.html#news

Now, I need to find out how to create a Japanese dictionary, which seems to be more tricky.

Any help from anyone?

Ludo
