RE: Simplified Chinese (GB2312) in Manila
Posted: 2/22/2002; 11:10 AM by Emmanuel. M. Decarie
Last Modified: 2/22/2002; 11:10 AM by Emmanuel. M. Decarie
In Response To: RE: Simplified Chinese (GB2312) in Manila (#16141)
>Read on the web at http://community.scriptmeridian.org/16141
>----------------------------------
>
>Hello Emmanuel,
>
>>
>>Now the client wants me to check whether he can use the search engine.
>>
>>In theory, it should work somehow (with some patching, I guess), but
>>I'm not sure. I need to do further testing. I think I need to send
>>ISO-8859-1 strings to the indexing routine instead of MacRoman (the
>>server is on Mac OS 9.0.4).
>>
>
>I don't know at all how the search engine in Frontier works, so I can only
>guess. But I think there should be no problem.
Well, I found this page, which is quite helpful:
<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
Especially this passage:
>Chinese words are usually made from one to four characters; most are
>made from two characters.
>
>Because there are no reliable indications of word boundaries in most
>Chinese text, when you search you cannot use the speed up of
>skipping to the next word when a match fails. You have to just do a
>linear search.
>
>Because of the large number of characters, there are many ways of
>indexing text which you would not attempt in English. The most
>straightforward is to say "one character == one word", and to index
>every character (i.e., a fully permuted index.) You would never do
>this in English: imagine an index that found every occurrence of the
>letter "e" in a document!
>
>You can also index on bigrams (two consecutive characters),
>trigrams, or even 4-grams. I read somewhere that anything more than
>4-grams is not much use. The cost of these is, unfortunately, that
>because you don't know the word boundaries, you will get lots of
>spurious n-grams. On the other hand, because word boundaries are
>fairly subjective in Chinese (like in English: some people hyphenate,
>some don't) it is probably good to err on the side of having too
>many n-grams anyway.
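To make the n-gram idea concrete, here is a rough sketch in Python (not Frontier code, and not how the Frontier search engine works). The function names and the two sample documents are made up for illustration. It cuts each text into overlapping bigrams and records which documents contain each one.

```python
def char_ngrams(text, n=2):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


# Two tiny sample documents (hypothetical data)
docs = {1: "中文搜索", 2: "搜索引擎"}
index = build_index(docs)

print(char_ngrams("中文搜索"))  # includes the spurious bigram "文搜"
print(index["搜索"])
```

The middle bigram of the first document straddles two words, which is the kind of spurious n-gram the quoted passage warns about.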
What I need is a list of these bigrams in Simplified Chinese.
I found such a list in a Perl script called "codelib.pl", already
rolled into a hash (life is good).
<http://www.mandarintools.com/download/>
There is another script on this page called "segment" that splits
Chinese text into "words", but I can't get it to work on OS X from the
terminal.
Anyway, I was thinking that I could use this hash to build a
Frontier table of all the bigrams for the string.multipleReplaceAll
verb, with each bigram as the search string and the same bigram
surrounded by spaces as the replacement string.
I understand that this can lead to errors, and I'm not sure how
string.multipleReplaceAll works because I have not yet tested this
verb. I might even use a regex verb for this instead. But since it
seems impossible to be accurate when parsing a Chinese text into
its "words", I guess this is the best way to go for now.
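Here is a minimal sketch of the "surround known bigrams with spaces" idea, in Python rather than UserTalk. The three-entry bigram set is a hypothetical stand-in for the table that would be built from the codelib.pl hash, and the function name is made up. It does one left-to-right pass instead of a chain of replace-all calls, so text that has already been matched is never matched again.

```python
# Hypothetical sample; the real table would come from the codelib.pl hash
KNOWN_BIGRAMS = {"中文", "搜索", "引擎"}


def space_out_bigrams(text, bigrams=KNOWN_BIGRAMS):
    """Surround every known bigram with spaces, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in bigrams:
            out.append(" " + pair + " ")
            i += 2  # skip past the bigram so it cannot overlap the next match
        else:
            out.append(text[i])
            i += 1
    # Collapse the doubled spaces left between adjacent bigrams
    return " ".join("".join(out).split())


print(space_out_bigrams("中文搜索引擎"))
print(space_out_bigrams("我用中文搜索"))
```

Characters that are not part of any known bigram stay glued together, so the result is only as good as the bigram table.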
So once the Chinese text has been parsed and every
bigram has been surrounded by spaces, I could hack the indexing
routine (I don't remember right now where it is) so that it does not
look at punctuation, quotation marks and the like to decide whether a
character is part of a word. If hacking the subroutine is too
difficult, I could write my own script to output the format that the
index root needs.
I guess I'll also have to hack the routine that sends the search
string to the search engine, so that it splits the string into bigrams
and allows characters that are usually not considered part of a word.
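The search side could look something like the sketch below, again in Python and only under assumptions: the index is taken to be a plain mapping from bigram to a set of page ids, which is not necessarily what the index root stores, and all names and data are hypothetical. The query is cut into overlapping bigrams and a page matches only if it contains all of them.

```python
def query_bigrams(query):
    """Split a query into overlapping bigrams, ignoring whitespace."""
    chars = "".join(query.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


def search(index, query):
    """Return the ids of pages that contain every bigram of the query."""
    result = None
    for gram in query_bigrams(query):
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()


# Hypothetical index: bigram -> ids of the pages that contain it
index = {"中文": {1}, "搜索": {1, 2}, "索引": {2}, "引擎": {2}}

print(search(index, "搜索引擎"))
print(search(index, "搜索"))
```

Requiring every bigram keeps the spurious ones from producing false hits, at the cost of missing pages where the query words are not adjacent.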
Anyway, I'll report here if I have some luck doing this.
>As to the "compatibility" between ISO-8859-1 and MacRoman (and GB2312), I
>think there should be no problem. Because:
>GB2312 uses (I think):
>first byte ASCII decimal 161-254
>second byte ASCII decimal 161-254
>MacRoman uses all the range between 128-255 (except 202?)
>And ISO-8859-1 uses 160-255.
>
>As you see, the range used by GB2312 is included in the range used by
>ISO-8859-1.
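A quick way to check the byte ranges is sketched below in Python. The sample string and the helper name are arbitrary choices for illustration. It encodes some Simplified Chinese text as GB2312, tests that every byte falls in 161-254, and confirms that passing the bytes through ISO-8859-1 unchanged does not damage them.

```python
def fits_gb2312_range(data):
    """True if every byte lies in 161-254, the range quoted above."""
    return all(161 <= b <= 254 for b in data)


sample = "简体中文"              # arbitrary sample string
raw = sample.encode("gb2312")    # two bytes per Chinese character

print(list(raw))
print(fits_gb2312_range(raw))

# Reading the bytes as ISO-8859-1 and writing them back changes nothing,
# because ISO-8859-1 maps each byte value to exactly one character.
as_latin1 = raw.decode("latin-1")
print(as_latin1.encode("latin-1") == raw)
print(as_latin1.encode("latin-1").decode("gb2312") == sample)
```

This only shows that the bytes survive when nothing remaps them. A MacRoman to ISO-8859-1 character conversion would change byte values, since the two encodings assign different characters to the high bytes.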
Yes, this is what I think is happening. Thanks for the input, Nobumi.
Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
Replies
RE: Simplified Chinese (GB2312) in Manila
2/22/2002 by Emmanuel. M. Decarie
Anyone know this book: http://www.oreilly.com/catalog/cjkvinfo/ It look like what I need but its a little bit ancient (published
RE: Simplified Chinese (GB2312) in Manila
2/22/2002 by Nobumi Iyanaga
Hello Emmanuel, > >Well I found this page which is quite helpful: ><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>