RE: Simplified Chinese (GB2312) in Manila
Posted: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
Last Modified: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
In Response To: RE: Simplified Chinese (GB2312) in Manila (#16164)
Hello Nobumi,
At 22:24 -0500 22/02/02, Nobumi Iyanaga wrote:
>Read on the web at http://community.scriptmeridian.org/16164
>----------------------------------
>
>Hello Emmanuel,
>
>>
>>Well I found this page which is quite helpful:
>><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
>>
>>Especially this paragraph:
>>
>> >Chinese words are usually made from one to four characters; most are
>> >made from two characters.
>> >
>> >Because there are no reliable indications of word boundaries in most
>> >Chinese text, when you search you cannot use the speed up of
>> >skipping to the next word when a match fails. You have to just do a
>> >linear search.
>> >
>> >Because of the large number of characters, there are many ways of
>> >indexing text which you would not attempt in English. The most
>> >straightforward is to say "one character == one word", and to index
>> >every character (i.e., a fully permuted index.) You would never do
>> >this in English: imagine an index that found every occurrence of the
>> >letter "e" in a document!
>> >
>> >You can also index on bigrams (two consecutive characters),
>> >trigrams, or even 4-grams. I read somewhere that anything more than
>> >4-grams is not much use. The cost of these is, unfortunately, that
>> >because you don't know the word boundaries, you will get lots of
>> >spurious n-grams. On the other hand, because word boundaries are
>> >fairly subjective in Chinese (like in English: some people hyphenate,
>> >some don't) it is probably good to err on the side of having too
>> >many n-grams anyway.
>
>All this is very true. Indexing double-byte text is a most difficult task.
>
>>What I need is a list of these bigrams in Simplified Chinese.
>>
>>I found such a list in a Perl script called "codelib.pl", already
>>rolled into a hash (life is good).
>><http://www.mandarintools.com/download/>
>>
>>There is another script on this page called "segment" that splits
>>Chinese text into "words", but I can't get it to work on OS X from the
>>terminal.
>
>I downloaded these scripts thanks to your info. It seems that the script
>codelib.pl does not contain a hash with a Chinese word list -- it is rather a
>program that attempts to guess the encoding of double-byte text.
Yes, you are right. I wrongly assumed that Simplified Chinese was an
adaptation of Cantonese for the Web. But I have read more on the
subject, and I now know that Simplified Chinese came from a script
reform under Mao that started in the 1950s (I think), and that the
GB2312 character set itself contains 6,763 ideograms.
Also, I wrongly assumed that 2 chars == 1 word in Simplified Chinese,
when in fact one word in Simplified Chinese can be made up of more
than one ideogram. See, for example, the word "Internet" on this
page: <http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
there are compound words, close to what we call a "neologism" in
English and French.
>On the other hand, the script segment.pl and other files contained in the
>same folder will attempt to extract words from a Simplified Chinese text.
>It contains a list of 119804 (!) Chinese words (what a work...!) (which are
>not only bigrams, but tri-grams, quadri-grams, etc.). Even if you cannot
>run the Perl script itself (I didn't try it...), you could load this word
>list into Frontier, and use it to "segment" a given Chinese text. But I
>think the task would not be easy -- and the result may not be very
>satisfactory.
Yes, this looks like a difficult task. Frontier would have to load this
huge table with 119,000+ items in memory and go through all the
items to see if they match. I did some tests (without success)
with Perl on OS X, and the script still needs a couple of seconds to
finish its run on 3 bigrams. I have to test this again, but it doesn't
look very promising.
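Thinking about it a bit more, the lookup may not need to walk the whole
list: if the words sit in a hash (a table in Frontier, a hash in Perl),
each test is a single lookup. Here is a rough sketch of that idea,
written in Python rather than Perl or Frontier. The greedy
longest-match strategy, the 4-character cap, the name segment() and the
four-word toy list are all my own assumptions for illustration; I am
not claiming this is how segment.pl works.

```python
def segment(text, words, max_len=4):
    """Greedy longest-match segmentation against a set of known words.

    `words` is any set-like container; membership tests on a Python set
    are hash lookups, so the size of the word list does not matter much.
    """
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        # A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in words:
                result.append(piece)
                i += size
                break
    return result


# Toy word list standing in for the real 119,804-entry list (hypothetical).
WORDS = {"中文", "文本", "搜索", "互联网"}

print(segment("中文文本搜索", WORDS))
```

With this approach each position in the text costs at most 4 lookups,
whatever the size of the list. Loading the list is a separate, one-time
cost, and greedy matching can still pick the wrong split when two
readings overlap.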
(snip)
>Another way would be to do a linear search, as the text you quoted says.
>If your texts to be searched are not very long, I guess this would be the
>simplest way. With GB text, I think you will be able to use any Frontier
>string verbs, and even regex search. You may use also Mgrep OSAX to do
>more reliable search with Chinese text...
I don't think my text will be very long. But I'm not sure about this
"linear search" thing. Does it imply that I need to build some sort of
binary tree? Could you please provide some examples?
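My guess, which may well be wrong, is that a linear search needs no
tree or index at all: you simply scan the text from start to end for
the query string. A minimal sketch in Python, where find_all() and the
sample strings are hypothetical names and data of my own:

```python
def find_all(text, query):
    """Return every character offset at which query occurs in text."""
    hits = []
    start = text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits


print(find_all("搜索中文文本，中文搜索", "中文"))

# Possible pitfall when searching raw GB2312 bytes instead of characters:
# lead and trail bytes share the same range, so a query can match across
# the boundary between two ideograms.
raw = "中文".encode("gb2312")       # four bytes, two per ideogram
text = raw.decode("gb2312")
straddle = raw[1:3]                 # 2nd byte of 中 + 1st byte of 文

print(straddle in raw)                     # byte-level search finds a match
print(straddle.decode("gb2312") in text)   # character-level search does not
```

If that is right, it might explain why a tool that knows about
double-byte characters is more reliable than plain string verbs working
on bytes, but that is only my assumption.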
It looks as though I only need to index single chars rather than
bigrams. Is that right? If that's the case, how accurate would the
search engine be?
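To make the question concrete, here is how I picture such an index, as
a sketch in Python. With n=1 it indexes every character, with n=2 every
bigram. The names build_index() and search(), the idea of storing
offsets, and the two sample documents are my own assumptions, not
something taken from the FAQ.

```python
from collections import defaultdict


def build_index(docs, n=1):
    """Map each n-character sequence to the (doc_id, offset) pairs where it starts."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add((doc_id, i))
    return index


def search(index, query, n=1):
    """Return ids of documents containing query as a contiguous string."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return set()   # query shorter than n: this index cannot answer it
    hits = set()
    for doc_id, pos in index.get(grams[0], ()):
        # Every following n-gram must start exactly one position later.
        if all((doc_id, pos + k) in index.get(g, ()) for k, g in enumerate(grams)):
            hits.add(doc_id)
    return hits


# Hypothetical sample documents.
DOCS = {"a": "中文搜索", "b": "文中搜"}
CHAR_INDEX = build_index(DOCS, n=1)
BIGRAM_INDEX = build_index(DOCS, n=2)

print(search(CHAR_INDEX, "中文"))
print(search(BIGRAM_INDEX, "搜索", n=2))
```

Because the offsets are stored, a query only matches when its
characters are adjacent and in order, so the result should be as exact
as a linear search. Without offsets, document "b" would also match
"中文", since it contains both characters.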
Thanks again Nobumi for your input.
Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>