A community of Frontier
and Radio Users


Regex Project

RE: Simplified Chinese (GB2312) in Manila

Posted: 2/22/2002; 11:24 PM by Nobumi Iyanaga
Last Modified: 2/22/2002; 11:24 PM by Nobumi Iyanaga
In Response To: RE: Simplified Chinese (GB2312) in Manila (#16142)

Hello Emmanuel,

>
>Well I found this page which is quite helpful:
><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
>
>Especially this paragraph:
>
> >Chinese words are usually made from one to four characters; most are
> >made from two characters.
> >
> >Because there are no reliable indications of word boundaries in most
> >Chinese text, when you search you cannot use the speed up of
> >skipping to the next word when a match fails. You have to just do a
> >linear search.
> >
> >Because of the large number of characters, there are many ways of
> >indexing text which you would not attempt in English. The most
> >straightforward is to say "one character == one word", and to index
> >every character (i.e., a fully permuted index.) You would never do
> >this in English: imagine an index that found every occurrence of the
> >letter "e" in a document!
> >
> >You can also index on bigrams (two consecutive characters),
> >trigrams, or even 4-grams. I read somewhere that anything more than
> >4-grams is not much use. The cost of these is, unfortunately, that
> >because you don't know the word boundaries, you will get lots of
> >spurious n-grams. On the other hand, because word boundaries are
> >fairly subjective in Chinese (like in English: some people hyphenate,
> >some don't) it is probably good to err on the side of having too
> >many n-grams anyway.

All this is very true. Indexing double-byte text is a most difficult task.
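
To make the bigram idea from the quoted FAQ concrete, here is a minimal sketch in Python (not Frontier). It assumes the text has already been decoded to Unicode characters; the names build_bigram_index and find, and the sample sentence, are made up for illustration.

```python
from collections import defaultdict

def build_bigram_index(text):
    """Map every pair of consecutive characters to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i)
    return index

def find(index, text, query):
    """Return the start positions of query, using the bigram index when possible."""
    if len(query) < 2:
        # A single character cannot be looked up in a bigram index: scan instead.
        return [i for i, ch in enumerate(text) if ch == query]
    # Candidate positions come from the first bigram; verify the rest of the query.
    candidates = index.get(query[:2], [])
    return [i for i in candidates if text.startswith(query, i)]

# Hypothetical sample text, chosen only for the example.
text = "中文信息处理和中文检索"
index = build_bigram_index(text)

print(find(index, text, "中文"))      # [0, 7]
print(find(index, text, "中文检索"))  # [7]
print("文信" in index)                # True: a spurious bigram that is not a word
```

The last line shows the cost the FAQ mentions: the index also contains pairs that straddle a word boundary.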

>What I need is a list of these bigrams in Simplified Chinese.
>
>I found such a list in a Perl script called "codelib.pl", already
>rolled into a hash (life is good).
><http://www.mandarintools.com/download/>
>
>There is another script on this page called "segment" that splits
>Chinese text into "words", but I can't get it to work on OS X from the
>terminal.

Thanks to your pointer, I downloaded these scripts. It seems that
codelib.pl does not contain a hash of Chinese words; rather, it is a
program that attempts to guess the encoding of double-byte text. It looks
like a very good, if complicated, script, but I am afraid it will not help
with your purpose. By the way, the same thing can be done (probably
with less accuracy...?) with the command "snif text encoding" of TEC OSAX...

On the other hand, the script segment.pl and the other files in the same
folder attempt to extract words from Simplified Chinese text. The package
contains a list of 119804 (!) Chinese words (what a piece of work...!),
which are not only bigrams but also trigrams, 4-grams, etc. Even if you
cannot run the Perl script itself (I didn't try it...), you could load this
word list into Frontier and use it to "segment" a given Chinese text. But I
think the task would not be easy, and the result may not be very
satisfactory. Even with this list of more than 100000 words, I think the
"lexicon" will not be enough as soon as your texts contain technical or
specialized words. Anyway, if you work in this direction, I would recommend
matching the longest words first, then shorter ones. For example, if you
have a sequence like "book editing", you would not separate it into "book"
and "editing", but keep "book editing" as a unit...
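
The "longest words first" idea can be sketched as follows. This is only an illustration in Python, not the algorithm used by segment.pl (which I have not examined); the function name segment_longest_first, the max_len limit of 4 characters, and the four-word toy lexicon are all assumptions standing in for the real word list.

```python
def segment_longest_first(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation that prefers the longest lexicon match.

    Characters that start no known word are emitted as one-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Toy lexicon (hypothetical): "book", "editing", "book editing", "work".
lexicon = {"图书", "编辑", "图书编辑", "工作"}

print(segment_longest_first("图书编辑工作", lexicon))
# ['图书编辑', '工作']
```

Because the four-character entry is tried before the two-character ones, "book editing" stays together instead of being split into "book" and "editing".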

Another way would be to do a linear search, as the text you quoted says.
If the texts to be searched are not very long, I guess this would be the
simplest way. With GB text, I think you will be able to use any of
Frontier's string verbs, and even regex search. You may also use Mgrep OSAX
for a more reliable search of Chinese text...
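
One possible reason a character-aware search can be more reliable on GB text: in GB2312 both bytes of a character fall in the same range, so a byte-by-byte search can report a match that straddles two adjacent characters. The Python sketch below illustrates this under that assumption; it says nothing about how Frontier's string verbs or Mgrep OSAX actually work, and the function names are hypothetical.

```python
def byte_search(haystack: bytes, needle: bytes) -> bool:
    """Naive linear search over raw bytes, ignoring character boundaries."""
    return needle in haystack

def char_search(haystack: bytes, needle: bytes, encoding: str = "gb2312") -> bool:
    """Decode both sides first, so only whole characters can match."""
    return needle.decode(encoding) in haystack.decode(encoding)

# Two copies of the character encoded as B0 A1 in GB2312.
haystack = b"\xb0\xa1\xb0\xa1"
# A different character (a punctuation mark from row 1), encoded as A1 B0.
needle = b"\xa1\xb0"

print(byte_search(haystack, needle))  # True: second byte of one char + first byte of the next
print(char_search(haystack, needle))  # False: the needle character is not in the text
```

For short texts, decoding first and then searching linearly costs little and avoids this kind of false match.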

Best regards,

Nobumi Iyanaga
Tokyo,
Japan


Replies







RE: Simplified Chinese (GB2312) in Manila
2/23/2002 by Emmanuel. M. Decarie
Hello Nobumi, At 22:24 -0500 on 22/02/02, Nobumi Iyanaga wrote: >Read on the web at http://community.scriptmeridian.org/16164
 





RE: Simplified Chinese (GB2312) in Manila
2/23/2002 by Emmanuel. M. Decarie
Thinking more about this, and rereading this page: <http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>, I think
 




