A community of Frontier
and Radio Users


Simplified Chinese (GB2312) in Manila


Simplified Chinese (GB2312) in Manila

Date Posted: 2/20/2002; 11:45 PM by Emmanuel. M. Decarie
Date Modified: 2/20/2002; 11:45 PM by Emmanuel. M. Decarie
In Response To: Top of Thread.
I'm too excited to keep this to myself. It has to be tested further,
with other browsers and on other platforms, and with people who can
write Chinese, but since this could be useful for other double-byte
languages, here is a way to create a story in Simplified Chinese
(GB2312) in a Manila site. Thanks to Keola Donaghy, who gave me
judicious advice.

Here's what I did. The server runs Frontier 7.0.1 on Mac OS 9.0.4.

(A) Paste
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
in Prefs->Advanced->Template in the header of the template.

(B) Set MyManilaWebsite.prefs.isoFilter to false.

(C) On the Mac, at MyManilaWebsite.["#filters"].finalFilter, add this
line to the filter:
pta^.renderedText = string.macToLatin (pta^.renderedText)

See <http://scriptdigital.com/divers/simplifiedchinese.html> for
proof of the concept.
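
As a rough illustration of why the charset declaration in step (A) matters, here is a Python sketch (not something Frontier runs). GB2312 text is a stream of high-bit bytes, and the declared charset decides how a browser reads them. The sample string and the helper name render_as are assumptions made for the example.

```python
# Sketch: the same page bytes read under two different charset declarations.
SAMPLE = "中文"  # illustrative sample text, not taken from the original post

def render_as(raw: bytes, charset: str) -> str:
    """Decode raw page bytes the way a browser would for a declared charset."""
    return raw.decode(charset)

raw = SAMPLE.encode("gb2312")        # two bytes per Chinese character
print(raw.hex(" "))                  # every byte has the high bit set
print(render_as(raw, "gb2312"))      # round-trips to the original text
print(render_as(raw, "iso-8859-1"))  # same bytes shown as accented Latin letters
```

Without the meta tag, a browser that falls back to a Latin charset would show the last form.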

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/21/2002; 4:29 AM by Nobumi Iyanaga
Date Modified: 2/21/2002; 4:29 AM by Nobumi Iyanaga
In Response To: 16120
Hello Emmanuel,

I looked at your Chinese page. It seems very good. I think Simplified Chinese is rather easy to handle, because it is an EUC encoding, using only "higher ASCII" characters. What would be much harder is a mixed text with Simplified Chinese and some language using "higher ASCII", like French.

Another more complicated problem is rendering Japanese or Traditional Chinese with Frontier. I worked very hard for that some years ago, at the time of Frontier 4.2.3 (I am still using it...). But I don't think this would work with more recent versions of Frontier. And I have never tried Radio...

Anyway, it seems that Frontier is not yet Unicode savvy... Sigh...!

And Phil, are you still interested in rendering Japanese text with Frontier?

Best regards,

Nobumi Iyanaga Tokyo, Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/21/2002; 9:27 AM by Jonathan Lewis
Date Modified: 2/21/2002; 9:27 AM by Jonathan Lewis
In Response To: 16121
Dear Nobumi and Emmanuel,

As you can see from http://133.14.174.35/transTest3/ it is possible to render Japanese and other languages in UTF-8 from Manila. (That site is very buggy at the moment. I have a much updated, but still by no means perfect version which I will try to install soon. You are welcome to play around if you like. But as far as enabling of multi-script writing is concerned, I have done nothing more than Emmanuel did on his site; I just set the page encoding to UTF-8 rather than Chinese).

In my experience the problem with writing mixed text in Manila messages is not a Frontier/Manila problem but a browser problem. If the page with the input form is encoded as UTF-8 then all the code that reaches Frontier is UTF-8. But on my machine at least (Mac OS 10.1, IE 5.1, and previous versions of Mac OS), it's not possible to write e.g. German with umlauts and Japanese together on the same HTML form. Or at least you can write them, but one script or the other is illegible. Nevertheless, even though some of the text is illegible, it is, it seems, all being sent to, stored in, and then displayed by Frontier correctly. So for example just now I wrote a message in the input form on the above site, and German umlauts displayed fine but Japanese text was illegible. Then after posting the message Manila displayed it; this time the Japanese text was fine but the German was illegible. Then when I clicked Edit this Page the German was fine but the Japanese was illegible. I posted the changes, and now the German is illegible but the Japanese is fine.

No doubt I am failing to make some elementary adjustments to my browser's font settings to make it display mixed text correctly. (Perhaps I should get hold of Arial Unicode?) But my point here is that this seems to be a browser display problem and not a Frontier/Manila problem. The text itself is surviving the posting/displaying/editing process.

I am not arguing that Frontier has no problems handling Unicode text, simply that just having Japanese, or indeed mixed scripts in Manila messages is not by itself impossible.

Best wishes,

Jonathan Lewis

Tokyo Denki University

(From April 1st: Hitotsubashi University)

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/21/2002; 9:47 AM by Emmanuel. M. Decarie
Date Modified: 2/21/2002; 9:47 AM by Emmanuel. M. Decarie
In Response To: 16121
Hi Nobumi, I'm happy to hear from you again.

>I looked at your Chinese page. It seems very good. I think the
>Simplified Chinese is rather easy to handle, because it is a EUC
>encoding, using all "higher ASCII" characters.

I didn't know what EUC encoding was, but I found this page:
<http://cns-web.bu.edu/pub/djohnson/web_files/i18n/euc.html>

>What would be much harder is a mixed text with Simplified Chinese
>and some language using "higher ASCII", like French.

Yes, it's simply not working. See my note at the bottom of the page:
<http://scriptdigital.com/divers/simplifiedchinese.html>

In my case, that's ok for the client.

>Another more complicated problem is rendering Japanese or
>Traditional Chinese with Frontier.

Well, I'm in luck for this project since I need only Simplified
Chinese to work.

Now the client wants me to check whether he could use the search engine.

In theory, it should work somehow (with some patching I guess), but
I'm not sure. I need to do further testing. I think I need to send
ISO-8859-1 strings to the indexing routine instead of MacRoman (the
server is on Mac OS 9.0.4).

Does anyone have an idea on this?

TIA

Cheers
-Emmanuel

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/21/2002; 11:41 AM by Emmanuel. M. Decarie
Date Modified: 2/21/2002; 11:41 AM by Emmanuel. M. Decarie
In Response To: 16122
Jonathan, that's very interesting. Thanks for the input.

>Read on the web at http://community.scriptmeridian.org/16122
>----------------------------------
>
>Dear Nobumi and Emmanuel,
>
>As you can see from http://133.14.174.35/transTest3/ it is possible
>to render Japanese and other languages in UTF-8 from Manila. (That
>site is very buggy at the moment. I have a much updated, but still
>by no means perfect version which I will try to install soon. You
>are welcome to play around if you like. But as far as enabling of
>multi-script writing is concerned, I have done nothing more than
>Emmanuel did on his site; I just set the page encoding to UTF-8
>rather than Chinese).
>
>In my experience the problem with writing mixed text in Manila
>messages is not a Frontier/Manila problem but a browser problem. If
>the page with the input form is encoded as UTF-8 then all the code
>that reaches Frontier is UTF-8. But on my machine at least (Mac
>OS10.1, IE 5.1, and previous versions of Mac0S), it's not possible
>to write e.g. German with umlauts and Japanese together on the same
>HTML form. Or at least you can write them but one script or the
>other is illegible. Nevertheless, even though some of the text is
>illegible it is, it seems, all being sent to and stored in, and then
>displayed by Frontier correctly. So for example just now I wrote a
>message in the input form on the above site, and German umlauts
>displayed fine but Japanese text was illegible. Then after posting
>the message Manila displayed it; this time the Japanese text was
>fine but the German was illegible. Then when I click Edit this Page
>the German is fine but the Japanese is illegible. I post the
>changes, now the German is illegible but the Japanese is fine.
>
>No doubt I am failing to make some elementary adjustments to my
>browser's font settings to make it display mixed text correctly.
>(Perhaps I should get hold of Arial Unicode?) But my point here is
>that this seems to be a browser display problem and not a
>Frontier/Manila problem. The text itself is surviving the
>posting/displaying/editing process.
>
>I am not arguing that Frontier has no problems handling Unicode
>text, simply that just having Japanese, or indeed mixed scripts in
>Manila messages is not by itself impossible.
>
>Best wishes,
>
>Jonathan Lewis
>
>Tokyo Denki University
>
>(From April 1st: Hitotsubashi University)



RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 8:58 AM by Nobumi Iyanaga
Date Modified: 2/22/2002; 8:58 AM by Nobumi Iyanaga
In Response To: 16122
Hello Jonathan,

>Read on the web at http://community.scriptmeridian.org/16122
>----------------------------------
>
>Dear Nobumi and Emmanuel,
>
>As you can see from http://133.14.174.35/transTest3/ it is possible to
>render Japanese and other languages in UTF-8 from Manila. (That site is
>very buggy at the moment. I have a much updated, but still by no means
>perfect version which I will try to install soon. You are welcome to play
>around if you like. But as far as enabling of multi-script writing is
>concerned, I have done nothing more than Emmanuel did on his site; I just
>set the page encoding to UTF-8 rather than Chinese).
>
>In my experience the problem with writing mixed text in Manila messages is
>not a Frontier/Manila problem but a browser problem. If the page with the
>input form is encoded as UTF-8 then all the code that reaches Frontier is
>UTF-8. But on my machine at least (Mac OS10.1, IE 5.1, and previous
>versions of Mac0S), it's not possible to write e.g. German with umlauts
>and Japanese together on the same HTML form.

Ah, I understand. As I am on a very old environment (OS 7.6.1; Frontier
4.2.3...; I don't even know what Manila is in Frontier...), I cannot do any
testing myself, but if you use OS 10.1, I think you should try OmniWeb. I
guess IE and Netscape are not Cocoa applications and so don't handle
Unicode properly (at least not in writing; in displaying, they can simulate
it more or less...). Installing Arial Unicode would probably not fix the
problem with IE or Netscape.

I think Windows is much better in this regard. You can write mixed
multi-script text in HTML forms.

>
>No doubt I am failing to make some elementary adjustments to my browser's
>font settings to make it display mixed text correctly. (Perhaps I should
>get hold of Arial Unicode?) But my point here is that this seems to be a
>browser display problem and not a Frontier/Manila problem. The text itself
>is surviving the posting/displaying/editing process.

How do you render UTF-8 text in Frontier...? As Frontier is not Unicode
savvy, I think all text in UTF-8 is garbled in Frontier's windows...? Or do
you convert the text from legacy codes to UTF-8 on the fly?

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 8:58 AM by Nobumi Iyanaga
Date Modified: 2/22/2002; 8:58 AM by Nobumi Iyanaga
In Response To: 16123
Hello Emmanuel,

>
>Now the client want me to check if he could use the search engine.
>
>In theory, it should work somehow (with some patching I guess), but
>I'm not sure. I need to make further testing. I think I need to send
>to the indexing routine ISO-8859-1 strings instead of MacRoman (the
>server is on Mac OS 9.0.4).
>

I don't know at all how the search engine in Frontier works, so I can only
guess. But I think there should be no problem.

As to the "compatibility" between ISO-8859-1 and MacRoman (and GB2312), I
think there should be no problem. Because:
GB2312 uses (I think):
first byte ASCII decimal 161-254
second byte ASCII decimal 161-254
MacRoman uses all the range between 128-255 (except 202?)
And ISO-8859-1 uses 160-255.

As you see, the range used by GB2312 is included in the range used by
ISO-8859-1.
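
The byte ranges above can be checked with a short Python sketch. It relies on Python's built-in gb2312 codec; the sample string and the function names are assumptions chosen for illustration, and it only tests the sample given, not the whole character set.

```python
# Sketch: inspect which byte values GB2312 uses for a piece of Chinese text.
def gb2312_byte_range(text):
    """Return the (lowest, highest) byte value of text encoded as GB2312."""
    raw = text.encode("gb2312")
    return min(raw), max(raw)

def fits_latin1_high_range(text):
    """True if every GB2312 byte of text lies in 160-255."""
    return all(160 <= b <= 255 for b in text.encode("gb2312"))

# Illustrative sample: a few common hanzi plus full-width punctuation.
sample = "啊中文汉字，。"
low, high = gb2312_byte_range(sample)
print(low, high)
print(fits_latin1_high_range(sample))
```

Plain ASCII mixed into the text would still encode as single bytes below 128, so the check only holds for the double-byte characters themselves.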

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 10:10 AM by Emmanuel. M. Decarie
Date Modified: 2/22/2002; 10:10 AM by Emmanuel. M. Decarie
In Response To: 16141
>Read on the web at http://community.scriptmeridian.org/16141
>----------------------------------
>
>Hello Emmanuel,
>
>>
>>Now the client want me to check if he could use the search engine.
>>
>>In theory, it should work somehow (with some patching I guess), but
>>I'm not sure. I need to make further testing. I think I need to send
>>to the indexing routine ISO-8859-1 strings instead of MacRoman (the
>>server is on Mac OS 9.0.4).
>>
>
>I don't know at all how the search engine in Frontier works, so I can only
>guess. But I think there should be no problem.

Well I found this page which is quite helpful:
<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>

Especially this paragraph:

>Chinese words are usually made from one to four characters; most are
>made from two characters.
>
>Because there are no reliable indications of word boundaries in most
>Chinese text, when you search you cannot use the speed up of
>skipping to the next word when a match fails. You have to just do a
>linear search.
>
>Because of the large number of characters, there are many ways of
>indexing text which you would not attempt in English. The most
>straightforward is to say "one character == one word", and to index
>every character (i.e., a fully permuted index.) You would never do
>this in English: imagine an index that found every occurance of the
>letter "e" in a document!
>
>You can also index on bigrams (two consecutive characters),
>trigrams, or even 4-grams. I read somewhere that anything more than
>4-grams is not much use. The cost of these is, unfortuately, that
>because you don't know the word boundaries, you will get lots of
>spurious n-grams. On the other hand, because word boundaries are
>fairly subjective in Chinese (like in English: some people hypenate,
>some don't) it is probably good to err on the side of having too
>many n-grams anyway.
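
The bigram indexing described in the quoted passage can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample documents and the function names are invented, and the text is assumed to be already decoded into characters rather than held as raw GB2312 bytes.

```python
from collections import defaultdict

def bigrams(text):
    """Return every overlapping pair of consecutive characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every bigram of the query."""
    grams = bigrams(query)
    if not grams:
        return set()  # single-character queries need a separate index
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Hypothetical sample documents.
docs = {1: "我喜欢中文网站", 2: "中文搜索很难", 3: "网站搜索"}
index = build_index(docs)
print(search(index, "中文"))
print(search(index, "网站搜索"))
```

A hit only means that all the query's bigrams occur somewhere in the document, so a final substring check on the candidates would be needed to rule out spurious matches.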

What I need is a list of these bigrams in Simplified Chinese.

I found such a list in a Perl script called "codelib.pl", already
rolled into a hash (life is good).
<http://www.mandarintools.com/download/>

There is another script on this page called "segment" that splits
Chinese text into "words", but I can't get it to work on OS X from the
terminal.

Anyway, I was thinking that I could use this hash to build a
Frontier table of all the bigrams for the string.multipleReplaceAll
verb, with each bigram as the search string and the same bigram
surrounded by spaces as the replacement string.

I understand that this can lead to errors. I'm not sure how
string.multipleReplaceAll works because I have not yet tested this
verb, and I might even use a regex verb instead. But since it
seems impossible to parse a Chinese text into its "words" accurately,
I guess this is the best way to go for now.

So once the Chinese text has been parsed and every bigram has been
surrounded by spaces, I could hack the indexing routine (I don't
remember now where it is) so that it does not rely on punctuation,
quote marks and so on to decide whether a character is part of a
word. If hacking the routine is too difficult, I could write my own
script to output the format that the index root needs.

I guess I'll also have to hack the routine that sends the search
string to the search engine, so that it splits the string into bigrams
and allows chars that are usually not considered part of a word.

Anyway, I'll report here if I have some luck doing this.


>As to the "compatibility" between ISO-8859-1 and MacRoman (and GB2312), I
>think there should be no problem. Because:
>GB2312 uses (I think):
>first byte ASCII decimal 161-254
>second byte ASCII decimal 161-254
>MacRoman uses all the range between 128-255 (except 202?)
>And ISO-8859-1 uses 160-255.
>
>As you seee, the range used by GB2312 is included in the range used by
>ISO-8859-1.

Yes, this is what I think is happening. Thanks for the input, Nobumi.

Cheers
-Emmanuel

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 12:45 PM by Samuel Reynolds
Date Modified: 2/22/2002; 12:45 PM by Samuel Reynolds
In Response To: 16140
>How do you render UTF-8 text in Frontier...? As Frontier is not Unicode
>savvy, I think all text in UTF-8 is garbled in Frontier's windows...? Or do
>you convert the text from legacy codes to UTF-8 on the fly?

UTF-8 is the 8-bit subset of unicode that corresponds to
the ISO-8859 (Latin-1) "extended ASCII" character set that
is the Windows default set. Character 0xnn in UTF-8 is
always character 0x00nn in unicode.

- Sam
- - - - - - - - - - - - - - - - - - - - - - - - - -
I'm currently looking for a new position/contract.
Resume at http://spinwardstars.com/vitae/
- - - - - - - - - - - - - - - - - - - - - - - - - -
_____________________________________________
Samuel Reynolds sam@spinwardstars.com
Spinward Stars: http://www.spinwardstars.com/

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 2:13 PM by Emmanuel. M. Decarie
Date Modified: 2/22/2002; 2:13 PM by Emmanuel. M. Decarie
In Response To: 16142
Does anyone know this book:
http://www.oreilly.com/catalog/cjkvinfo/

It looks like what I need, but it's a little bit ancient (published in 1998).

TIA

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 7:13 PM by Brian Andresen
Date Modified: 2/22/2002; 7:13 PM by Brian Andresen
In Response To: 16145
On 2/22/2002 9:45 AM, Samuel Reynolds <sam@spinwardstars.com> wrote:

>UTF-8 is the 8-bit subset of unicode that corresponds to
>the ISO-8859 (Latin-1) "extended ASCII" character set that
>is the Windows default set. Character 0xnn in UTF-8 is
>always character 0x00nn in unicode.

UTF-8 only preserves the ASCII character set. You may be thinking of
ISO-8859-1, in that Unicode chars U+0000 to U+00FF are exactly the
characters in ISO-8859-1. Win1252 is very similar to ISO-8859-1 but not
identical.

http://www.ietf.org/rfc/rfc2279.txt
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
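
The distinction can be seen directly with Python's standard codecs. This is a small sketch; the helper name encode_all and the choice of sample characters are assumptions for illustration.

```python
# Sketch: how one character is encoded under three different schemes.
def encode_all(ch):
    """Return the bytes of ch in ISO-8859-1, Windows-1252 and UTF-8."""
    return {
        "iso-8859-1": ch.encode("iso-8859-1"),
        "cp1252": ch.encode("cp1252"),
        "utf-8": ch.encode("utf-8"),
    }

print(encode_all("A"))   # ASCII: one identical byte in all three
print(encode_all("é"))   # U+00E9: one byte in Latin-1, two bytes in UTF-8

# ISO-8859-1 byte values equal the Unicode code points U+0000 to U+00FF.
print(ord(b"\xe9".decode("iso-8859-1")) == 0xE9)

# Windows-1252 differs from ISO-8859-1 in the 0x80-0x9F range.
print(b"\x80".decode("cp1252") == "\u20ac")  # the euro sign
```

So a byte 0xnn is character U+00nn only under ISO-8859-1, and under UTF-8 only when nn is below 0x80.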

-Brian

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 10:24 PM by Nobumi Iyanaga
Date Modified: 2/22/2002; 10:24 PM by Nobumi Iyanaga
In Response To: 16142
Hello Emmanuel,

>
>Well I found this page which is quite helpful:
><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
>
>Especially this paragraph:
>
> >Chinese words are usually made from one to four characters; most are
> >made from two characters.
> >
> >Because there are no reliable indications of word boundaries in most
> >Chinese text, when you search you cannot use the speed up of
> >skipping to the next word when a match fails. You have to just do a
> >linear search.
> >
> >Because of the large number of characters, there are many ways of
> >indexing text which you would not attempt in English. The most
> >straightforward is to say "one character == one word", and to index
> >every character (i.e., a fully permuted index.) You would never do
> >this in English: imagine an index that found every occurance of the
> >letter "e" in a document!
> >
> >You can also index on bigrams (two consecutive characters),
> >trigrams, or even 4-grams. I read somewhere that anything more than
> >4-grams is not much use. The cost of these is, unfortuately, that
> >because you don't know the word boundaries, you will get lots of
> >spurious n-grams. On the other hand, because word boundaries are
> >fairly subjective in Chinese (like in English: some people hypenate,
> >some don't) it is probably good to err on the side of having too
> >many n-grams anyway.

All this is very true. Indexing double-byte text is a most difficult task.

>What I need is a list of these bigrams in Simplified Chinese.
>
>I found such a list in a Perl script called "codelib.pl" already
>rolled in hash (life is good).
><http://www.mandarintools.com/download/>
>
>There is another script on this page called "segment" that split
>chinese text in "word" but I can't get it too work on OS X from the
>terminal.

I downloaded these scripts thanks to your info. It seems that the script
codelib.pl does not contain a hash of Chinese word list -- it is rather a
program that attempts to guess the encoding of double-byte text. It seems
to be a very good, complicated script, but I am afraid it would be of no
help for your purpose. By the way, the same thing can be done (probably
with less accuracy...?) with the command "snif text encoding" of TEC OSAX...

On the other hand, the script segment.pl and other files contained in the
same folder will attempt to extract words from a Simplified Chinese text.
It contains a list of 119804 (!) Chinese words (what a work...!) (which are
not only bigrams, but tri-grams, quadri-grams, etc.). Even if you cannot
run the Perl script itself (I didn't try it...), you could load this word
list into Frontier, and use it to "segment" a given Chinese text. But I
think the task would not be easy -- and the result may not be very
satisfactory. I think that even with this list of more than 100000 words,
the lexicon is not enough as soon as your texts contain some technical
or specialized words. Anyway, if you try to work in this direction, I
would recommend trying to match the longest words first, then shorter words.
Say that if you have a sequence like "book editing", you would not separate
them in "book" and "editing", but use "book editing" as a unit...

Another way would be to do a linear search, as the text you quoted says.
If your texts to be searched are not very long, I guess this would be the
simplest way. With GB text, I think you will be able to use any Frontier
string verbs, and even regex search. You may also use Mgrep OSAX to do
more reliable searches with Chinese text...
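
The "longest words first" idea can be sketched in Python as a greedy longest-match segmenter. This is a minimal sketch, not the algorithm of segment.pl: the tiny lexicon stands in for the real word list, and the function name and the max_len limit are assumptions.

```python
def segment(text, lexicon, max_len=4):
    """Split text greedily, preferring the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical four-entry lexicon standing in for the real word list.
lexicon = {"中文", "搜索", "引擎", "搜索引擎"}
print(segment("中文搜索引擎", lexicon))
print(segment("中文很难", lexicon))
```

Keeping the lexicon in a set makes each lookup a hash probe, so the cost depends on the text length rather than on the size of the word list.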

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 11:03 PM by Nobumi Iyanaga
Date Modified: 2/22/2002; 11:03 PM by Nobumi Iyanaga
In Response To: 16153
Hello Emmanuel,

>Read on the web at http://community.scriptmeridian.org/16153
>----------------------------------
>
>Anyone know this book:
>http://www.oreilly.com/catalog/cjkvinfo/
>
>It look like what I need but its a little bit ancient (published in 1998).
>

I have this book. It IS certainly the best work on this area, and for your
need, I think it is not ancient at all.

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 11:03 PM by Nobumi Iyanaga
Date Modified: 2/22/2002; 11:03 PM by Nobumi Iyanaga
In Response To: 16162
Hello Samuel and Brian,

>Read on the web at http://community.scriptmeridian.org/16162
>----------------------------------
>
>On 2/22/2002 9:45 AM, Samuel Reynolds <sam@spinwardstars.com> wrote:
>
> >UTF-8 is the 8-bit subset of unicode that corresponds to
> >the ISO-8859 (Latin-1) "extended ASCII" character set that
> >is the Windows default set. Character 0xnn in UTF-8 is
> >always character 0x00nn in unicode.
>
>UTF-8 only preserves the ASCII character set. You may be thinking of
>ISO-8859-1, in that Unicode chars U+0000 to U+00FF are exactly the
>characters in ISO-8859-1. Win1252 is very similar to ISO-8859-1 but not
>identical.
>
>http://www.ietf.org/rfc/rfc2279.txt
>http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
>http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
>http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>

Yes, as far as ASCII characters are concerned, there is no problem. But
with accented or double-byte characters, all the text is garbled...

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
Date Modified: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
In Response To: 16164
Hello Nobumi,

At 22:24 -0500 22/02/02, Nobumi Iyanaga wrote:
>Read on the web at http://community.scriptmeridian.org/16164
>----------------------------------
>
>Hello Emmanuel,
>
>>
>>Well I found this page which is quite helpful:
> ><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
>>
>>Especially this paragraph:
>>
>> >Chinese words are usually made from one to four characters; most are
>> >made from two characters.
>> >
>> >Because there are no reliable indications of word boundaries in most
>> >Chinese text, when you search you cannot use the speed up of
>> >skipping to the next word when a match fails. You have to just do a
>> >linear search.
>> >
>> >Because of the large number of characters, there are many ways of
>> >indexing text which you would not attempt in English. The most
>> >straightforward is to say "one character == one word", and to index
>> >every character (i.e., a fully permuted index.) You would never do
>> >this in English: imagine an index that found every occurance of the
>> >letter "e" in a document!
>> >
>> >You can also index on bigrams (two consecutive characters),
>> >trigrams, or even 4-grams. I read somewhere that anything more than
>> >4-grams is not much use. The cost of these is, unfortuately, that
>> >because you don't know the word boundaries, you will get lots of
>> >spurious n-grams. On the other hand, because word boundaries are
>> >fairly subjective in Chinese (like in English: some people hypenate,
>> >some don't) it is probably good to err on the side of having too
>> >many n-grams anyway.
>
>All this is very true. Indexing double-byte text is a most difficult task.
>
>>What I need is a list of these bigrams in Simplified Chinese.
>>
>>I found such a list in a Perl script called "codelib.pl" already
>>rolled in hash (life is good).
>><http://www.mandarintools.com/download/>
>>
>>There is another script on this page called "segment" that split
>>chinese text in "word" but I can't get it too work on OS X from the
>>terminal.
>
>I downloaded these scripts thanks to your info. It seems that the script
>codelib.pl does not contain a hash of Chinese word list -- it is rather a
>program that attempts to guess the encoding of double-byte text.

Yes, you are right. I wrongly assumed that Simplified Chinese was an
adaptation of Cantonese for the Web. But I read more on the subject,
and I know now that Simplified Chinese came from a Mao reform in the
80s (I think) and that it contains around 17,000 ideograms.

Also, I had the wrong assumption that 2 chars == 1 word in Simplified
Chinese, when a word in Simplified Chinese can be made of more
than one ideogram. See for example the word "Internet" on this
page: <http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
there are composite words, close to what we call a "neologism" in
English and French.


>On the other hand, the script segment.pl and other files contained in the
>same folder will attempt to extract words from a Simplified Chinese text.
>It contains a list of 119804 (!) Chinese words (what a work...!) (which are
>not only bigrams, but tri-grams, quadri-grams, etc.). Even if you cannot
>run the perl script itself (I didn't try it...), you could load this word
>list in a Frontier, and use it to "segment" a given Chinese text. But I
>think the task would not be easy -- and the result may not be very
>satisfactory.

Yes, this looks like a difficult task. Frontier will have to load this
huge table with 119000+ items in memory and will need to go through all
the items to see if they match. I did some tests (but without success)
with Perl on OS X, and the script still needs a couple of seconds to
finish its run on 3 bigrams. I have to test this again, but it doesn't
look very promising.

(snip)


>Another way would be to do a linear search, as the text you quoted says.
>If your texts to be searched are not very long, I guess this would be the
>simplest way. With GB text, I think you will be able to use any Frontier
>string verbs, and even regex search. You may use also Mgrep OSAX to do
>more reliable search with Chinese text...

I don't think my text will be very long. But I'm not sure about this
"linear search" thing. Does it imply that I need to build a sort of
binary tree? Can you please provide some examples?

It looks like I only need to index chars, not bigrams. Is that
right? If that's the case, how accurate could the search engine be?
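
For what it's worth, "linear search" here most likely just means scanning the text from start to end for the query, with no tree or index involved. Below is a minimal Python sketch, assuming the text is held as raw GB2312 bytes; the function name find_aligned is hypothetical. It also shows one accuracy pitfall: a plain byte search can match across the boundary between two characters.

```python
def find_aligned(haystack: bytes, needle: bytes) -> int:
    """Linear scan of GB2312 bytes that only matches on character boundaries."""
    i = 0
    while i + len(needle) <= len(haystack):
        if haystack[i:i + len(needle)] == needle:
            return i
        # A byte >= 0xA1 starts a two-byte character; ASCII takes one byte.
        i += 2 if haystack[i] >= 0xA1 else 1
    return -1

text = "中文".encode("gb2312")       # bytes D6 D0 CE C4
straddle = text[1:3]                 # second byte of 中 plus first byte of 文
print(text.find(straddle))           # naive byte search reports a false hit
print(find_aligned(text, straddle))  # aligned scan reports no match
print(find_aligned(text, "文".encode("gb2312")))
```

Searching on decoded characters instead of bytes avoids the problem altogether, where the environment allows it.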

Thanks again Nobumi for your input.

Cheers
-Emmanuel

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
Date Modified: 2/22/2002; 11:44 PM by Emmanuel. M. Decarie
In Response To: 16165
>Read on the web at http://community.scriptmeridian.org/16165
>----------------------------------
>
>Hello Emmanuel,
>
>>Read on the web at http://community.scriptmeridian.org/16153
>>----------------------------------
>>
>>Anyone know this book:
>>http://www.oreilly.com/catalog/cjkvinfo/
>>
>>It look like what I need but its a little bit ancient (published in 1998).
>>
>
>I have this book. It IS certainly the best work on this area, and for your
>need, I think it is not ancient at all.

Thanks Nobumi. I went to the bookstore today and the book looks pretty
good for what I need to understand and implement (but it's a little
bit expensive). I think I will buy it soon.

Cheers
-Emmanuel

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/22/2002; 11:58 PM by Emmanuel. M. Decarie
Date Modified: 2/22/2002; 11:58 PM by Emmanuel. M. Decarie
In Response To: 16164
Thinking more about this, and rereading this page:
<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>, I think
I'll follow the advice from the author:

>I personally feel that the answer is neither dictionaries nor
>parsing but markup. When entering the data, there should either be
>markup of all proper nouns or markup of all word boundaries.
>
>The last is simplest: when typing in Chinese, put in a space between
>words! That system works well in the West, keyboards already have
>spacebars, and it is simple enough for people to do when they have
>the habit. The spaces should be ignored by the publication system:
>they should be treated as "zero-width spaces".

I could tell users that if they want their Chinese text to be
indexed, they need to separate the Chinese words with spaces. I think
that if I start from such text to implement indexing and searching,
it's going to be much simpler.
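
A minimal sketch of the space-delimited approach, written in Python rather than UserTalk, with made-up function names and sample text. It assumes the author typed a space between words: the index is built by splitting on those spaces, and the same spaces are dropped again before the text is published.

```python
from collections import defaultdict

def build_index(docs):
    """Map each space-delimited word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def for_display(text):
    """Drop the author's word-boundary spaces before publishing."""
    return "".join(text.split())

# Hypothetical stories, typed with a space between words.
docs = {
    "story1": "我 喜欢 互联网",
    "story2": "互联网 很 大",
}
index = build_index(docs)
print(sorted(index["互联网"]))        # ['story1', 'story2']
print(for_display(docs["story1"]))   # 我喜欢互联网
```

As written, for_display would also remove the spaces inside any embedded English or French, so a real filter would have to leave runs of Latin text alone.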

What do you think, Nobumi (or others interested in the discussion)?
Does it make sense to you?

Cheers
-Emmanuel



--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 1:58 AM by jt
Date Modified: 2/23/2002; 1:58 AM by jt
In Response To: 16170
I'm just catching up, and haven't followed all the links. But it seems like
you have a winnah, Emmanuel...!

It would require an adjustment from the writer, which I don't know if
they'll go for... But IF they want the text to be indexed and searched (and
I'd think they would in many cases), then I think they COULD make this
adjustment.

I don't know the culture well enough to imagine whether they WOULD, but
it should be easy enough to test on a small scale.


| -----Original Message-----
| From: sm.community@lists.scriptmeridian.org
| [mailto:sm.community@lists.scriptmeridian.org]On Behalf Of Emmanuel. M.
| Decarie
| Sent: Saturday, February 23, 2002 12:04 AM
| To: sm.community@lists.scriptmeridian.org
| Subject: RE: [SM] Simplified Chinese (GB2312) in Manila [Msg#16170]
|
|
| Read on the web at http://community.scriptmeridian.org/16170
| ----------------------------------

<snip>

| I could tell users that if they want their Chinese text to be
| indexed, they need to separate the Chinese words with spaces. I think
| that if I start from such text to implement indexing and searching,
| it's going to be much simpler.
|
| What do you think, Nobumi (or others interested in the discussion)?
| Does it make sense to you?
|
| Cheers
| -Emmanuel
|
| --
| ______________________________________________________________________
| Emmanuel Decarie / Programmation pour le Web - Programming for the Web
| Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
|
|
|

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 8:01 AM by Nobumi Iyanaga
Date Modified: 2/23/2002; 8:01 AM by Nobumi Iyanaga
In Response To: 16167
Hello Emmanuel,

>
>Also, I wrongly assumed that 2 chars == 1 word in Simplified
>Chinese, when in fact one word can be made up of more than one
>ideogram. See for example the word "Internet" on this page:
><http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
>there are compound words, close to what we call a "neologism" in
>English and French.

Defining a "word" in Chinese is difficult (although I don't know
modern Chinese at all...). Each character usually has a meaning of its own,
and combinations of several characters have meanings of their own too.
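
Since word boundaries are hard to pin down, one way to sidestep segmentation is to index overlapping two-character pairs (the bigrams raised elsewhere in this thread) and then confirm each candidate with a plain substring test. The sketch below is in Python rather than UserTalk; the function names and sample strings are made up for illustration and are not from the thread.

```python
from collections import defaultdict

def bigrams(text):
    """Overlapping character pairs; empty for text shorter than two characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_bigram_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index[gram].add(doc_id)
    return index

def search(index, docs, query):
    """Intersect the bigram postings, then confirm with a substring test."""
    grams = bigrams(query)
    if not grams:
        # Queries shorter than two characters have no bigram; scan directly.
        return sorted(d for d, text in docs.items() if query and query in text)
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return sorted(d for d in candidates if query in docs[d])

# Hypothetical documents; "b" contains both bigrams of the query
# but not the query itself, so the confirmation step rejects it.
docs = {"a": "我喜欢互联网", "b": "联网互联", "c": "互联很大"}
index = build_bigram_index(docs)
print(search(index, docs, "互联网"))   # ['a']
```

The confirmation step matters: without it, document "b" would be reported as a match because it happens to contain both character pairs in a different order.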

>
>>Another way would be to do a linear search, as the text you quoted says.
>>If the texts to be searched are not very long, I guess this would be the
>>simplest way. With GB text, I think you will be able to use any Frontier
>>string verb, and even regex search. You may also use the Mgrep OSAX to do
>>more reliable searches of Chinese text...
>
>I don't think my text will be very long. But I'm not sure about this
>"linear search" thing. Does it imply that I need to build a sort of
>binary tree? Can you please provide some examples?
>
>It looks like I only need to index chars, not bigrams. Is that
>right? If that's the case, how accurate could the search engine be?
>

By "linear search" I simply mean the basic search of text like the one that
one finds in every word-processing program. Say that I want to search for
the word "program" in that sentence; I would do:

on search (targetWord, str)
	«Return {start, end} of the first match (start is 1-based, end is one past the last character), or false
	local (len, pos)
	pos = string.patternMatch (targetWord, str)
	if pos != 0
		len = string.length (targetWord)
		return ({pos, pos + len})
	else
		return (false)

local (str = "By linear search I mean only the basic search of text like the one that one finds in every word-processing program.")
local (targetWord = "program")
print (search (targetWord, str))

which returns {108, 115}

Perhaps this is too simple to be used as a search engine...??
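
A rough Python equivalent of the script above (Python rather than UserTalk; the name find_all is made up) that returns every match instead of only the first, using the same 1-based start and one-past-the-end convention. The last lines illustrate one accuracy caveat for byte-oriented searches over GB2312 text: both bytes of a character fall in the same high range (0xA1 to 0xFE), so a byte-level match can straddle two adjacent characters. Comparing decoded characters, as find_all does, avoids that.

```python
def find_all(target, text):
    """Return (start, end) pairs: 1-based start, end one past the last character."""
    hits = []
    if not target:
        return hits
    pos = text.find(target)
    while pos != -1:
        hits.append((pos + 1, pos + 1 + len(target)))
        pos = text.find(target, pos + 1)
    return hits

sentence = ("By linear search I mean only the basic search of text like "
            "the one that one finds in every word-processing program.")
print(find_all("program", sentence))   # [(108, 115)]
print(find_all("search", sentence))    # [(11, 17), (40, 46)]

# Byte-level caveat: a two-character GB2312 string is four bytes.
raw = "中文".encode("gb2312")
straddle = raw[1:3]   # second byte of the first character + first byte of the second
print(straddle in raw)                       # True: a byte search reports a hit
print(straddle.decode("gb2312") in "中文")   # False: that character is not in the text
```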

---------

In another posting, you wrote:

>>The last is simplest: when typing in Chinese, put in a space between
>>words! That system works well in the West, keyboards already have
>>spacebars, and it is simple enough for people to do when they have the
>>habit. The spaces should be ignored by the publication system: they
>>should be treated as "zero-width spaces".

>I could tell users that if they want their Chinese text to be indexed,
>they need to separate the Chinese words with spaces. I think that if I
>start from such text to implement indexing and searching, it's going to
>be much simpler.

I think this is a good idea. But I imagine it would not be very simple to
type Chinese text with a space between each word, because in Chinese and
Japanese input methods the space bar is generally used to trigger the
conversion of the typed pronunciation into Chinese or Japanese
character(s). For example, you type "f-a-n-g", then you press the space
bar, and several candidate characters pronounced "fang" appear in a
little list box, etc. Of course, you can type spaces in a Chinese or
Japanese text, but this requires another step (for example, pressing the
Caps Lock key), which is not natural for Japanese or Chinese typists.

Good luck, and best regards,

Nobumi Iyanaga
Tokyo,
Japan
