A community of Frontier
and Radio Users


Simplified Chinese (GB2312) in Manila

Shown in reverse chronological order.

Messages: 1 - 15 of 29.

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 3/1/2002; 12:01 AM by Emmanuel. M. Decarie
Date Modified: 3/1/2002; 12:01 AM by Emmanuel. M. Decarie
In Response To: 16228

At 17:33 -0500 28/02/02, Phil Suh wrote:
>WITH REGARD TO SPACING IN CHINESE
>
>Emmanuel, the strategy of asking your users to add spaces between words is
>technically feasible but I think culturally misplaced. Written Japanese
>and Chinese don't use spaces between words. Asking your users to add
>spaces is not likely to work. It's similar to asking English or French
>writers to write *without* spaces. Just not done.

Hello Phil,

Thanks for your great input.

The idea of adding spaces didn't come from me but from this web page
(<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>), which
is a FAQ on how to process Chinese text. But Nobumi said the same
thing you did about spaces between words, so I don't think this
is really a solution.

Instead, I may run a web service on a FreeBSD or Linux
machine and let Perl index these pages.

There is a nice callback at
config.mainresponder.callbacks.storePageForIndexing. From there you
can decide whether you want Frontier to index the page, and you get
plenty of info on the page (the text, the URL, the name
of the site and so on). So I think it could be trivial to send the
content of the page via XML-RPC to another server running Perl for
indexing.
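
The receiving side of such a setup might look roughly like the sketch below. It uses Python's standard xmlrpc module rather than Perl, purely for illustration. The method name indexer.storePage, its three parameters, and the in-memory "pages" store are all assumptions; nothing here is part of Frontier or of the callback's actual interface.

```python
from xmlrpc.server import SimpleXMLRPCServer

# In-memory stand-in for a real index: url -> (site name, page text).
pages = {}

def store_page(url, site_name, text):
    """Hypothetical XML-RPC method: remember one page for later indexing."""
    pages[url] = (site_name, text)
    return len(text)  # number of characters received, as a cheap acknowledgement

def make_server(host="127.0.0.1", port=8000):
    """Build (but do not start) a server that exposes store_page.

    The method name "indexer.storePage" is an assumption for this sketch.
    """
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(store_page, "indexer.storePage")
    return server
```

Calling make_server().serve_forever() would start listening. The Frontier side would then call indexer.storePage from inside the callback with the page's URL, site name and text.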

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/28/2002; 6:33 PM by Phil Suh
Date Modified: 2/28/2002; 6:33 PM by Phil Suh
In Response To: 16121

On Thu, 21 Feb 2002, Nobumi Iyanaga wrote:
> And Phil, are you still interested in rendering Japanese text with
> Frontier?

Honto ni hisashiburi desu ne. Wow, it's good to hear from you, Nobumi.
Like old times.

I'm still interested in rendering Japanese text with Frontier, and now
Radio. But I'm afraid that there are still major limitations, which
Emmanuel, Daniel, et al are running into.


WHAT WORKS

Frontier and Radio will take whatever you give them and store it in the ODB
(object database). So if you hack the HTML form templates so
that the browser sends correctly encoded text, Frontier/Radio will
blissfully store it unchanged. And again, when Frontier/Radio
pulls a message out of the ODB for display, it does not touch it, and it
will work fine.


WHAT DOESN'T WORK

The problem comes when you attempt to manipulate that text--either in a
regex, or with one of the builtins.string verbs.

Since Frontier is not Unicode savvy, it will wind up garbling the text.
In Japanese this is called mojibake. In English, sadness and despair.
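
To make the failure concrete, here is a small Python sketch of what byte-oriented string handling does to double-byte text. It uses GB2312 and a made-up sample string rather than Japanese, and it illustrates the general problem, not Frontier's actual internals.

```python
text = "中文搜索"
raw = text.encode("gb2312")            # each of these characters takes two bytes

# 1. Cutting at a byte offset can split a character in half.
truncated = raw[:3]                    # one whole character plus half of the next
damaged = truncated.decode("gb2312", errors="replace")

# 2. A byte-level search can match across a character boundary.
straddling = raw[1:3]                  # 2nd byte of one char + 1st byte of the next
phantom = straddling.decode("gb2312")  # a valid character nobody typed
byte_hit = raw.find(straddling)        # found, at the odd byte offset 1
char_hit = text.find(phantom)          # -1: not present character by character
```

The variable "damaged" ends with a replacement character where the half character was. "byte_hit" shows a byte-level search reporting a match that does not exist in the text as a reader sees it.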


SUMMARY

You can store and retrieve text, with a little juggling. Anything
interesting, however (searching, regex, any sort of string manipulation,
running text through macros or the renderer) will not work. I'm thinking
primarily of *Japanese text* here, I don't have experience with Simplified
Chinese.


WITH REGARD TO SPACING IN CHINESE

Emmanuel, the strategy of asking your users to add spaces between words is
technically feasible but I think culturally misplaced. Written Japanese
and Chinese don't use spaces between words. Asking your users to add
spaces is not likely to work. It's similar to asking English or French
writers to write *without* spaces. Just not done.


THE BOOK

Ah yes, Ken Lunde's CJKV Information Processing is a work of art. It's one
of the few computer books on my shelf that makes me smile when I pick it
up. "Everything I'll ever need to know about this topic is in my hands." A
very satisfying feeling. This kind of info does not go out of date
quickly--my book is a first printing, January 1999.


MY OLD, BROKEN, OUT OF DATE SITE

http://filsa.net/frontier/polyglot/

Has some discussions about Japanese in Frontier from, geez, ages ago.


USERLAND AND UNICODE

Userland's COO John Robb wrote me last year to ask what the status of
Unicode in Frontier was (he saw my polyglot site). I wrote a long
response, which, because it is informative, I will forward to this list.

I can understand why Userland has yet to put Unicode support into
Frontier/Radio. It's expensive. And somewhat risky--it's messing around in
the kernel. It's a lot of developer time in the trenches on a not-so-sexy
feature.

OTOH, I think it's a necessary and *practical* feature--and it's also the
way of the world. Every app should, IMHO, support all the world's
languages, because 1) there are supportable standards, 2) it's technically
possible, 3) the world is a smaller place, and 4) English is only 1 of
the world's 4 major language groups (with Hindi, Mandarin, and Spanish)... but
I'm ranting.

Cheers,

Phil

(just got caught up on this thread--and this thread only. Man you guys are
talky.)

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/26/2002; 2:45 PM by Emmanuel. M. Decarie
Date Modified: 2/26/2002; 2:45 PM by Emmanuel. M. Decarie
In Response To: 16211

Hello Henri,

This sounds very interesting.

Could you put the Perl and C code on a web page so that other interested
people could download it?

I don't have enough C knowledge, but if someone could turn the C
code into a DLL, that could solve a lot of problems for double-byte
languages.

>Read on the web at http://community.scriptmeridian.org/16211
>----------------------------------
>
>Sorry to chime in so late, and slightly off topic, but I have a couple
>of pieces of code that might be of interest:
>
>First, I have developed a fairly optimized piece of Perl code that will
>search a full web page (or any long text string) and find all
>occurrences of words in a predefined list.
>By "fairly optimized" I mean that in 10 milliseconds it can find all
>occurrences of 10,000 words in a 100k web page (translating into 50k of
>real text). The system is set up as a web service in Apache/mod_perl,
>where you predefine a list of words and then call it via http with the
>web page you want to parse. It's completely scalable to billions of
>searches a day.
>
>The problem with it is that it uses word boundaries, which is no good to
>you in the case of Chinese it seems.
>
>However, I also have a VERY optimized piece of C code written by one of
>my engineers that streams text and finds all occurrences of a list of
>words in that text.

What format is this list of words in? If it's a text file, it could
easily be loaded into a GDB in Frontier so the DLL could use this GDB for
a search (I have 119000 words in a CJK word list). Does that sound
doable? Is a table with 119000+ items too wide to be efficient in
Frontier?

>The difference with the above is not only that it
>does it in less than 1 millisecond, but it doesn't need word boundaries
>as it checks each character as it comes in for a match. This would
>probably be the solution for you.

Yes, this looks very promising because it doesn't oblige users to
change the way they write.

>You may want to take a look at the code and see if you can replicate it
>in Usertalk or perl or whatever, it's a very small but extremely
>efficient algorithm.

Thanks Henri for your input.

Cheers
-Emmanuel


--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

Re: Simplified Chinese (GB2312) in Manila

Date Posted: 2/26/2002; 2:04 PM by Henri Asseily
Date Modified: 2/26/2002; 2:04 PM by Henri Asseily
In Response To: 16181

Sorry to chime in so late, and slightly off topic, but I have a couple
of pieces of code that might be of interest:

First, I have developed a fairly optimized piece of Perl code that will
search a full web page (or any long text string) and find all
occurrences of words in a predefined list.
By "fairly optimized" I mean that in 10 milliseconds it can find all
occurrences of 10,000 words in a 100k web page (translating into 50k of
real text). The system is set up as a web service in Apache/mod_perl,
where you predefine a list of words and then call it via http with the
web page you want to parse. It's completely scalable to billions of
searches a day.

The problem with it is that it uses word boundaries, which is no good to
you in the case of Chinese it seems.
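
The word-boundary limitation is easy to reproduce. The sketch below is Python, not the Perl service described here, and the function name and sample strings are made up. A pattern anchored on \b finds listed words in English text, but finds nothing in unspaced Chinese, because every neighbouring character also counts as a word character.

```python
import re

def find_listed_words(words, text):
    """Return every occurrence of a listed word that sits between word boundaries."""
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return pattern.findall(text)

english_hits = find_listed_words(["search", "text"], "basic search of text")
unspaced_hits = find_listed_words(["中文"], "我爱中文搜索")     # no boundaries, so no hits
spaced_hits = find_listed_words(["中文"], "我 爱 中文 搜索")    # spaces restore them
```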

However, I also have a VERY optimized piece of C code written by one of
my engineers that streams text and finds all occurrences of a list of
words in that text. The difference with the above is not only that it
does it in less than 1 millisecond, but it doesn't need word boundaries
as it checks each character as it comes in for a match. This would
probably be the solution for you.
You may want to take a look at the code and see if you can replicate it
in Usertalk or perl or whatever, it's a very small but extremely
efficient algorithm.
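
The C code itself is not shown in this thread, so the following is only a guess at the general technique. It is a Python sketch that stores the word list in a trie (nested dictionaries) and tries a match starting at every character, with no reliance on word boundaries. All names are hypothetical. A true streaming matcher would more likely use something like the Aho-Corasick algorithm, which avoids re-reading characters.

```python
def build_trie(words):
    """Store the word list as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = word  # end-of-word marker; "" can never be a real character
    return root

def find_all(trie, text):
    """Return (start, word) for every listed word found anywhere in text."""
    hits = []
    for start in range(len(text)):
        node = trie
        pos = start
        # Walk the trie for as long as the text keeps matching.
        while pos < len(text) and text[pos] in node:
            node = node[text[pos]]
            pos += 1
            if "" in node:
                hits.append((start, node[""]))
    return hits

trie = build_trie(["中文", "搜索", "中文搜索"])
hits = find_all(trie, "我爱中文搜索")
```

Here "hits" contains the overlapping matches as well: both "中文" and "中文搜索" are reported at the same starting position.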

Let me know.

Henri.



>>> I don't think my text will be very long. But I'm not sure about this
>>> "linear search" thing. Is it implying that I need to build a sort of
>>> binary tree. Can you please provide some examples.
>>>
>>> This look that I only need to index not bigrams but chars. Is that
>>> right? If its the case, how accurate could be the search engine?
>>>
>>
>> By "linear search" I mean simply the basic search of text like the one
>> that
>> one finds in every word-processing program. Say that I search for the
>> word
>> "program" in the last sentence, I would do...:
>>
>> on search (targetWord, str)
>> local (len, pos)
>> pos = string.patternMatch (targetWord, str)
>> if pos != 0
>> len = string.length (targetWord)
>> return ({pos, pos + len})
>> else
>> return (false)
>>
>> local (str = "By linear search I mean only the basic search of text
>> like
>> the one that one finds in every word-processing program.")
>> local (targetWord = "program")
>> print (search (targetWord, str))
>>
>> which returns {108, 115}
>>
>> Perhaps this is too simple to be used as a search engine...??
>
> Ok, I understand now. About your question, it might work. But I think
> I will going to ask the chinese user to put a space between each
> chinese words. This will eliminate a lot of overhead I think.
>
>> ---------
>>
>> In another posting, you wrote:
>>
>>>> The last is simplest: when typing in Chinese, put in a space between
>>>> words! That system works well in the West, keyboards already have
>>>> spacebars, and it is simple enough for people to do when they have
>>>> the
>>>> habit. The spaces should be ignored by the publication system: they
>>>> should be treated as "zero-width spaces".
>>
>>> I could tell the users that if it want its chinese text to be
>>> indexed,
>>> he need to split chinese word with space. I think that if I start
>>> from
>>> such a text to implement indexing and searching, its going to be much
>>> more simpler.
>>
>> I think this is a good idea. But I imagine it would not be very
>> simple to
>> type Chinese text separating each word with a space, because in
>> general, in
>> Chinese or Japanese input methods, the space bar is used to trigger the
>> conversion of the inputted pronunciation into Chinese or Japanese
>> character(s). For example, you type "f-a-n-g", then you press on the
>> space
>> bar, and several candidates of characters pronunced "fang" appear in a
>> little list box, etc. Of course, you can type spaces in a Chinese or
>> Japanese text, but this requires another step (for example, pressing
>> the
>> Caps Lock key), which is not naturel for Japanese or Chinese typists.
>
> Oh, I see. Can you suggest a better markup than space that could be
> more convenient for the Chinese user/Japanese user?
>
> Thanks again Nobumi for your input, this help me tremendously, and I
> like the whole challenge.
>
> Cheers
> -Emmanuel
> --
> ______________________________________________________________________
> Emmanuel Decarie / Programmation pour le Web - Programming for the Web
> Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
>
>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 10:16 PM by Jonathan Lewis
Date Modified: 2/24/2002; 10:16 PM by Jonathan Lewis
In Response To: 16140

Dear Nobumi,

Sorry for the long delay in replying. Thanks for the tip about browsers. I'll take a look at Omniweb and, gulp, Windows browsers.

> How do you render UTF-8 text in Frontier...? As Frontier is not Unicode savvy, I think all text in UTF-8 is garbled in Frontier's windows...? Or do you convert the text from legacy codes to UTF-8 on the fly?

As you say, all UTF-8 text is garbled in Frontier. With message contents, subjects etc., everything can easily be edited in the browser because this is Manila. So I never need to edit those directly in Frontier. The fact that the server is in deepest Saitama, while I'm usually working on the site here in Daiba, is another reason why I don't mind that I can't edit text directly in Frontier.

I have also produced very rudimentary Japanese and (thanks to a friend from China) Simplified Chinese versions of Manila's localization tables, which display page furniture such as the Edit this Page button in the user's language. With those, I input the text directly into Frontier, then run a little script to change the text into UTF-8 using the TEC extension. If I want to edit those strings, I edit the originals then reconvert them to UTF-8.
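
The conversion step itself is small. As a rough Python equivalent (not the TEC-based script described above; the function name and sample strings are made up), converting a legacy-encoded string to UTF-8 means decoding it with the legacy encoding and re-encoding the result:

```python
def legacy_to_utf8(data, source_encoding):
    """Re-encode bytes from a legacy encoding (e.g. GB2312, Shift_JIS) as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

gb_label = "编辑".encode("gb2312")        # a Simplified Chinese label, 4 bytes
sjis_label = "編集".encode("shift_jis")   # a Japanese label, 4 bytes

utf8_from_gb = legacy_to_utf8(gb_label, "gb2312")
utf8_from_sjis = legacy_to_utf8(sjis_label, "shift_jis")
```

The source encoding has to be known in advance. Decoding with the wrong one either fails or silently produces garbage.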

Best,

Jonathan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 2:57 PM by Emmanuel. M. Decarie
Date Modified: 2/24/2002; 2:57 PM by Emmanuel. M. Decarie
In Response To: 16185

Hello Nobumi,

>Read on the web at http://community.scriptmeridian.org/16185
>----------------------------------

(...)

> >Oh, I see. Can you suggest a better markup than space that could be
>>more convenient for the Chinese user/Japanese user?
>>
>
>I think this depends on the Chinese Input Method used by your users, and
>their habit. Like any markup system, this would be certainly not very
>naturel and require some extra effort. The best would be to ask your users
>what they would prefer.

I'm a little bit scared of letting users choose the markup
themselves because this could lead to a lot of programming headaches.

Maybe the choice could be limited to 3 or 4 types of markup.
I need to think a little bit more about this.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 8:45 AM by Nobumi Iyanaga
Date Modified: 2/24/2002; 8:45 AM by Nobumi Iyanaga
In Response To: 16181

Hello Emmanuel,

> >
> >I think this is a good idea. But I imagine it would not be very simple to
> >type Chinese text separating each word with a space, because in general, in
> >Chinese or Japanese input methods, the space bar is used to trigger the
> >conversion of the inputted pronunciation into Chinese or Japanese
> >character(s). For example, you type "f-a-n-g", then you press on the space
> >bar, and several candidates of characters pronunced "fang" appear in a
> >little list box, etc. Of course, you can type spaces in a Chinese or
> >Japanese text, but this requires another step (for example, pressing the
> >Caps Lock key), which is not naturel for Japanese or Chinese typists.
>
>Oh, I see. Can you suggest a better markup than space that could be
>more convenient for the Chinese user/Japanese user?
>

I think this depends on the Chinese input method used by your users, and
on their habits. Like any markup system, this would certainly not be very
natural and would require some extra effort. The best would be to ask your users
what they would prefer.

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 8:16 PM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 8:16 PM by Emmanuel. M. Decarie
In Response To: 16172

>Read on the web at http://community.scriptmeridian.org/16172
>----------------------------------
>
>Hello Emmanuel,

Hello Nobumi,

> >I don't think my text will be very long. But I'm not sure about this
>>"linear search" thing. Is it implying that I need to build a sort of
>>binary tree. Can you please provide some examples.
>>
>>This look that I only need to index not bigrams but chars. Is that
>>right? If its the case, how accurate could be the search engine?
> >
>
>By "linear search" I mean simply the basic search of text like the one that
>one finds in every word-processing program. Say that I search for the word
>"program" in the last sentence, I would do...:
>
>on search (targetWord, str)
> local (len, pos)
> pos = string.patternMatch (targetWord, str)
> if pos != 0
> len = string.length (targetWord)
> return ({pos, pos + len})
> else
> return (false)
>
>local (str = "By linear search I mean only the basic search of text like
>the one that one finds in every word-processing program.")
>local (targetWord = "program")
>print (search (targetWord, str))
>
>which returns {108, 115}
>
>Perhaps this is too simple to be used as a search engine...??

OK, I understand now. About your question, it might work. But I think
I am going to ask the Chinese users to put a space between
Chinese words. I think this will eliminate a lot of overhead.

>---------
>
>In another posting, you wrote:
>
>> >The last is simplest: when typing in Chinese, put in a space between
>> >words! That system works well in the West, keyboards already have
>> > spacebars, and it is simple enough for people to do when they have the
>> >habit. The spaces should be ignored by the publication system: they
>> >should be treated as "zero-width spaces".
>
>> I could tell the users that if it want its chinese text to be indexed,
>> he need to split chinese word with space. I think that if I start from
>> such a text to implement indexing and searching, its going to be much
>> more simpler.
>
>I think this is a good idea. But I imagine it would not be very simple to
>type Chinese text separating each word with a space, because in general, in
>Chinese or Japanese input methods, the space bar is used to trigger the
>conversion of the inputted pronunciation into Chinese or Japanese
>character(s). For example, you type "f-a-n-g", then you press on the space
>bar, and several candidates of characters pronunced "fang" appear in a
>little list box, etc. Of course, you can type spaces in a Chinese or
>Japanese text, but this requires another step (for example, pressing the
>Caps Lock key), which is not naturel for Japanese or Chinese typists.

Oh, I see. Can you suggest a better markup than space that could be
more convenient for the Chinese user/Japanese user?

Thanks again Nobumi for your input. This helps me tremendously, and I
like the whole challenge.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 11:02 AM by Samuel Reynolds
Date Modified: 2/23/2002; 11:02 AM by Samuel Reynolds
In Response To: 16162

>Read on the web at http://community.scriptmeridian.org/16162
>----------------------------------
>
>On 2/22/2002 9:45 AM, Samuel Reynolds <sam@spinwardstars.com> wrote:
>
>>UTF-8 is the 8-bit subset of unicode that corresponds to
>>the ISO-8859 (Latin-1) "extended ASCII" character set that
>>is the Windows default set. Character 0xnn in UTF-8 is
>>always character 0x00nn in unicode.
>
>UTF-8 only preserves the ASCII character set. You may be thinking of
>ISO-8859-1, in that Unicode chars U+0000 to U+00FF are exactly the
>characters in ISO-8859-1. Win1252 is very similar to ISO-8859-1 but not
>identical.
>
>http://www.ietf.org/rfc/rfc2279.txt
>http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
>http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
>http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>
>-Brian

Yes, I was thinking of ISO-8859-1. But I thought that
UTF-8 and ISO-8859-1 were the same. My bad.
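
Brian's distinction can be checked with a few lines of Python. This is only an illustrative sketch and the variable names are made up. ASCII characters have the same bytes in UTF-8 and ISO-8859-1, but a character such as é (U+00E9) is one byte in ISO-8859-1 and two in UTF-8. Windows-1252 and ISO-8859-1 disagree on byte 0x80.

```python
e_acute = "\u00e9"                              # é, Unicode code point U+00E9

latin1_bytes = e_acute.encode("iso-8859-1")     # one byte, equal to the code point
utf8_bytes = e_acute.encode("utf-8")            # two bytes in UTF-8

ascii_same = "A".encode("utf-8") == "A".encode("iso-8859-1")

byte_80_cp1252 = b"\x80".decode("cp1252")       # the euro sign in Windows-1252
byte_80_latin1 = b"\x80".decode("iso-8859-1")   # a C1 control character, U+0080
```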

Oh, well; live and learn!

- Sam
- - - - - - - - - - - - - - - - - - - - - - - - - -
I'm currently looking for a new position/contract.
Resume at http://spinwardstars.com/vitae/
- - - - - - - - - - - - - - - - - - - - - - - - - -
_____________________________________________
Samuel Reynolds sam@spinwardstars.com
Spinward Stars: http://www.spinwardstars.com/

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 9:01 AM by Nobumi Iyanaga
Date Modified: 2/23/2002; 9:01 AM by Nobumi Iyanaga
In Response To: 16167

Hello Emmanuel,

>
>Also, I had the wrong assumption that 2 chars == 1 word in Simplified
>Chinese when one word in Simplified Chinese could be made by more
>than one ideogram. See for example the word "Internet" on this
>page:<http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
>there are composite words close to what in English and French we
>designate by "neologism".

Defining a "word" in Chinese is difficult (although I don't know
modern Chinese at all...). Each character usually has a meaning of its own, and
combinations of several characters have meanings too.

>
> >Another way would be to do a linear search, as the text you quoted says.
> >If your texts to be searched are not very long, I guess this would be the
> >simplest way. With GB text, I think you will be able to use any Frontier
> >string verbs, and even regex search. You may use also Mgrep OSAX to do
> >more reliable search with Chinese text...
>
>I don't think my text will be very long. But I'm not sure about this
>"linear search" thing. Is it implying that I need to build a sort of
>binary tree. Can you please provide some examples.
>
>This look that I only need to index not bigrams but chars. Is that
>right? If its the case, how accurate could be the search engine?
>

By "linear search" I mean simply the basic search of text like the one that
one finds in every word-processing program. Say that I search for the word
"program" in the last sentence, I would do...:

on search (targetWord, str)
    local (len, pos)
    pos = string.patternMatch (targetWord, str)
    if pos != 0
        len = string.length (targetWord)
        return ({pos, pos + len})
    else
        return (false)

local (str = "By linear search I mean only the basic search of text like the one that one finds in every word-processing program.")
local (targetWord = "program")
print (search (targetWord, str))

which returns {108, 115}

Perhaps this is too simple to be used as a search engine...??
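
For comparison, the same linear search can be sketched in Python, where strings are sequences of characters rather than bytes, so it behaves the same way on Chinese text. This is an illustrative translation, not Frontier code. Positions are kept 1-based to match the result above, and the Chinese sample is made up.

```python
def search(target, text):
    """Linear search: return the 1-based (start, end) of the first match, or None."""
    index = text.find(target)       # 0-based, -1 when not found
    if index == -1:
        return None
    pos = index + 1                 # convert to 1-based
    return (pos, pos + len(target))

sentence = ("By linear search I mean only the basic search of text like "
            "the one that one finds in every word-processing program.")
english_result = search("program", sentence)
chinese_result = search("中文", "我爱中文搜索")
```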

---------

In another posting, you wrote:

> >The last is simplest: when typing in Chinese, put in a space between
> >words! That system works well in the West, keyboards already have
> > spacebars, and it is simple enough for people to do when they have the
> >habit. The spaces should be ignored by the publication system: they
> >should be treated as "zero-width spaces".

> I could tell the users that if it want its chinese text to be indexed,
> he need to split chinese word with space. I think that if I start from
> such a text to implement indexing and searching, its going to be much
> more simpler.

I think this is a good idea. But I imagine it would not be very simple to
type Chinese text separating each word with a space, because in general, in
Chinese or Japanese input methods, the space bar is used to trigger the
conversion of the typed pronunciation into Chinese or Japanese
character(s). For example, you type "f-a-n-g", then you press the space
bar, and several candidate characters pronounced "fang" appear in a
little list box, etc. Of course, you can type spaces in a Chinese or
Japanese text, but this requires another step (for example, pressing the
Caps Lock key), which is not natural for Japanese or Chinese typists.

Good luck, and best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 2:58 AM by jt
Date Modified: 2/23/2002; 2:58 AM by jt
In Response To: 16170

I'm just catching up, and haven't followed all the links. But it seems like
you have a winnah, Emmanuel...!

It would require an adjustment from the writer, which I don't know if
they'll go for... But IF they want the text to be indexed and searched (and
I'd think they would in many cases), then I think they COULD make this
adjustment.

I don't know the culture well enough to imagine whether they WOULD, but
should be easy enough to test on a small scale.


| -----Original Message-----
| From: sm.community@lists.scriptmeridian.org
| [mailto:sm.community@lists.scriptmeridian.org]On Behalf Of Emmanuel. M.
| Decarie
| Sent: Saturday, February 23, 2002 12:04 AM
| To: sm.community@lists.scriptmeridian.org
| Subject: RE: [SM] Simplified Chinese (GB2312) in Manila [Msg#16170]
|
|
| Read on the web at http://community.scriptmeridian.org/16170
| ----------------------------------

<snip>

| I could tell the users that if it want its chinese text to be
| indexed, he need to split chinese word with space. I think that if I
| start from such a text to implement indexing and searching, its going
| to be much more simpler.
|
| What do you think Nobumi (or others interested in the discussion),
| does it make sense for you?
|
| Cheers
| -Emmanuel
|
| --
| ______________________________________________________________________
| Emmanuel Decarie / Programmation pour le Web - Programming for the Web
| Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
|
|
|

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 1:03 AM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 1:03 AM by Emmanuel. M. Decarie
In Response To: 16164

Thinking more about this, and rereading this page:
<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>, I think
I'll follow the advice from the author:

>I personally feel that the answer is neither dictionaries nor
>parsing but markup. When entering the data, there should either be
>markup of all proper nouns or markup of all word boundaries.
>
>The last is simplest: when typing in Chinese, put in a space between
>words! That system works well in the West, keyboards already have
>spacebars, and it is simple enough for people to do when they have
>the habit. The spaces should be ignored by the publication system:
>they should be treated as "zero-width spaces".

I could tell users that if they want their Chinese text to be
indexed, they need to separate Chinese words with spaces. I think that if I
start from such text to implement indexing and searching, it's going
to be much simpler.
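
Starting from such space-separated text, both halves of the idea are short. The sketch below is Python with hypothetical names. It assumes the text is purely Chinese, because the publishing step removes every space, which would also glue English words together. Indexing splits on the spaces, and publishing drops them, which is one way to treat them as "zero-width".

```python
from collections import defaultdict

def index_words(doc_id, spaced_text, index):
    """Add every space-separated word of one document to an inverted index."""
    for word in spaced_text.split():
        index[word].add(doc_id)

def publish(spaced_text):
    """Drop the helper spaces before display (assumes purely Chinese text)."""
    return "".join(spaced_text.split())

index = defaultdict(set)
index_words("page1", "我 爱 中文 搜索", index)
displayed = publish("我 爱 中文 搜索")
```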

What do you think, Nobumi (or others interested in the discussion)?
Does it make sense to you?

Cheers
-Emmanuel

--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 12:58 AM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 12:58 AM by Emmanuel. M. Decarie
In Response To: 16164

Thinking more about this, and rereading this page:
<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>, I think
I'll follow the advice from the author:

>I personally feel that the answer is neither dictionaries nor
>parsing but markup. When entering the data, there should either be
>markup of all proper nouns or markup of all word boundaries.
>
>The last is simplest: when typing in Chinese, put in a space between
>words! That system works well in the West, keyboards already have
>spacebars, and it is simple enough for people to do when they have
>the habit. The spaces should be ignored by the publication system:
>they should be treated as "zero-width spaces".

I could tell users that if they want their Chinese text to be
indexed, they need to separate Chinese words with spaces. I think that if I
start from such text to implement indexing and searching, it's going
to be much simpler.

What do you think, Nobumi (or others interested in the discussion)?
Does it make sense to you?

Cheers
-Emmanuel



--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 12:44 AM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 12:44 AM by Emmanuel. M. Decarie
In Response To: 16165

>Read on the web at http://community.scriptmeridian.org/16165
>----------------------------------
>
>Hello Emmanuel,
>
>>Read on the web at http://community.scriptmeridian.org/16153
>>----------------------------------
>>
>>Anyone know this book:
>>http://www.oreilly.com/catalog/cjkvinfo/
>>
>>It look like what I need but its a little bit ancient (published in 1998).
>>
>
>I have this book. It IS certainly the best work on this area, and for your
>need, I think it is not ancient at all.

Thanks Nobumi. I went to the bookstore today and the book looks pretty
good for what I need to understand and implement (but it's a little
bit expensive). I think I will buy it soon.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 12:44 AM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 12:44 AM by Emmanuel. M. Decarie
In Response To: 16164

Hello Nobumi,

At 22:24 -0500 22/02/02, Nobumi Iyanaga wrote:
>Read on the web at http://community.scriptmeridian.org/16164
>----------------------------------
>
>Hello Emmanuel,
>
>>
>>Well I found this page which is quite helpful:
> ><http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>
>>
>>Especially this paragraph:
>>
>> >Chinese words are usually made from one to four characters; most are
>> >made from two characters.
>> >
>> >Because there are no reliable indications of word boundaries in most
>> >Chinese text, when you search you cannot use the speed up of
>> >skipping to the next word when a match fails. You have to just do a
>> >linear search.
>> >
>> >Because of the large number of characters, there are many ways of
>> >indexing text which you would not attempt in English. The most
>> >straightforward is to say "one character == one word", and to index
>> >every character (i.e., a fully permuted index.) You would never do
>> >this in English: imagine an index that found every occurance of the
>> >letter "e" in a document!
>> >
>> >You can also index on bigrams (two consecutive characters),
>> >trigrams, or even 4-grams. I read somewhere that anything more than
>> >4-grams is not much use. The cost of these is, unfortuately, that
>> >because you don't know the word boundaries, you will get lots of
>> >spurious n-grams. On the other hand, because word boundaries are
>> >fairly subjective in Chinese (like in English: some people hypenate,
>> >some don't) it is probably good to err on the side of having too
>> >many n-grams anyway.
>
>All this is very true. Indexing double-byte text is a most difficult task.
>
>>What I need is a list of these bigrams in Simplified Chinese.
>>
>>I found such a list in a Perl script called "codelib.pl" already
>>rolled in hash (life is good).
>><http://www.mandarintools.com/download/>
>>
>>There is another script on this page called "segment" that split
>>chinese text in "word" but I can't get it too work on OS X from the
>>terminal.
>
>I downloaded these scripts thanks to your info. It seems that the script
>codelib.pl does not contain a hash of Chinese word list -- it is rather a
>program that attempts to guess the encoding of double-byte text.

Yes, you are right. I wrongly assumed that Simplified Chinese was an
adaptation of Cantonese for the Web. But I read more on the subject,
and I know now that Simplified Chinese came from a Mao-era script reform
in the 1950s and 1960s (I think) and that it contains around 17,000 ideograms.

Also, I wrongly assumed that 2 chars == 1 word in Simplified
Chinese, when a single word can in fact be made of more
than two ideograms. See for example the word "Internet" on this
page: <http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
there are compound words close to what in English and French we
call a "neologism".


>On the other hand, the script segment.pl and other files contained in the
>same folder will attempt to extract words from a Simplified Chinese text.
>It contains a list of 119804 (!) Chinese words (what a work...!) (which are
>not only bigrams, but tri-grams, quadri-grams, etc.). Even if you cannot
>run the perl script itself (I didn't try it...), you could load this word
>list in a Frontier, and use it to "segment" a given Chinese text. But I
>think the task would not be easy -- and the result may not be very
>satisfactory.

Yes, this looks like a difficult task. Frontier would have to load this
huge table with 119000+ items in memory and would need to go through all
the items to see if they match. I did some tests (but without success)
with Perl on OS X, and the script still needs a couple of seconds to
finish its run on 3 bigrams. I have to test this again, but it doesn't
look very promising.
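
One way around scanning every item is to keep the word list in a hash-based set and look up candidate substrings instead. The sketch below is Python and shows forward maximum matching. That is a common segmentation technique, but not necessarily what segment.pl does. The three-word lexicon is made up, and max_len=4 follows the FAQ's remark that words are usually one to four characters long.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest listed word.

    Unknown characters fall through as single-character "words".
    """
    words = []
    pos = 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Set membership is a hash lookup, not a scan of the whole list.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                pos += size
                break
    return words

lexicon = {"中文", "搜索", "引擎"}        # stand-in for the 119,000-word list
segments = segment("我爱中文搜索引擎", lexicon)
```

Each position costs at most max_len lookups, so the running time depends on the length of the text rather than on the size of the word list.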

(snip)


>Another way would be to do a linear search, as the text you quoted says.
>If your texts to be searched are not very long, I guess this would be the
>simplest way. With GB text, I think you will be able to use any Frontier
>string verbs, and even regex search. You may use also Mgrep OSAX to do
>more reliable search with Chinese text...

I don't think my text will be very long. But I'm not sure about this
"linear search" thing. Does it imply that I need to build a sort of
binary tree? Can you please provide some examples?

It looks like I only need to index chars, not bigrams. Is that
right? If that's the case, how accurate would the search engine be?
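
To compare the two options, here is a small Python sketch of the n-gram indexing described in the quoted FAQ. The function names and sample documents are hypothetical. With n=1 it indexes single characters, and with n=2 it indexes bigrams.

```python
from collections import defaultdict

def ngrams(text, n):
    """All runs of n consecutive characters, including ones that straddle words."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_index(docs, n=2):
    """Map every n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def lookup(index, query, n=2):
    """Documents containing every n-gram of the query (candidates only)."""
    grams = ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

docs = {"a": "我爱中文搜索", "b": "中文很难"}
bigram_index = build_index(docs, n=2)
```

Note the spurious bigram "文搜" produced by ngrams("中文搜索", 2). The results of lookup are only candidates, since a document can contain all of a query's n-grams without containing the query itself. A linear search over the candidates would confirm the real matches.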

Thanks again Nobumi for your input.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
