A community of Frontier
and Radio Users



Simplified Chinese (GB2312) in Manila

Messages: 21 - 29 of 29.

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 11:02 AM by Samuel Reynolds
Date Modified: 2/23/2002; 11:02 AM by Samuel Reynolds
In Response To: 16162
>Read on the web at http://community.scriptmeridian.org/16162
>----------------------------------
>
>On 2/22/2002 9:45 AM, Samuel Reynolds <sam@spinwardstars.com> wrote:
>
>>UTF-8 is the 8-bit subset of unicode that corresponds to
>>the ISO-8859 (Latin-1) "extended ASCII" character set that
>>is the Windows default set. Character 0xnn in UTF-8 is
>>always character 0x00nn in unicode.
>
>UTF-8 only preserves the ASCII character set. You may be thinking of
>ISO-8859-1, in that Unicode chars U+0000 to U+00FF are exactly the
>characters in ISO-8859-1. Win1252 is very similar to ISO-8859-1 but not
>identical.
>
>http://www.ietf.org/rfc/rfc2279.txt
>http://www.microsoft.com/globaldev/reference/sbcs/1252.htm
>http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
>http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>
>-Brian

Yes, I was thinking of ISO-8859-1. But I thought that
UTF-8 and ISO-8859-1 were the same. My bad.

Oh, well; live and learn!
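
As a rough illustration of Brian's point, assuming Python 3 and its standard codecs (the helper name encodings is made up for this example): ASCII characters have the same single byte in ISO-8859-1 and UTF-8, characters from U+0080 to U+00FF take one byte in ISO-8859-1 but two in UTF-8, and Windows-1252 differs from ISO-8859-1 in the 0x80-0x9F range.

```python
def encodings(ch):
    """Return the hex bytes of ch in ISO-8859-1 and in UTF-8."""
    return ch.encode("iso-8859-1").hex(), ch.encode("utf-8").hex()

print(encodings("A"))   # ('41', '41'): ASCII is identical in both
print(encodings("é"))   # ('e9', 'c3a9'): U+00E9 needs two bytes in UTF-8

# Windows-1252 and ISO-8859-1 disagree in the 0x80-0x9F range.
print(bytes([0x80]).decode("cp1252") == "€")          # True
print(bytes([0x80]).decode("iso-8859-1") == "\x80")   # True (a C1 control code)
```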

- Sam
- - - - - - - - - - - - - - - - - - - - - - - - - -
I'm currently looking for a new position/contract.
Resume at http://spinwardstars.com/vitae/
- - - - - - - - - - - - - - - - - - - - - - - - - -
_____________________________________________
Samuel Reynolds sam@spinwardstars.com
Spinward Stars: http://www.spinwardstars.com/

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/23/2002; 8:16 PM by Emmanuel. M. Decarie
Date Modified: 2/23/2002; 8:16 PM by Emmanuel. M. Decarie
In Response To: 16172
>Read on the web at http://community.scriptmeridian.org/16172
>----------------------------------
>
>Hello Emmanuel,

Hello Nobumi,

>>I don't think my text will be very long. But I'm not sure about this
>>"linear search" thing. Does it imply that I need to build a sort of
>>binary tree? Can you please provide some examples?
>>
>>It looks like I only need to index chars, not bigrams. Is that
>>right? If that's the case, how accurate could the search engine be?
>>
>
>By "linear search" I mean simply the basic search of text like the one that
>one finds in every word-processing program. Say that I search for the word
>"program" in the last sentence, I would do...:
>
>on search (targetWord, str)
>    local (len, pos)
>    pos = string.patternMatch (targetWord, str)
>    if pos != 0
>        len = string.length (targetWord)
>        return ({pos, pos + len})
>    else
>        return (false)
>
>local (str = "By linear search I mean only the basic search of text like
>the one that one finds in every word-processing program.")
>local (targetWord = "program")
>print (search (targetWord, str))
>
>which returns {108, 115}
>
>Perhaps this is too simple to be used as a search engine...??
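
For readers without Frontier at hand, the quoted routine can be approximated in Python. This is only a sketch: it assumes string.patternMatch returns a 1-based position and 0 when nothing is found, as the quoted code implies, and it keeps the same convention of returning the start position and the start plus the length.

```python
def search(target_word, text):
    """Return 1-based (start, start + length) like the quoted routine, or False."""
    pos = text.find(target_word)   # 0-based; -1 when the word is absent
    if pos == -1:
        return False
    start = pos + 1                # convert to the 1-based position used above
    return (start, start + len(target_word))

sentence = ("By linear search I mean only the basic search of text like "
            "the one that one finds in every word-processing program.")
print(search("program", sentence))       # (108, 115)
print(search("中文", "我喜欢中文搜索"))    # (4, 6)
```

Because Python strings are sequences of characters rather than bytes, the same scan works on unspaced Chinese text. Whether it is fast enough depends on how much text there is.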

OK, I understand now. As for your question, it might work. But I think
I am going to ask Chinese users to put a space between Chinese
words. That should eliminate a lot of overhead, I think.
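
A minimal Python sketch of what the space-separated approach would buy, assuming the writer really does type a space between words. The function names index_spaced and publish are invented for this example.

```python
def index_spaced(text):
    """Map each space-delimited word to the list of its word positions."""
    positions = {}
    for i, word in enumerate(text.split()):
        positions.setdefault(word, []).append(i)
    return positions

def publish(text):
    """Drop the separators so readers see ordinary unspaced Chinese.

    Caveat: this also removes spaces between Latin words, so mixed text
    would need a smarter rule.
    """
    return "".join(text.split())

typed = "我 喜欢 中文 搜索"
print(index_spaced(typed))   # {'我': [0], '喜欢': [1], '中文': [2], '搜索': [3]}
print(publish(typed))        # 我喜欢中文搜索
```

With the word boundaries supplied by the writer, indexing reduces to a plain split, which is the overhead saving mentioned above.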

>---------
>
>In another posting, you wrote:
>
>> >The last is simplest: when typing in Chinese, put in a space between
>> >words! That system works well in the West, keyboards already have
>> >spacebars, and it is simple enough for people to do when they have the
>> >habit. The spaces should be ignored by the publication system: they
>> >should be treated as "zero-width spaces".
>
>> I could tell users that if they want their Chinese text to be indexed,
>> they need to separate Chinese words with spaces. I think that if I start
>> from such text to implement indexing and searching, it's going to be much
>> simpler.
>
>I think this is a good idea. But I imagine it would not be very simple to
>type Chinese text separating each word with a space, because in general, in
>Chinese or Japanese input methods, the space bar is used to trigger the
>conversion of the typed pronunciation into Chinese or Japanese
>character(s). For example, you type "f-a-n-g", then you press the space
>bar, and several candidate characters pronounced "fang" appear in a
>little list box, etc. Of course, you can type spaces in a Chinese or
>Japanese text, but this requires another step (for example, pressing the
>Caps Lock key), which is not natural for Japanese or Chinese typists.

Oh, I see. Can you suggest a better markup than a space, one that would be
more convenient for Chinese or Japanese users?

Thanks again, Nobumi, for your input. This helps me tremendously, and I
like the whole challenge.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 8:45 AM by Nobumi Iyanaga
Date Modified: 2/24/2002; 8:45 AM by Nobumi Iyanaga
In Response To: 16181
Hello Emmanuel,

> >
> >I think this is a good idea. But I imagine it would not be very simple to
> >type Chinese text separating each word with a space, because in general, in
> >Chinese or Japanese input methods, the space bar is used to trigger the
> >conversion of the typed pronunciation into Chinese or Japanese
> >character(s). For example, you type "f-a-n-g", then you press the space
> >bar, and several candidate characters pronounced "fang" appear in a
> >little list box, etc. Of course, you can type spaces in a Chinese or
> >Japanese text, but this requires another step (for example, pressing the
> >Caps Lock key), which is not natural for Japanese or Chinese typists.
>
>Oh, I see. Can you suggest a better markup than a space that could be
>more convenient for Chinese or Japanese users?
>

I think this depends on the Chinese input method your users use, and on
their habits. Like any markup system, this would certainly not be very
natural and would require some extra effort. The best would be to ask your
users what they would prefer.

Best regards,

Nobumi Iyanaga
Tokyo,
Japan

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 2:57 PM by Emmanuel. M. Decarie
Date Modified: 2/24/2002; 2:57 PM by Emmanuel. M. Decarie
In Response To: 16185
Hello Nobumi,

>Read on the web at http://community.scriptmeridian.org/16185
>----------------------------------

(...)

>>Oh, I see. Can you suggest a better markup than a space that could be
>>more convenient for Chinese or Japanese users?
>>
>
>I think this depends on the Chinese input method your users use, and on
>their habits. Like any markup system, this would certainly not be very
>natural and would require some extra effort. The best would be to ask your
>users what they would prefer.

I'm a little bit scared of letting users choose the markup themselves,
because this could lead to a lot of programming headaches.

Maybe the choice could be limited to 3 or 4 types of markup.
I need to think a little bit more about this.

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/24/2002; 10:16 PM by Jonathan Lewis
Date Modified: 2/24/2002; 10:16 PM by Jonathan Lewis
In Response To: 16140
Dear Nobumi,

Sorry for the long delay in replying. Thanks for the tip about browsers. I'll take a look at Omniweb and, gulp, Windows browsers.

> How do you render UTF-8 text in Frontier...? As Frontier is not Unicode savvy, I think all text in UTF-8 is garbled in Frontier's windows...? Or do you convert the text from legacy codes to UTF-8 on the fly?

As you say, all UTF-8 text is garbled in Frontier. With message contents, subjects etc., everything can easily be edited in the browser because this is Manila. So I never need to edit those directly in Frontier. The fact that the server is in deepest Saitama, while I'm usually working on the site here in Daiba, is another reason why I don't mind that I can't edit text directly in Frontier.

I have also produced very rudimentary Japanese and (thanks to a friend from China) Simplified Chinese versions of Manila's localization tables, which display page furniture such as the Edit this Page button in the user's language. With those, I input the text directly into Frontier, then run a little script to change the text into UTF-8 using the TEC extension. If I want to edit those strings, I edit the originals then reconvert them to UTF-8.

Best,

Jonathan

Re: Simplified Chinese (GB2312) in Manila

Date Posted: 2/26/2002; 2:04 PM by Henri Asseily
Date Modified: 2/26/2002; 2:04 PM by Henri Asseily
In Response To: 16181
Sorry to chime in so late, and slightly off topic, but I have a couple
of pieces of code that might be of interest:

First, I have developed a fairly optimized piece of Perl code that will
search a full web page (or any long text string) and find all
occurrences of words in a predefined list.
By "fairly optimized" I mean that in 10 milliseconds it can find all
occurrences of 10,000 words in a 100k web page (translating into 50k of
real text). The system is set up as a web service in Apache/mod_perl,
where you predefine a list of words and then call it via http with the
web page you want to parse. It's completely scalable to billions of
searches a day.

The problem with it is that it uses word boundaries, which is no good to
you in the case of Chinese it seems.

However, I also have a VERY optimized piece of C code written by one of
my engineers that streams text and finds all occurrences of a list of
words in that text. The difference with the above is not only that it
does it in less than 1 millisecond, but it doesn't need word boundaries
as it checks each character as it comes in for a match. This would
probably be the solution for you.
You may want to take a look at the code and see if you can replicate it
in UserTalk or Perl or whatever; it's a very small but extremely
efficient algorithm.
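
Henri's C code is not shown in this thread, so the following is not his algorithm. One well-known technique that fits the description (many words, a single pass, each character examined as it arrives, no word boundaries) is the Aho-Corasick automaton. A minimal Python sketch, with made-up function names build_automaton and find_all:

```python
from collections import deque

def build_automaton(words):
    """Build Aho-Corasick tables: transitions, failure links, outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end when this state is reached
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Scan text once; return (start_index, word) for every occurrence."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits

automaton = build_automaton(["中文", "文字", "搜索"])
print(find_all("中文字搜索引擎", automaton))
# [(0, '中文'), (1, '文字'), (3, '搜索')]
```

Overlapping matches such as 中文 and 文字 are both reported, which is what unspaced text needs. The text is scanned once whatever the size of the word list.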

Let me know.

Henri.




RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/26/2002; 2:45 PM by Emmanuel. M. Decarie
Date Modified: 2/26/2002; 2:45 PM by Emmanuel. M. Decarie
In Response To: 16211
Hello Henri,

This sounds very interesting.

Could you put the Perl and C code on a web page so that other interested
people can download it?

I don't have enough C knowledge, but if someone could turn the C
code into a DLL, that could solve a lot of problems for double-byte
languages.

>Read on the web at http://community.scriptmeridian.org/16211
>----------------------------------
>
>Sorry to chime in so late, and slightly off topic, but I have a couple
>of pieces of code that might be of interest:
>
>First, I have developed a fairly optimized piece of Perl code that will
>search a full web page (or any long text string) and find all
>occurrences of words in a predefined list.
>By "fairly optimized" I mean that in 10 milliseconds it can find all
>occurrences of 10,000 words in a 100k web page (translating into 50k of
>real text). The system is set up as a web service in Apache/mod_perl,
>where you predefine a list of words and then call it via http with the
>web page you want to parse. It's completely scalable to billions of
>searches a day.
>
>The problem with it is that it uses word boundaries, which is no good to
>you in the case of Chinese it seems.
>
>However, I also have a VERY optimized piece of C code written by one of
>my engineers that streams text and finds all occurrences of a list of
>words in that text.

In what format is this list of words? If it's a text file, it could
easily be loaded into a GDB in Frontier so the DLL could use this GDB for
a search (I have 119000 words in a CJK word list). Does that sound
doable? Is a table with 119000+ items too wide to be efficient in
Frontier?

>The difference with the above is not only that it
>does it in less than 1 millisecond, but it doesn't need word boundaries
>as it checks each character as it comes in for a match. This would
>probably be the solution for you.

Yes, this looks very promising because it doesn't oblige users to
change the way they write.

>You may want to take a look at the code and see if you can replicate it
>in UserTalk or Perl or whatever; it's a very small but extremely
>efficient algorithm.

Thanks Henri for your input.

Cheers
-Emmanuel


--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 2/28/2002; 6:33 PM by Phil Suh
Date Modified: 2/28/2002; 6:33 PM by Phil Suh
In Response To: 16121
On Thu, 21 Feb 2002, Nobumi Iyanaga wrote:
> And Phil, are you still interested in rendering Japanese text with
> Frontier?

Honto ni hisashiburi desu ne (it really has been a long time). Wow, it's
good to hear from you, Nobumi. Like old times.

I'm still interested in rendering Japanese text with Frontier, and now
Radio. But I'm afraid that there are still major limitations, which
Emmanuel, Daniel, et al. are running into.


WHAT WORKS

Frontier and Radio will take whatever you give it and store it in the ODB
(object database). So if you hack the html form templates correctly so
that the browser sends the correctly encoded text, Frontier/Radio will
blissfully store it in the correct manner. And again, as Frontier/Radio
pulls a message out of the ODB for display, it does not touch it, and it
will work fine.


WHAT DOESN'T WORK

The problem comes when you attempt to manipulate that text--either in a
regex, or with one of the builtins.string verbs.

Since Frontier is not Unicode savvy, it will wind up garbling the text.
In Japanese this is called mojibake. In English, sadness and despair.
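
A small Python sketch of the kind of damage byte-oriented string handling can do to multibyte text. It does not reproduce Frontier's actual behavior, and the helper names clip_bytes and byte_find are invented for this example. It only shows two generic failure modes: cutting a character in half, and "finding" a character that is not there.

```python
def clip_bytes(text, n, encoding="utf-8"):
    """Cut the encoded text at n bytes, as a byte-oriented string verb would."""
    return text.encode(encoding)[:n].decode(encoding, errors="replace")

def byte_find(needle, haystack, encoding="gb2312"):
    """Search raw bytes with no regard for character boundaries."""
    return haystack.encode(encoding).find(needle.encode(encoding))

# Storing and returning the bytes untouched is lossless.
original = "日本語"
assert original.encode("utf-8").decode("utf-8") == original

# Cutting at a byte offset can split a character in two.
print(clip_bytes(original, 4))   # first character followed by U+FFFD

# Two GB2312 characters whose adjoining bytes spell a third one.
target = "中"
first, second = target.encode("gb2312")
text = bytes([0xB0, first]).decode("gb2312") + bytes([second, 0xA1]).decode("gb2312")
print(target in text)            # False: the character is not in the text
print(byte_find(target, text))   # 1: but its two bytes are, straddling a boundary
```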


SUMMARY

You can store and retrieve text, with a little juggling. Anything
interesting, however (searching, regex, any sort of string manipulation,
running text through macros or the renderer) will not work. I'm thinking
primarily of *Japanese text* here, I don't have experience with Simplified
Chinese.


WITH REGARD TO SPACING IN CHINESE

Emmanuel, the strategy of asking your users to add spaces between words is
technically feasible but I think culturally misplaced. Written Japanese
and Chinese don't use spaces between words. Asking your users to add
spaces is not likely to work. It's similar to asking English or French
writers to write *without* spaces. Just not done.
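
One approach raised earlier in the thread that asks nothing of the writer is to index overlapping character pairs (bigrams) instead of words. A minimal Python sketch under that assumption follows. The function names and the sample pages are invented for this example, and single-character queries would need a separate character index.

```python
from collections import defaultdict

def bigrams(text):
    """Return every overlapping two-character slice of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def build_index(pages):
    """Map each bigram to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in bigrams(text):
            index[gram].add(page_id)
    return index

def search(index, pages, query):
    """Intersect the bigram postings, then confirm with a substring check."""
    grams = bigrams(query)
    if not grams:
        return set()
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    return {p for p in candidates if query in pages[p]}

pages = {1: "我喜欢中文搜索", 2: "中国文学", 3: "搜索引擎"}
index = build_index(pages)
print(search(index, pages, "中文"))   # {1}
print(search(index, pages, "搜索"))   # {1, 3}
```

Page 2 contains both 中 and 文 but not next to each other, so it is correctly left out. This is the accuracy gain of bigrams over single characters.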


THE BOOK

Ah yes, Ken Lunde's CJKV Information Processing is a work of art. It's one
of the few computer books on my shelf that makes me smile when I pick it
up. "Everything I'll ever need to know about this topic is in my hands." A
very satisfying feeling. This kind of info does not go out of date
quickly--my book is a first printing, January 1999.


MY OLD, BROKEN, OUT OF DATE SITE

http://filsa.net/frontier/polyglot/

Has some discussions about Japanese in Frontier from, geez, ages ago.


USERLAND AND UNICODE

Userland's COO John Robb wrote me last year to ask what the status of
Unicode in Frontier was (he saw my polyglot site). I wrote a long
response, which, because it is informative, I will forward to this list.

I can understand why Userland has yet to put Unicode support into
Frontier/Radio. It's expensive. And somewhat risky--it's messing around in
the kernel. It's a lot of developer time in the trenches on a not-so-sexy
feature.

OTOH, I think it's a necessary and *practical* feature--and it's also the
way of the world. Every app should, IMHO, support all the world's
languages, because 1) there are supportable standards, 2) it's technically
possible, 3) the world is a smaller place, and 4) English is only 1 of
the world's 4 major language groups (with Hindi, Mandarin, and Spanish)...
but I'm ranting.

Cheers,

Phil

(just got caught up on this thread--and this thread only. Man you guys are
talky.)

RE: Simplified Chinese (GB2312) in Manila

Date Posted: 3/1/2002; 12:01 AM by Emmanuel. M. Decarie
Date Modified: 3/1/2002; 12:01 AM by Emmanuel. M. Decarie
In Response To: 16228
At 17:33 -0500 on 28/02/02, Phil Suh wrote:
>WITH REGARD TO SPACING IN CHINESE
>
>Emmanuel, the strategy of asking your users to add spaces between words is
>technically feasible but I think culturally misplaced. Written Japanese
>and Chinese don't use spaces between words. Asking your users to add
>spaces is not likely to work. It's similar to asking English or French
>writers to write *without* spaces. Just not done.

Hello Phil,

Thanks for your great input.

The idea of adding spaces didn't come from me but from this web page
(<http://www.ascc.net/xml/en/utf-8/faq/zhl10n-faq-xsl2.html>), which
is a FAQ on how to process Chinese text. But Nobumi said the same
thing you did about spaces between words, so I think this
solution is not really a solution.

I think I will maybe run a web service on a FreeBSD or Linux
machine and let Perl index these pages.

There is a nice callback at
config.mainresponder.callbacks.storePageForIndexing. From there you
can decide whether you want Frontier to index the page, and you get
tons of info on the page (the text, the URL, the name of the site,
and so on). So I think it could be trivial to send the content of the
page via XML-RPC to another server running Perl for indexing.
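
A rough sketch of what such a call could carry, written in Python rather than UserTalk or Perl. The method name indexer.storePage and the parameter order are assumptions made for this example. Only the standard xmlrpc.client module is used, and nothing is sent over the network.

```python
from xmlrpc.client import dumps, loads

def build_index_request(url, site_name, text):
    """Serialize one page as an XML-RPC methodCall document."""
    # "indexer.storePage" is a hypothetical method name.
    return dumps((url, site_name, text), methodname="indexer.storePage")

def read_index_request(payload):
    """Recover the method name and parameters, as the indexing server would."""
    params, method = loads(payload)
    return method, params

payload = build_index_request("http://example.com/msg/1", "Demo Site", "中文搜索")
method, (url, site_name, text) = read_index_request(payload)
print(method, url, site_name, text)
```

In a live setup the same call would go through xmlrpc.client.ServerProxy pointed at the indexing server. The point here is only that the page text, URL, and site name travel as ordinary XML-RPC string parameters.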

Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
