RE: Simplified Chinese (GB2312) in Manila
Posted: 2/23/2002; 8:01 AM by Nobumi Iyanaga
Last Modified: 2/23/2002; 8:01 AM by Nobumi Iyanaga
In Response To: RE: Simplified Chinese (GB2312) in Manila (#16167)
Hello Emmanuel,
>
>Also, I had the wrong assumption that 2 chars == 1 word in Simplified
>Chinese when one word in Simplified Chinese could be made by more
>than one ideogram. See for example the word "Internet" on this
>page:<http://www.cjk.org/cjk/samples/chincome.htm>. Maybe the words
>there are composite words close to what in English and French we
>designate by "neologism".
Defining a "word" in Chinese is difficult (although I don't know modern
Chinese at all...). Each character usually has a meaning of its own, and
compounds of several characters have meanings of their own as well.
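To make the difference between indexing single characters and indexing bigrams concrete, here is a minimal sketch in Python (not Frontier). The helper name `char_ngrams` is hypothetical, and the sample word 互联网 is only one common way to write "Internet", chosen for illustration; it is not taken from the linked page.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Assumption: a three-character sample word, used only for illustration.
word = "互联网"

print(char_ngrams(word, 1))  # the single characters
print(char_ngrams(word, 2))  # the overlapping bigrams
```

A three-character word yields three single characters but only two overlapping bigrams, and neither list contains the whole word as one unit.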
>
> >Another way would be to do a linear search, as the text you quoted says.
> >If your texts to be searched are not very long, I guess this would be the
> >simplest way. With GB text, I think you will be able to use any Frontier
> >string verbs, and even regex search. You may use also Mgrep OSAX to do
> >more reliable search with Chinese text...
>
>I don't think my text will be very long. But I'm not sure about this
>"linear search" thing. Is it implying that I need to build a sort of
>binary tree. Can you please provide some examples.
>
>This look that I only need to index not bigrams but chars. Is that
>right? If its the case, how accurate could be the search engine?
>
By "linear search" I simply mean the basic text search that one finds in
every word-processing program. No index or binary tree is needed: the text
is scanned from start to end. Say that I want to find the word "program" in
the previous sentence; I would do something like this:
on search (targetWord, str)
    local (len, pos)
    pos = string.patternMatch (targetWord, str)
    if pos != 0
        len = string.length (targetWord)
        return ({pos, pos + len})
    else
        return (false)
local (str = "By linear search I mean only the basic search of text like the one that one finds in every word-processing program.")
local (targetWord = "program")
print (search (targetWord, str))
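which returns {108, 115}: the position where "program" starts (counting from
1) and the position just past its last character.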
Perhaps this is too simple to be used as a search engine...??
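For readers without Frontier, the same linear search can be sketched in Python. This is an illustration under assumptions, not the Frontier implementation: `search` mirrors the function above, and the second half shows one reason a byte-level search over GB2312 text may need care. Every hanzi in GB2312 is two bytes, so a search over raw bytes can match the second byte of one character followed by the first byte of the next. Whether a given Frontier verb works on bytes or on characters is not something this sketch establishes.

```python
def search(target_word, text):
    """Return (start, end) with a 1-based start and end just past the match,
    or False when target_word does not occur in text."""
    pos = text.find(target_word)  # 0-based index, -1 when absent
    if pos == -1:
        return False
    start = pos + 1  # convert to the 1-based convention used in the post
    return (start, start + len(target_word))


sentence = ("By linear search I mean only the basic search of text like "
            "the one that one finds in every word-processing program.")
print(search("program", sentence))  # (108, 115)

# Caveat for GB2312 (sample text chosen for illustration):
text = "中文"
raw = text.encode("gb2312")   # four bytes, two per character
straddle = raw[1:3]           # 2nd byte of the first hanzi + 1st byte of the second

# A byte-level search reports a hit in the middle of the first character...
print(raw.find(straddle))     # 1

# ...while a character-level search for the same two bytes finds nothing.
print(search(straddle.decode("gb2312"), text))  # False
```

Searching decoded characters, or checking that a byte-level hit starts on a character boundary, avoids this kind of false match.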
---------
In another posting, you wrote:
> >The last is simplest: when typing in Chinese, put in a space between
> >words! That system works well in the West, keyboards already have
> > spacebars, and it is simple enough for people to do when they have the
> >habit. The spaces should be ignored by the publication system: they
> >should be treated as "zero-width spaces".
> I could tell the users that if it want its chinese text to be indexed,
> he need to split chinese word with space. I think that if I start from
> such a text to implement indexing and searching, its going to be much
> more simpler.
I think this is a good idea. But I imagine it would not be very easy to
type Chinese text with a space between each word, because in most
Chinese or Japanese input methods the space bar is used to trigger the
conversion of the typed pronunciation into Chinese or Japanese
character(s). For example, you type "f-a-n-g", then you press the space
bar, and several candidate characters pronounced "fang" appear in a
little list box, etc. Of course, you can type spaces in a Chinese or
Japanese text, but this requires another step (for example, pressing the
Caps Lock key), which is not natural for Japanese or Chinese typists.
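If the text does arrive with spaces between words, the indexing side becomes simple. Here is a minimal sketch in Python (not Frontier); the function names `index_words` and `display_text` and the sample sentence are hypothetical, used only to illustrate splitting on spaces for indexing and dropping them again for display, as the quoted "zero-width spaces" suggestion describes.

```python
def index_words(text):
    """Split on whitespace; each run of non-space characters is one word."""
    return text.split()


def display_text(text):
    """Drop the separators so that readers see unbroken Chinese text."""
    return "".join(text.split())


# Assumption: a sample sentence typed with a space between words.
typed = "我 喜欢 互联网"

print(index_words(typed))   # the words to put in the index
print(display_text(typed))  # the text as it would be published
```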
Good luck, and best regards,
Nobumi Iyanaga
Tokyo,
Japan
Enclosures
None.
Replies
RE: Simplified Chinese (GB2312) in Manila
2/23/2002 by Emmanuel. M. Decarie
>Read on the web at http://community.scriptmeridian.org/16172 >---------------------------------- > >Hello Emmanuel,