RE: Simplified Chinese (GB2312) in Manila
Posted: 2/26/2002; 1:45 PM by Emmanuel. M. Decarie
Last Modified: 2/26/2002; 1:45 PM by Emmanuel. M. Decarie
In Response To: Re: Simplified Chinese (GB2312) in Manila (#16211)
Hello Henri,
This sounds very interesting.
Could you put the Perl and C code on a web page so that others who are
interested can download it?
I don't have enough C knowledge, but if someone could turn the C
code into a DLL, that could solve a lot of problems for double-byte
languages.
>Read on the web at http://community.scriptmeridian.org/16211
>----------------------------------
>
>Sorry to chime in so late, and slightly off topic, but I have a couple
>of pieces of code that might be of interest:
>
>First, I have developed a fairly optimized piece of Perl code that will
>search a full web page (or any long text string) and find all
>occurrences of words in a predefined list.
>By "fairly optimized" I mean that in 10 milliseconds it can find all
>occurrences of 10,000 words in a 100k web page (translating into 50k of
>real text). The system is set up as a web service in Apache/mod_perl,
>where you predefine a list of words and then call it via http with the
>web page you want to parse. It's completely scalable to billions of
>searches a day.
>
>The problem with it is that it uses word boundaries, which is no good to
>you in the case of Chinese it seems.
>
>However, I also have a VERY optimized piece of C code written by one of
>my engineers that streams text and finds all occurrences of a list of
>words in that text.
In what format is this list of words? If it's a text file, it could
easily be loaded into a GDB in Frontier so the DLL could use this GDB
for a search (I have 119,000 words in a CJK word list). Does that sound
doable? Is a table with 119,000+ items too wide to be efficient in
Frontier?
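To make the question concrete, here is a minimal sketch of loading such a word list, assuming the list is a plain text file with one word per line, encoded in GB2312. It is written in Python rather than UserTalk, the file name cjk_words.txt and the function names are made up for illustration, and it says nothing about how a Frontier table of that size would perform.

```python
def load_word_list(lines):
    """Return the set of non-empty words found in an iterable of text lines.

    Assumes one word per line; surrounding whitespace is dropped.
    """
    words = set()
    for line in lines:
        word = line.strip()
        if word:
            words.add(word)
    return words


def load_word_file(path, encoding="gb2312"):
    """Read a word-list file (hypothetical format: one word per line)."""
    with open(path, encoding=encoding) as handle:
        return load_word_list(handle)


# Hypothetical usage:
#   words = load_word_file("cjk_words.txt")
#   "中国" in words
```

Decoding the file into characters first means each word is stored as whole characters rather than as raw double-byte sequences.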
>The difference with the above is not only that it
>does it in less than 1 millisecond, but it doesn't need word boundaries
>as it checks each character as it comes in for a match. This would
>probably be the solution for you.
Yes, this looks very promising because it doesn't oblige users to
change the way they write.
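One well-known way to check each character as it comes in, without relying on word boundaries, is an Aho-Corasick automaton. The sketch below is in Python and is only an illustration of that general technique; it is not Henri's C code, which is not shown in this thread, and the names build_automaton and find_all are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links and per-state output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> words that end when this state is reached
    for word in words:
        if not word:
            continue  # skip empty entries
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the words that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Scan text one character at a time; return (start_index, word) pairs."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches
```

The automaton is built once from the word list and then reused for every page, and the scan never backs up in the text, so it suits streamed input. This sketch works on decoded characters; matching raw GB2312 bytes instead could report a false hit that starts on the second byte of a character.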
>You may want to take a look at the code and see if you can replicate it
>in Usertalk or perl or whatever, it's a very small but extremely
>efficient algorithm.
Thanks, Henri, for your input.
Cheers
-Emmanuel
--
______________________________________________________________________
Emmanuel Décarie / Programmation pour le Web - Programming for the Web
Frontier - Perl - Javascript - XML <http://scriptdigital.com/>
Enclosures: None.
Replies: None.