sysko · July 30, 2012, 18:36:37 UTC · Permalink

Language detector

Hello, good news!

I've put back language auto-detection. It now uses a homebrew tool based on statistics computed from dumps of the database (so it's a good example of what can be achieved with non-direct use of the data; you know, eat your own dog food, etc.).

Assuming no bugs are left, it should have an accuracy of around 98%. As that's not 100%, it can still misdetect sentences, especially for very new users or for people who contribute in several languages of the same group (e.g. Spanish, Italian, Portuguese) and add tricky sentences.

So if you think a sentence was misdetected even though it wasn't a really tricky one, just post it here; that way I will try to see why, and improve the tool.

For those interested in the code: it's actually a standalone API server (with, for the moment, only one API call), so it can be reused by other projects needing automatic detection.
The code is here:
https://github.com/sysko/Tatodetect
under the AGPL v3.
It basically requires the cppcms library, plus Boost.Locale, cppdb, and sqlite3.
It compiles fine on Ubuntu 12.04 and on Debian squeeze (you need to activate C++11 support in the build options).
(I will write more precise build instructions later.)

I will also try to detail the basic mechanism later; for those who can read code, nearly all the magic is done in src/model/Detects.cpp.

sysko · July 30, 2012, 18:38:46 UTC · Permalink

In case anyone asks: it's very easy to generate a database from other data, but I think Tatoeba already gives quite good results.

GrizaLeono · August 3, 2012, 14:00:43 UTC · Permalink

Thanks, Sysko! It works very well.

sacredceltic · July 30, 2012, 18:41:31 UTC · Permalink

Bravo and thank you, sysko! It would be great if Tatoeba became a reference for language detection!

Shadd · July 30, 2012, 19:35:57 UTC · Permalink

Excuse me, but it's not really clear to me where I can find it or what it is, even though I understand how it works. What's its purpose for the project?

sacredceltic · July 30, 2012, 20:19:04 UTC · Permalink

The purpose is to automatically detect the language of the sentence you enter in Tatoeba, rather than having you select the language yourself. It is very helpful to Tatoeba because many people forget to change the flag of their sentences to the right one when they contribute in multiple languages.
To use it, you select "automatic detection" in the language list for your sentence, once (afterwards, that choice is retained automatically), and there you go...
Furthermore, sysko designed it as an API that can be used from anywhere to detect the language of any sentence, based on statistics over Tatoeba's sentences...
That means Tatoeba's corpus might be used as a base for language detection by other services.

Amastan · July 30, 2012, 20:42:15 UTC · Permalink

That's great!!! I have waited for this feature for a long time. Sometimes the language you want to translate into is 6 miles down the list, and it's difficult to use the language list when you keep changing languages. Now I can shift from any language to any other without wasting much time ^v^ Thank you Sysko :-)

sacredceltic · July 30, 2012, 20:50:45 UTC · Permalink

>Sometimes, the language you want to translate into is 6 miles down the list

This one is a different issue: you can also set parameters in your profile to limit the languages that you see and use.

Go to http://tatoeba.org/fre/user/settings and enter the three-letter ISO codes of the languages you want to use, separated by commas: < ber, eng, ara, fre > for example...

This way, the list is limited...

Tatoeba is soooooo resourceful!

Amastan · July 30, 2012, 21:51:07 UTC · Permalink

Thank you very much ^v^

teskmon · July 30, 2012, 21:31:49 UTC · Permalink

Firefox allows you to "find as you type" when the drop-down list is active. Just type the first letters of the language.

marcelostockle · July 30, 2012, 21:42:12 UTC · Permalink

I do that, and that's why I use the German interface:
German is 'D'
English is 'E'
Esperanto is 'EE'
Japanese is 'J'
and Spanish is 'SP'

sacredceltic · July 30, 2012, 21:48:29 UTC · Permalink

Yep, but that doesn't filter the translations you don't want to see...
With over 100 languages now, and given all the variants, a sentence can sometimes have 100+ translations. Profile parameters enable you to filter them, not only to shorten the list...

Shadd · July 31, 2012, 01:06:50 UTC · Permalink

Oh, now I see, thanks.

sacredceltic · July 30, 2012, 20:21:23 UTC · Permalink

http://tatoeba.org/fre/sentences/show/1517450

jakov · July 31, 2012, 08:15:46 UTC · Permalink

*Awesome*!

Would it theoretically make a difference for the software to know which sentences it detected wrongly, or would the detection improve just by taking a new dump of the database (including the sentences that were detected wrongly and manually corrected)?

What I mean is: would it make sense to have an API call saying "wrong guess" when someone changes the flag by hand? Would a memory of what was guessed wrongly improve the program?

sysko · July 31, 2012, 14:49:35 UTC · Permalink

For the moment the algorithm is really naive, with no "learning" yet (apart from using the new dump every week); my first goal was to quickly get something to replace the Google API.

I have some notions of machine learning, but I'm not familiar enough with it to integrate it quickly. But yes, in the future we can imagine that a bad detection decreases the score of the n-grams of the "bad" language contained in the sentence, and increases the score for the language set by the user (see the sketch below).

I will try to put the raw algorithm on the project's wiki page, so that anyone with computer science knowledge can propose improvements, as I don't pretend to have any expertise in this field.
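
Purely as an illustration of that feedback idea (entirely hypothetical; nothing like this exists in the code yet): on a confirmed bad detection, weight would shift away from the wrongly guessed language toward the user-corrected one, for every n-gram of the sentence.

#include <cstddef>
#include <map>
#include <string>
#include <utility>

// (language, n-gram) -> score; a stand-in for the real tables.
using Weights = std::map<std::pair<std::string, std::u32string>, double>;

// Penalise the bad guess and reward the corrected language for each
// n-gram of the misdetected sentence (UTF-32, one element per character).
void applyCorrection(Weights &w, const std::string &guessed,
                     const std::string &corrected,
                     const std::u32string &sentence, std::size_t n,
                     double step = 1.0)
{
    for (std::size_t i = 0; i + n <= sentence.size(); ++i) {
        const std::u32string gram = sentence.substr(i, n);
        w[std::make_pair(guessed, gram)]   -= step;
        w[std::make_pair(corrected, gram)] += step;
    }
}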

sacredceltic · July 31, 2012, 11:20:14 UTC · Permalink

@sysko

Do you maintain a probability value for each word in each language?

sysko · July 31, 2012, 15:11:28 UTC · Permalink

No, for two reasons:
1 - I don't have a "dictionary" for every language (especially for languages like Berber, Shanghainese, etc.)
2 - I cannot rely on the space separator to split words; as you know, that would not work for Japanese, Chinese, etc.

Moreover, that would require using other data sources, which are not that easy to find under an "open" licence for a lot of languages, and it might have required writing some language-specific code.

Instead I use a quite naive and simple method, which is to cut the sentences into groups of 2, 3, 4, and 5 characters (regardless of whether they are punctuation or not).

for example "Bonjour[narrow non break space]!" for the "5 by 5" table will be

Bonjo
onjou
njour
jour[no-break space]
our[no-break space]!

For each pair (language, 5-gram) I keep two numbers: the absolute number of times it appears in Tatoeba, and its relative frequency in that language (and the same for 4-grams, 3-grams, and 2-grams).
I delete the pairs that appear too few times, as they may come from "tricky" sentences (for example, a sentence that says ""你好" is how we say hello in Chinese").

After this, when you enter a sentence:

The sentence is first split into groups of 5 characters, in the same way as described above, and I look up all the languages that contain each group.

In an "absolute" table I keep the absolute score of each 5-gram in every language where it appears; in a "relative" table I keep the relative score. If a 5-gram appears in only one language (this is why I delete those that appear too few times), its absolute and relative scores are multiplied by a bonus, as a "unique" 5-gram is more relevant.

We continue for each 5-gram, summing the scores.

Afterwards we take the top two languages in each table, and we keep the "1st language" whose distance to the 2nd of its table is greater.

So, for example, if at the end we have something like this:

absolute table

1st: Mandarin: 10000
2nd: Cantonese: 9800
[...]
nth: Teochew: 500 (because we have really few sentences in this language)

relative table

1st: Teochew: 0.50
2nd: Mandarin: 0.01
[...]

then we will detect the sentence as Teochew.

This way we take into account that some languages are more likely to appear than others (an unknown user is more likely to contribute in German than in Low Saxon, for example), but we still give a chance to "rare" languages.
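
A hedged sketch of that scoring pass (the types, names, and bonus value here are mine, not the actual ones from the repository):

#include <map>
#include <string>

struct Stats { long absolute; double relative; };   // stored per (language, n-gram)
using PerLanguage = std::map<std::string, Stats>;   // languages containing one n-gram

const double kUniqueBonus = 5.0;  // arbitrary, as noted further down the thread

// Add one n-gram's statistics to the running absolute and relative scores.
void scoreOne(const PerLanguage &stats,
              std::map<std::string, double> &absScore,
              std::map<std::string, double> &relScore)
{
    // An n-gram seen in a single language is more relevant, so boost it.
    const double bonus = (stats.size() == 1) ? kUniqueBonus : 1.0;
    for (const auto &entry : stats) {
        absScore[entry.first] += entry.second.absolute * bonus;
        relScore[entry.first] += entry.second.relative * bonus;
    }
}

// After summing over every n-gram of the input, take the top two languages
// of each table and keep the leader whose margin over its runner-up is
// larger; in the Mandarin/Teochew example above, the relative table wins.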

Moreover, if the sentence is contributed by a known user, we don't keep in the tables the languages in which this user does not contribute (so someone who contributes only in Chinese and Teochew has no chance of getting a sentence detected as Cantonese), except if he starts contributing a lot in Cantonese: then at the next refresh of the database he will be considered as also contributing in Cantonese. So the system is not "fixed" and will adapt over time (though not in as clever a way as Jakov described).


sacredceltic · July 31, 2012, 15:22:17 UTC · Permalink

Clever. I see why it may pose a few problems with shorter sentences...

And do you systematically use the whole corpus, or just an extract of it? How often do you refresh your scoring tables from the updated corpus?

sysko · July 31, 2012, 15:42:14 UTC · Permalink

The whole corpus, as it would be difficult for me to decide what to extract.

I will refresh the data every week. For the moment the script that regenerates the database is quite memory-heavy though fast (something like 4 minutes, but it uses 2 GB of RAM), and the disk-based version I had before was far too slow (8 hours), so neither is suitable to run directly on the server.

By the way, something interesting: to measure how accurate the detector is, I test it on the data added during the week (so not yet included in the detector's database); that's how I got the figure of 98%. And sometimes you discover that a sentence the detector got "wrong" (i.e. different from what the human set) is actually one where the human set the language incorrectly.

So I will try to see if I can generate some kind of web page listing the sentences where what the human set (or what a previous, less accurate version of the detector set) differs from what the current detector guesses, so that a human can use it to quickly find sentences with a wrong flag.

sysko · July 31, 2012, 15:30:47 UTC · Permalink

Oh, I also forgot:

If at the end only one language gets "unique" 5-grams, we don't compare the scores and directly take that one.
Also, if we don't succeed in detecting the language with 5-grams, we fall back to 3-grams and then to 2-grams (because, for example, languages using ideograms have very few "relevant" 5-grams yet, but far more 2-grams).
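
As a rough illustration (function names hypothetical), that fallback amounts to trying window sizes in decreasing order until one of them is decisive:

#include <cstddef>
#include <string>

// detectWithN() stands in for one full scoring pass with a given n-gram
// size; assume it returns an empty string when no language can be decided.
std::string detectWithN(const std::u32string &sentence, std::size_t n);

std::string detect(const std::u32string &sentence)
{
    const std::size_t sizes[] = {5, 3, 2};  // 5-grams, then 3-, then 2-grams
    for (std::size_t n : sizes) {
        const std::string lang = detectWithN(sentence, n);
        if (!lang.empty())
            return lang;                    // stop at the first decisive size
    }
    return "";                              // undetected
}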


Now, some possible ways of improving the algorithm while keeping the same way of working (i.e. simple changes that would not require me to rethink it):

1 - maybe choose a better way of computing the "bonus" (for the moment the bonus is totally arbitrary, and it would perhaps be more effective with a more "scientifically" chosen value)

2 - maybe also use the frequency with which a user contributes in each language to apply a coefficient to the languages (for example, if 99% of my sentences are in French and only 1% in English, then when the scores are close the result is more likely to be French)

3 - maybe apply a malus to a language when it does not appear for a given 5-gram (but the problem is that for some languages we do not have enough data, so we may be missing some "relevant" 5-grams for that language, which would cause false negatives)

4 - for "unique" languages , maybe apply a low-value filter, I mean if for a long sentences we get like 25 5-grams that appears only in esperanto and only 1 that appears only in spanish , for the moment we will continue and compare score etc. because we don't have only one language containing "unique" 5-grams


Of course, as said, there are certainly much more efficient algorithms, but those would require rewriting from scratch (well, not everything, as the server part etc. would stay the same).

sacredceltic · July 31, 2012, 15:59:42 UTC · Permalink

Very clever, and with room for improvement...

The idea of a page showing sentences with probably wrong flags is great (it should be sortable or filterable by author...), especially if correcting wrong flags also improves the whole scheme.

I agree that point 3 is tricky, because many rare words are not yet in the corpus (as you well know for French...).

Probably the worst problem, polluting the stats, is proper nouns, especially when they are not translated (Mary, Tom, ...). Maybe you should consider a way to exclude noun-characteristic (rather than language-characteristic) n-grams from the scoring...

sysko · July 31, 2012, 16:14:33 UTC · Permalink

Actually, proper nouns don't pollute things that much, because they appear with nearly the same absolute count across languages, and so do not favour one language over another.

As for words not yet in the corpus: for French, or any language with an alphabet, that's not much of a problem, because most of the time they follow the same patterns. If we take, for example, the ~3000 missing words in French among the 20,000 most common ones, most of them are conjugated forms (so with 5-grams like "avait" etc.), adverbs in -ement, or rare words of an existing family ("innovations", for example, may already have the 5-gram "innov" from innover/innovant, and "ation" from a lot of other words in -ation).

Good idea about sorting by author (maybe by language, then author; and maybe Jakov could help add some JavaScript to allow dynamic sorting).

artyom · August 1, 2012, 17:31:15 UTC · Permalink

Just a small question...

How do you integrate the language detection, which uses CppCMS, with the PHP-based Tatoeba?

sysko · August 1, 2012, 22:47:14 UTC · Permalink

Actually, the language detection is done by a separate service running on a different server; Tatoeba simply connects to it with a simple curl call.
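
For the curious, on the Tatoeba side this boils down to a single HTTP request from PHP; here is the same idea sketched in C++ with libcurl (the endpoint and query format below are made up for illustration, they are not the actual API):

#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the response body to a std::string.
static size_t collect(char *data, size_t size, size_t nmemb, void *out)
{
    static_cast<std::string *>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main()
{
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    std::string response;
    // Hypothetical URL; the real service address and parameters differ.
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://detector.example.org/detect?query=Bonjour");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    const CURLcode rc = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    if (rc != CURLE_OK) return 1;
    std::cout << response << '\n';  // e.g. a language code such as "fra"
    return 0;
}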

sysko · August 1, 2012, 22:58:22 UTC · Permalink

The goal is to have all these "non-vital" features, the ones not tightly coupled to the core feature of adding/removing/editing sentences/links, handled by standalone services, for several reasons:
1 - they can be reused by other projects (for example, the autocompletion you have on tags is a "standalone" server too, written in C by a friend of mine who uses it for autocompletion of usernames etc.)
2 - in case of data corruption, a crash, or whatever, it will not disturb the main service
3 - I can stop some services in case of nasty bugs (security holes, possible data corruption) or of high load on the server, in order to focus on the main features; if I also make the Wall standalone, we would still have the Wall when there are problems with the main features
4 - it permits simple "scalability" (if in the future we have several servers, we can put the core on one, the Wall and language detection on another, the autocompletion features on yet another, etc.)
5 - it may make it easier to split the work among several people

Of course this may duplicate some code, so if I really start to generalize this approach, maybe I will need a somewhat higher-level library layer on top of cppcms?

(I know my last points are not related to your question, but as I'm not a "web architect" expert or anything, I've wanted for a long time to start a discussion about these points with people who have more experience.)

sacredceltic · August 2, 2012, 09:30:02 UTC · Permalink

That sounds precisely like a sound architecture...

al_ex_an_der · August 2, 2012, 13:10:54 UTC · Permalink

http://tatoeba.org/deu/sentences/show/1746471

artyom · August 6, 2012, 09:15:10 UTC · Permalink

I see.

I just think that the overhead of calling a foreign service may be significant:

1. You need to hold an entire thread (PHP process?) to wait for the response from the server.
2. You have all the RPC overhead (format/parse/create connection, etc.).

I'd rather combine all the code in a single project.

I mean, for example: you already have the core of Tatoeba using CppCMS with the new, fast DB. Run that part with CppCMS; on the other hand, you can keep all the management code, like the wall, user management, etc., in CakePHP and replace it slowly in the future (if needed at all).

I would use this strategy.

Regards,
Artyom

Scott · August 2, 2012, 16:00:18 UTC · Permalink

Thanks for your work, Sysko. It seems to be working well.

sysko · August 2, 2012, 16:05:33 UTC · Permalink

Great! Actually, it works thanks to all the contributions, so we have all contributed to making it, in one way or another :)

Shadd · August 7, 2012, 02:56:34 UTC · Permalink

What if it had some way to show the language it chose before the sentence is submitted? Perhaps it would require some code beyond plain PHP, but it would make it easier to experiment and stress the algorithm, and we would be able to correct any mistake before it happened.
Just throwing the suggestion out there: it would probably require a lot of work for such a small feature.

duran · August 5, 2012, 10:26:09 UTC · Permalink

Thank you, Sysko. It works well.