sysko's messages on the Wall (total 1,397)

sysko - August 4, 2012 at 11:34:46 PM UTC

We're all very glad to have you too :)

About the sentence-checking process: advanced contributors can add tags, so if you think a Chinese sentence is correct, you can add an "OK" tag, and other users will then see that the sentence has been checked. But I agree that right now we're missing a wiki where we could record things like "the Chinese sentences from #1 to #10000 have all been checked" or "Sadhen is currently checking sentences #10000 to #11000", and so on.
I hope to add this kind of feature soon.

(For the rest, I'm taking the initiative to translate into English, as it may interest non-Chinese speakers and learners.)
(The previous part was about the checking process.)
Basically, to ease the checking process, Sadhen proposes that each native speaker take charge of one or more users, checking their sentences and tagging them "OK". That way it will be easier to avoid several people checking the same sentences while others are left aside.

To make this easier, I will soon try to set up a sort of wiki on Tatoeba (writable at first only by advanced contributors, so that there isn't yet another place to watch for spammers), where we can centralize this kind of information.

sysko - August 4, 2012 at 11:18:55 PM UTC

I deleted it, thanks again :)

sysko - August 3, 2012 at 5:49:56 PM UTC

There's already more or less one, but the problem doesn't come at registration, as someone can register "by hand" and then let the script run. I did put in some code based on sessions, but it seems that's not enough. I can try to add a check before a private message is sent: if the same message has already been sent 5 times, then also send me a private message so I can check.

sysko - August 3, 2012 at 3:26:35 PM UTC

I've deleted the messages and blocked the user, thank you for notifying me :)

sysko - August 2, 2012 at 4:05:33 PM UTC

Great. Actually it works thanks to all the contributions, so we have all contributed to making it, in one way or another :)

sysko - August 1, 2012 at 10:58:22 PM UTC

The goal is to have all the features that are "non-vital" and not tightly coupled to the core feature of adding/removing/editing sentences and links handled by standalone services, for several reasons:
1 - They can be reused by other projects (for example, the autocompletion you have on tags is a "standalone" server too, written in C by a friend of mine who uses it for autocompletion of usernames, etc.)
2 - In case of data corruption, a crash or whatever, that will not disturb the main service.
3 - I can stop some services in case of nasty bugs (security hole, possible data corruption) or of high load on the server, in order to focus on the main features. If I also make the Wall "standalone", that will let the Wall stay up even when I have problems with the main features.
4 - It allows simple "scalability" (if in the future we have several servers, we can put the core on one, the Wall and language detection on another, the autocompletion features on yet another, etc.)
5 - It may make it easier to split the work among several people.

Of course, this may duplicate some code, so if I start to really generalize this approach, maybe I will need a slightly higher-level library layered on top of cppcms?

(I know my last points are not related to your questions, but as I'm not a "web architect" expert or anything, I've wanted for a long time to start a discussion about them with people who have more experience.)

sysko - August 1, 2012 at 10:47:14 PM UTC

Actually the language detection is done by a separate service that runs on a different server; Tatoeba simply "connects" to it with a simple curl call.
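In case it helps to picture that call: below is a minimal sketch of how a web application could query such a standalone detection service over HTTP with libcurl. The URL, port and query parameters are made up for the example and are not Tatodetect's actual API.

```cpp
// Minimal sketch of calling a standalone detection service over HTTP.
// The endpoint and parameters below are hypothetical, not the real API.
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl write callback: append the response body to a std::string.
static size_t writeCallback(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    std::string response;
    // Hypothetical endpoint of the detection service.
    curl_easy_setopt(curl, CURLOPT_URL,
                     "http://localhost:8080/detect?query=Bonjour%20!&user=sysko");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeCallback);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    if (curl_easy_perform(curl) == CURLE_OK)
        std::cout << "detected language: " << response << std::endl;

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```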

sysko - July 31, 2012 at 4:14:33 PM UTC

Actually, nouns and the like don't pollute the results that much, because they tend to appear with nearly the same absolute frequency across the corpus and so don't favor one language over another.

As for words not yet in the corpus, for French or other languages with an alphabet that's not much of a problem, because most of the time they follow the same patterns: if we take for example the ~3,000 missing words among the 20,000 most common French words, most of them are conjugated forms (covered by 5-grams like "-avait"), adverbs in "-ement", or rare members of a word family ("innovations", for example, may already be covered by the 5-gram "innov" from "innover"/"innovant" and by "ation" from a lot of other words ending in "-ation").

Good idea about sorting by author (maybe by language and then by author; after that, maybe Jakov could help add some JavaScript to allow dynamic sorting).

sysko - July 31, 2012 at 3:42:14 PM UTC

The whole corpus, as it would be difficult for me to decide what to extract and what to leave out.

I will try to refresh the data every week. For the moment the script that regenerates the database is quite memory-heavy though fast (something like 4 minutes, but it uses 2 GB of RAM), and the "disk-based" version I had before was far too slow (8 hours), so neither is suitable to run directly on the server.

By the way, something interesting: to measure how accurate it is, during my tests I ran it on the data added during the week (so not yet included in the detector's database); that's how I got this figure of 98%. And sometimes you discover that a sentence on which the detector is "wrong" (i.e. different from what the human set) is actually one where the human set the wrong language.

So I will try to see if I can generate some kind of web page listing the sentences where what the human (or previously the detector, with less data) set differs from what the current detector guesses, so that a human can then use it to quickly find sentences with a wrong flag.

sysko - July 31, 2012 at 3:30:47 PM UTC

Oh, I also forgot:

If in the end only one language gets "unique" 5-grams, we don't compare the scores and we directly take that one.
Also, if we didn't manage to detect the language with 5-grams, we switch to 3-grams and then to 2-grams (because, for example, languages using ideograms have very few "relevant" 5-grams yet, but far more 2-grams).
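To make that fallback concrete, here is a small sketch of the control flow; detectWith is a placeholder standing in for the scoring step, not the real Tatodetect code, and returns an empty string when nothing matched:

```cpp
// Sketch of the n-gram length fallback: try 5-grams, then 3-grams, then
// 2-grams, and stop at the first size that yields a guess.
#include <functional>
#include <iostream>
#include <string>

std::string detectLanguage(
    const std::string& sentence,
    const std::function<std::string(const std::string&, int)>& detectWith) {
    for (int n : {5, 3, 2}) {
        std::string lang = detectWith(sentence, n);
        if (!lang.empty())
            return lang;   // the first n-gram size that produced a result wins
    }
    return "unknown";
}

int main() {
    // Dummy scorer: pretend 5-grams and 3-grams fail (as for a short
    // ideogram-only sentence) and 2-grams succeed.
    auto dummy = [](const std::string&, int n) {
        return n == 2 ? std::string("cmn") : std::string();
    };
    std::cout << detectLanguage("你好吗?", dummy) << std::endl;   // prints "cmn"
}
```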


Now, some possible ways of improving the algorithm while keeping the same way of working (i.e. simple changes that would not require me to rethink it):

1 - Maybe choose a better way of assigning the "bonus" (for the moment the bonus is totally arbitrary, and it would probably work better with a more "scientifically" chosen value).

2 - Maybe also use the "frequency" with which a user contributes in each language to apply a coefficient to the languages (for example, if 99% of my sentences are in French and only 1% in English, then when the scores are close the sentence is more likely to be French); see the sketch after this list.

3 - Maybe apply a malus to a language when a given 5-gram does not appear in it (but the problem is that for some languages we don't have enough data, so we may be missing some "relevant" 5-grams for that language, which would cause false negatives).

4 - For "unique" 5-grams, maybe apply a low-value filter: I mean, if for a long sentence we get something like 25 5-grams that appear only in Esperanto and just 1 that appears only in Spanish, for the moment we still go on and compare the scores, because we don't have only one language containing "unique" 5-grams.
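The sketch below illustrates idea 2, the per-user coefficient. The weighting formula and all names are illustrative assumptions, not something that exists in the detector today:

```cpp
// Rough sketch of improvement 2: weight each candidate language's score by
// the share of the user's past contributions in that language, so that when
// scores are close the user's usual languages win.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, double> applyUserPrior(
    const std::map<std::string, double>& scores,        // language -> n-gram score
    const std::map<std::string, double>& userShare) {   // language -> fraction of the user's sentences
    std::map<std::string, double> weighted;
    for (const auto& p : scores) {
        auto it = userShare.find(p.first);
        double share = (it != userShare.end()) ? it->second : 0.0;
        // Small floor so a language the user never wrote in is penalised
        // but not eliminated outright.
        weighted[p.first] = p.second * (0.1 + share);
    }
    return weighted;
}

int main() {
    // 99% of the user's sentences are French, 1% English; the raw scores are close.
    std::map<std::string, double> scores = {{"eng", 9.5}, {"fra", 10.0}};
    std::map<std::string, double> share  = {{"eng", 0.01}, {"fra", 0.99}};
    for (const auto& p : applyUserPrior(scores, share))
        std::cout << p.first << ": " << p.second << std::endl;   // French ends up well ahead
}
```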


Of course, as said, there are certainly much more efficient algorithms, but those would require rewriting from "scratch" (well, not everything, as the server part etc. would stay the same).

sysko - July 31, 2012 at 3:11:28 PM UTC

No, for two reasons:
1 - I don't have a "dictionary" for every language (especially for languages like Berber, Shanghainese, etc.)
2 - And I cannot rely on the space separator to "split" words; as you know, that would not work for Japanese, Chinese, etc.

Moreover, that would require using other data sources, which are not that easy to find under an "open" licence for a lot of languages, and it might have required writing some language-specific code.

Instead I use a quite naive and simple method, which is to cut the sentences into groups of 2, 3, 4 and 5 characters (regardless of whether they are punctuation or not).

For example, "Bonjour[narrow no-break space]!" gives, for the "5 by 5" table:

Bonjo
onjou
njour
jour[narrow no-break space]
our[narrow no-break space]!

For each pair (language, 5-gram) I keep two numbers: the absolute number of times it appears in Tatoeba, and its relative frequency in that language
(same for 4-grams, 3-grams and 2-grams).
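As a rough illustration of this table-building step, here is a sketch that splits sentences into overlapping n-grams of 2 to 5 characters and counts them per language. It works on std::u32string so that a "character" means a Unicode code point (the real code relies on Boost.Locale for the UTF-8 side); the data layout and names are assumptions for the example.

```cpp
// Sketch: build per-language n-gram counts from the corpus.
#include <iostream>
#include <map>
#include <string>

using Counts = std::map<int,                               // n (2..5)
                 std::map<std::u32string,                  // n-gram
                   std::map<std::string, long>>>;          // language -> absolute count

void addSentence(Counts& counts, const std::u32string& sentence,
                 const std::string& lang) {
    for (size_t n = 2; n <= 5; ++n)
        for (size_t i = 0; i + n <= sentence.size(); ++i)
            ++counts[n][sentence.substr(i, n)][lang];
}

int main() {
    Counts counts;
    addSentence(counts, U"Bonjour !", "fra");
    // The 5-grams of "Bonjour !" are: "Bonjo", "onjou", "njour", "jour ", "our !"
    std::cout << counts[5].size() << " distinct 5-grams" << std::endl;   // prints 5

    // From these absolute counts one can also derive, per language, the
    // relative frequency of each n-gram.
}
```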
I delete the pairs that appear too few times, as they may come from "tricky" sentences (for example a sentence that says ""你好" is how we say hello in Chinese").

Then, when you enter a sentence:

The sentence is first split into groups of 5 characters, in the same way as described above.
I look up all the languages that contain the first group.

In an "absolute" table I keep the absolute score of this 5-gram for every language it appears in;
in a "relative" table I keep the relative score.
If this 5-gram appears in only one language (the reason why I delete those that appear too few times), its absolute and relative scores are multiplied by a bonus, as a "unique" 5-gram is more relevant.

We continue for each 5-gram, summing the scores.

Then we take the top two languages in each table, and we keep the 1st-place language from whichever table has the bigger "distance" between its 1st and 2nd places.

So, for example, if at the end we have something like this:


absolute table

1st: Mandarin: 10000
2nd: Cantonese: 9800
[...]
nth: Teochew: 500 (because we have really few sentences in this language)

relative table

1st: Teochew: 0.50
2nd: Mandarin: 0.01
[...]

we will detect it as Teochew.

This way we both take into account that some languages are more likely to appear than others (an unknown user is more likely to contribute in German than in Low Saxon, for example) while still giving a chance to "rare" languages.

Moreover, if the sentence is contributed by a known user, we don't keep in the tables the languages that this user does not contribute in (so someone who contributes only in Chinese and Teochew has no chance of getting a sentence detected as Cantonese), unless he starts contributing a lot in Cantonese, in which case at the next refresh of the database he will be counted as also contributing in Cantonese. So the system is not "fixed" and will adapt over time (though not in as clever a way as Jakov described).
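Putting those pieces together, here is a condensed sketch of the scoring step: sum absolute and relative scores per language over the sentence's 5-grams, boost 5-grams unique to one language, then keep the leader of whichever table separates its top two candidates the most. The table layout, the bonus value and all names are assumptions for illustration, not the real Detects.cpp code; the unique-language shortcut, the 3-gram/2-gram fallback and the per-user filter are left out for brevity.

```cpp
// Sketch: score a sentence against per-language 5-gram statistics.
#include <iostream>
#include <map>
#include <string>
#include <utility>

struct GramStats { long absolute; double relative; };   // stats of one 5-gram in one language
using GramTable = std::map<std::u32string, std::map<std::string, GramStats>>;

// Best language of one score table, together with its lead over the runner-up.
static std::pair<std::string, double> leader(const std::map<std::string, double>& t) {
    std::string best;
    double first = 0, second = 0;
    for (const auto& p : t) {
        if (p.second > first) { second = first; first = p.second; best = p.first; }
        else if (p.second > second) { second = p.second; }
    }
    return {best, first - second};
}

std::string detect(const std::u32string& sentence, const GramTable& table) {
    const double bonus = 5.0;   // arbitrary, as described above
    std::map<std::string, double> absScore, relScore;
    for (size_t i = 0; i + 5 <= sentence.size(); ++i) {
        auto it = table.find(sentence.substr(i, 5));
        if (it == table.end()) continue;
        bool unique = (it->second.size() == 1);          // 5-gram seen in only one language
        for (const auto& lang : it->second) {
            double b = unique ? bonus : 1.0;
            absScore[lang.first] += b * lang.second.absolute;
            relScore[lang.first] += b * lang.second.relative;
        }
    }
    auto a = leader(absScore);
    auto r = leader(relScore);
    if (a.first.empty() && r.first.empty()) return "";   // no 5-gram matched at all
    // Keep the table whose 1st place is furthest ahead of its 2nd place.
    return (a.second >= r.second) ? a.first : r.first;
}

int main() {
    GramTable table;
    table[U"Bonjo"]["fra"] = {120, 0.4};                 // unique to French
    table[U"onjou"]["fra"] = {110, 0.3};
    table[U"onjou"]["epo"] = {5, 0.01};
    std::cout << detect(U"Bonjour !", table) << std::endl;   // prints "fra"
}
```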


sysko - July 31, 2012 at 2:49:35 PM UTC

For the moment the algorithm is really naive, with no "learning" yet (apart from using the new dump every week); my first goal was to quickly get something to replace the Google API.

I have some notions of machine learning, but I'm not familiar enough with it to integrate it quickly. But yes, I think in the future we can imagine that a bad detection decreases the scores of the "bad" language's n-grams contained in the sentence, and increases the scores for the language actually set by the user.
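That feedback idea is not implemented; as a sketch of what it could look like (names and the step value are purely illustrative assumptions):

```cpp
// Sketch of the feedback idea: when a human corrects a detection, lower the
// wrongly guessed language's weight for every n-gram of the sentence and
// raise the weight of the language the user actually set.
#include <iostream>
#include <map>
#include <string>

using Weights = std::map<std::u32string, std::map<std::string, double>>;

void feedback(Weights& w, const std::u32string& sentence,
              const std::string& wrongLang, const std::string& correctLang) {
    const double step = 1.0;                       // arbitrary learning step
    for (size_t n : {5, 3, 2})
        for (size_t i = 0; i + n <= sentence.size(); ++i) {
            const std::u32string gram = sentence.substr(i, n);
            w[gram][wrongLang]   -= step;          // penalise the bad guess
            w[gram][correctLang] += step;          // reward the human's choice
        }
}

int main() {
    Weights w;
    // A sentence that was wrongly detected as Czech but flagged as German by a user.
    feedback(w, U"Guten Morgen!", "ces", "deu");
    std::cout << w.size() << " n-grams adjusted" << std::endl;
}
```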

I will try to put the raw algorithm on the project's wiki page, so that anyone with computer science knowledge can propose improvements, as I don't claim any expertise in this field.

sysko - July 30, 2012 at 6:38:46 PM UTC

In case someone asks: it's very easy to generate a database from other "data", but I think Tatoeba already gives quite good results.

sysko - July 30, 2012 at 6:36:37 PM UTC

Language detector

Hello, good news

I've put the automatic language detection back. It now uses a homebrew tool based on statistics computed from dumps of the database (so it's a good example of what can be achieved with non-direct use of the data; you know, eat your own dog food, etc.).

If there's no bug left, it's supposed to have an accuracy of around 98%. As it's not 100%, it can still misdetect sentences, especially for very new users or for people speaking several languages of the same group (e.g. Spanish, Italian, Portuguese) who add tricky sentences.

So if you think a sentence was misdetected even though it was not a really tricky one, just post it here; that way I will try to see why, and improve the tool.

For those interested in the code: it's actually a standalone API server (with, for the moment, only one API call), so it can be reused by other projects needing automatic detection.
The code is here:
https://github.com/sysko/Tatodetect
under the AGPL v3.
It basically requires the cppcms, Boost.Locale, cppdb and sqlite3 libraries.
It compiles fine on Ubuntu 12.04 and on Debian squeeze (you need to activate C++11 support in the build options).
(I will write more precise build instructions later.)

I will also try to detail the basic mechanism later; for those who can read code, nearly all the magic happens in src/model/Detects.cpp.

sysko - July 30, 2012 at 6:21:20 PM UTC

OK, I corrected the bug (see my last post);
now even your crafted one is detected as Esperanto :)

sysko - July 30, 2012 at 6:20:15 PM UTC

OK, it was a stupid mistake on my part: I was storing a float value in an integer variable for the "frequency score", so it was always 0...

Now it should work fine.

sysko - July 30, 2012 at 3:47:27 PM UTC

Yep, it seems I have a bug in the way it calculates the "score" for each language to rank which one is the most probable, which makes some German sentences get detected as Czech (as they certainly have n-grams in common).
If you also got one or two badly detected sentences, I can work on them to figure out where the problem is.

sysko - July 30, 2012 at 2:54:32 PM UTC

OK, thanks; with these examples I should manage to find the cause.

sysko - July 30, 2012 at 2:50:52 PM UTC

A newcomer won't yet appear in the users table, and so will be treated as if the query were made without the "user" filter.

sysko - July 30, 2012 at 1:33:06 PM UTC

No, it's based on the contributions you have made: it filters out the languages in which you've made only very, very few contributions (something like one or two sentences; to be exact, 350 characters, a purely arbitrary limit).

Can you give me one or two badly detected sentences? It's quite strange.