Wall (7,137 threads)
Tips
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

La La La.
Boring and stupid site.
I want my account deleted.
People here are stupid and boring too.
Fuck y'all.

co'o ("goodbye" in Lojban)

Good riddance!

> Boring and stupid site.
Your opinion is very important to us. :)

Hi everyone from Italy!!
I'm so glad I discovered Tatoeba... it rocks!

Welcome to Tatoeba!

A good day to you. :)

Hi, thanks for the compliment* and welcome!
*http://tatoeba.org/eng/sentences/show/1752723 ☺ ☺ ☺

Welcome!

Language detector
Hello, good news!
I've put back the language auto-detection. It now uses a homebrew tool based on statistics computed from a dump of the database (so it's a good example of what can be achieved with non-direct use of the data; you know, eat your own dog food, etc.).
If there are no bugs left, it should have an accuracy of around 98%. As that's not 100%, it can still misdetect sentences, especially for very new users or for people contributing in several languages of the same group (e.g. Spanish, Italian, Portuguese) who add tricky sentences.
So if you think a sentence was misdetected even though it was not a really tricky one, just post it here; that way I can try to see why, and improve the tool.
For those interested in the code: it's actually a standalone API server (with, for the moment, only one API call), so it can be reused by other projects needing automatic language detection.
The code is here:
https://github.com/sysko/Tatodetect
under the AGPL v3.
It basically requires the CppCMS library, plus Boost.Locale, CppDB, and SQLite3.
It compiles fine on Ubuntu 12.04 and on Debian Squeeze (you need to activate C++11 support in the build options).
(I will write more precise build instructions later.)
I will also try to detail the basic mechanism behind it later; for those who can read code, nearly all the magic is done in src/model/Detects.cpp.

In case someone asks: it's quite easy to generate a database from other data, but I think Tatoeba alone already gives quite good results.

Thanks, Sysko! It works very well.

Thanks for your work, Sysko. It seems to be working well.

Great! Actually, it works thanks to everyone's contributions, so we have all helped make it in one way or another. :)

What if it showed the language it chose before the sentence is submitted? It would probably require some code beyond plain PHP, but experimenting to stress the algorithm would be easier, and we would be able to correct any mistake before it happened.
Just throwing the suggestion out there: it would probably require a lot of work for such a small feature.

Just a small question...
How do you integrate the language detection, which uses CppCMS, with the PHP Tatoeba?

The goal is to have all these features that are "non-vital" and not tightly coupled to the core feature (adding/removing/editing sentences and links) handled by standalone services, for several reasons:
1 - They can be reused by other projects (for example, the autocompletion you have on tags is a "standalone" server too, written in C by a friend of mine who uses it for autocompletion of usernames etc.)
2 - In case of data corruption, a crash, or whatever, it will not disturb the main service.
3 - I can stop some services in case of nasty bugs (security holes, possible data corruption) or high load on the server, in order to focus on the main features. If I also make the Wall a standalone service, the Wall will still be available when I have problems with the main features.
4 - It permits simple "scalability": if in the future we have several servers, we can put the core on one, the Wall and language detection on another, the autocompletion features on a third, etc.
5 - It may make it easier to split the work among several people.
Of course, that may duplicate some code, so if I really start to generalize this process, maybe I will need a little higher-level library layer on top of CppCMS.
(I know my last points are not related to your question, but as I'm not a "web architect" expert or anything, I've wanted for a long time to start a discussion about them with people who have more experience.)

That sounds precisely like a sound architecture...


I see,
I just think that the overhead of calling a foreign service may be significant:
1. You need to hold an entire thread (PHP process?) to wait for the response from the server.
2. You have all the RPC overhead (format/parse/create connection, etc.).
I'd rather combine all the code in a single project.
I mean, for example: you already have the core of Tatoeba using CppCMS with the new, fast DB. Run that with CppCMS; on the other hand, you can keep all the management code (wall, user management, etc.) in CakePHP and replace it slowly in the future (if needed at all).
I would use this strategy
Regards,
Artyom

Actually, the language detection is done by a separate service running on a different server; Tatoeba simply connects to it by making a simple curl call.
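For those curious what such a call looks like, here is a minimal sketch in Python of building the detection request. The endpoint path and parameter names are my assumptions for illustration, not necessarily the actual Tatodetect API; check the project source for the real call.

```python
from urllib.parse import urlencode

def build_detect_request(base_url, sentence, username=None):
    """Build the URL for a language-detection query.

    NOTE: the "/detects/simple" path and the "query"/"user" parameter
    names are hypothetical, used only to illustrate the idea.
    """
    params = {"query": sentence}
    if username is not None:
        # A known user narrows detection to the languages they contribute in.
        params["user"] = username
    return base_url + "/detects/simple?" + urlencode(params)

url = build_detect_request("http://example.org/api", "Bonjour !", "sysko")
```

The PHP side would then simply `curl` that URL and read back the detected language code.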

@sysko
Do you maintain a probability value for each word in each language?

No, for two reasons:
1 - I don't have a dictionary for every language (especially for languages like Berber, Shanghainese, etc.).
2 - I cannot rely on the space separator to split words; as you know, that would not work for Japanese, Chinese, etc.
Moreover, that would require using other data sources, which are not that easy to find under an open licence for a lot of languages, and it might have required writing some language-specific code.
Instead I use a quite naive and simple method, which is to cut the sentences into groups of 2, 3, 4, and 5 characters (regardless of whether they are punctuation or not).
For example, "Bonjour[narrow no-break space]!" for the "5 by 5" table will be:
Bonjo
onjou
njour
jour[narrow no-break space]
our[narrow no-break space]!
For each pair (language, 5-gram) I keep two numbers: the absolute number of times it appears in Tatoeba, and its relative frequency in that language.
(Same for the 4-grams, 3-grams, and 2-grams.)
I delete the pairs that appear too few times, as they may come from "tricky sentences" (for example, a sentence that says
« "你好" is how we say hello in Chinese »).
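The splitting and counting described above can be sketched as follows. This is a toy version: the function names and the cutoff value are mine, not taken from the actual Detects.cpp code.

```python
from collections import Counter

def ngrams(sentence, n):
    # Cut the sentence into overlapping groups of n characters,
    # keeping spaces and punctuation.
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

def ngram_tables(sentences_by_lang, n, min_count=2):
    # For each (language, n-gram) pair, keep the absolute count and the
    # relative frequency in that language; drop pairs seen fewer than
    # min_count times, since they may come from "tricky" sentences.
    absolute, relative = {}, {}
    for lang, sentences in sentences_by_lang.items():
        counts = Counter(g for s in sentences for g in ngrams(s, n))
        counts = {g: c for g, c in counts.items() if c >= min_count}
        total = sum(counts.values()) or 1
        absolute[lang] = counts
        relative[lang] = {g: c / total for g, c in counts.items()}
    return absolute, relative

ngrams("Bonjour!", 5)  # → ['Bonjo', 'onjou', 'njour', 'jour!']
```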
Then, when you enter a sentence:
The sentence is first split into groups of 5 characters, the same way as described above.
I look up all the languages containing the first group.
In an "absolute" table I keep the absolute score of this 5-gram in every language where it appears;
in a "relative" table I keep the relative score.
If this 5-gram appears in only one language (the reason why I delete those that appear too few times), the absolute and relative scores for this 5-gram are multiplied by a bonus, as a "unique" 5-gram is more relevant.
We continue like this for each 5-gram, summing the scores.
At the end we take the top two languages in each table, and we keep the first-place language that has the greater "distance" from the second place in its table.
So that, for example, if at the end we have something like this:
absolute table
1st: Mandarin: 10000
2nd: Cantonese: 9800
[...]
nth: Teochew: 500 (because we have really few sentences in this language)
relative table
1st: Teochew: 0.50
2nd: Mandarin: 0.01
[...]
we will detect it as Teochew.
This way we take into account that some languages are more likely to appear than others (an unknown user is more likely to contribute in German than in Low Saxon, for example), but we still give rare languages a chance.
Moreover, if the sentence is contributed by a known user, we don't keep in the table the languages this user does not contribute in (so someone who contributes only in Chinese and Teochew has no chance of getting a sentence detected as Cantonese), except if they start contributing a lot in Cantonese; at the next refresh of the database they will then be considered as also contributing in Cantonese. So the system is not fixed and will adapt over time (though not in a clever way, as Jakov described).
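The decision between the two tables in the example above can be sketched like this. How exactly the "distance" to the runner-up is measured is my assumption (I compare each table's lead as a fraction of its top score), since the post doesn't specify it:

```python
def pick_language(absolute_scores, relative_scores):
    """Take the top two languages in each table and keep the first-place
    language whose lead over second place is larger. The lead is
    measured as a fraction of the top score (an assumption)."""
    def top_gap(scores):
        (l1, s1), (l2, s2) = sorted(scores.items(), key=lambda kv: -kv[1])[:2]
        return l1, (s1 - s2) / s1
    lang_abs, gap_abs = top_gap(absolute_scores)
    lang_rel, gap_rel = top_gap(relative_scores)
    return lang_abs if gap_abs >= gap_rel else lang_rel

pick_language(
    {"Mandarin": 10000, "Cantonese": 9800, "Teochew": 500},
    {"Teochew": 0.50, "Mandarin": 0.01},
)
# → 'Teochew'
```

Here the absolute table's lead is tiny (200/10000), while the relative table's lead is huge (0.49/0.50), so the relative winner Teochew is kept, matching the worked example.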

Oh, I also forgot:
if at the end only one language has "unique" 5-grams, we don't compare the scores and directly pick that one.
Also, if we don't succeed in detecting the language with 5-grams, we fall back to 3-grams and then to 2-grams (because, for example, languages using ideograms have very few "relevant" 5-grams yet, but far more 2-grams).
Now, some possible ways of improving the algorithm while keeping the same way of working (i.e. simple changes that would not require me to rethink it):
1 - Maybe choose a better way of computing the "bonus" (for the moment it is totally arbitrary, and it may work better with a more "scientifically" chosen value).
2 - Maybe also use the frequency with which a user contributes in each language to apply a coefficient to the languages (for example, if 99% of my sentences are in French and only 1% in English, then when the results have "close" scores, French is the more likely answer).
3 - Maybe apply a malus to a language if, for a given 5-gram, that language does not appear (but the problem is that for some languages we do not have enough data, so we may be missing some "relevant" 5-grams for that language, which would cause false negatives).
4 - For "unique" 5-grams, maybe apply a low-value filter: I mean, if for a long sentence we get, say, 25 5-grams that appear only in Esperanto and just 1 that appears only in Spanish, for the moment we still go on and compare the scores etc., because more than one language contains "unique" 5-grams.
Of course, as said, there are certainly much more efficient algorithms, but they would require rewriting from scratch (well, not everything, as the server part etc. would stay the same).

Very clever, and with room for improvement...
The idea of a page showing sentences with probably wrong flags is great (it should be sortable or filterable by author...), especially if correcting wrong flags also improves the whole scheme.
I agree that point 3 is tricky, because many rare words are not yet in the corpus (as you well know for French...).
Probably the worst problem polluting the stats is proper nouns, especially when they are not translated (Mary, Tom, ...). Maybe you should consider a way to exclude from the scoring some n-grams that are noun-characteristic rather than language-characteristic...

Actually, nouns and the like do not pollute that much, because they may appear nearly the same absolute number of times in each language's corpus and so will not privilege one language over another.
As for words not yet in the corpus, for French or other languages with an alphabet that is not much of a problem, because most of the time they follow the same patterns (if we take, for example, the ~3000 missing words in French among the 20,000 most common ones, most of them are conjugated forms, so they contain 5-grams like "-avait" etc., or adverbs in "-ement", or rare words of a family ("innovations", for example, may already have the 5-gram "innov" from "innover"/"innovant", and "ation" from a lot of other words in "-ation")).
Good idea about sorting by author (maybe by language and then author; also, maybe Jakov could help add some JavaScript to permit dynamic sorting).

Clever. I see why it may pose a few problems with shorter sentences...
And do you systematically use the whole corpus, or just an extract of it? How often do you refresh your scoring tables from the updated corpus?

The whole corpus, as it would be difficult for me to decide what to extract.
I will refresh the data every week. For the moment, the script that regenerates the database is quite memory-heavy though fast (something like 4 minutes, but it uses 2 GB of RAM), and the disk-based version I had before was far too slow (8 hours), so neither is suitable to run directly on the server.
By the way, something interesting: to measure how accurate it is, I test it on the data added during the week (so not yet included in the detector's database); that's how I got this figure of 98%. And sometimes you discover that a sentence the detector got "wrong" (i.e. different from what the human set) is actually one where the human set the wrong language.
So I will try to see if I can generate some kind of web page listing the sentences where what the human (or, previously, a less accurate detector) set differs from what the current detector guesses, so that a human can use it to quickly find sentences with a wrong flag.

*Awesome*!
Would it theoretically make a difference for the software to know which sentences it detected wrongly, or would the detection improve just by taking a new dump of the database (including the sentences that were detected wrongly and manually corrected)?
What I mean is: would it make sense to have an API call saying "wrong guess" whenever someone changes the flag by hand? Would a memory of what was guessed wrongly improve the program?

For the moment the algorithm is really naive, with no "learning" yet (apart from using the new dump every week); my first goal was to quickly get something to replace the Google API.
I have some notions of machine learning, but I'm not familiar enough with it to integrate it quickly. But yes, in the future we can imagine that a bad detection decreases the scores of the n-grams of the "bad" language contained in the sentence, and increases the scores for the language set by the user.
I will try to put the raw algorithm on the project's wiki page, so that anyone with computer-science knowledge can propose improvements, as I don't pretend to have any expertise in this field.
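That feedback idea could look something like the following sketch. It is entirely hypothetical (sysko says above it is not implemented); the function name and step size are mine.

```python
def feedback(tables, sentence, guessed_lang, correct_lang, n=5, step=1):
    """Hypothetical learning step: when a guess is corrected by hand,
    lower the guessed language's counts for the sentence's n-grams and
    raise the correct language's counts."""
    grams = [sentence[i:i + n] for i in range(len(sentence) - n + 1)]
    for g in grams:
        bad = tables.setdefault(guessed_lang, {})
        bad[g] = max(0, bad.get(g, 0) - step)   # penalize the wrong guess
        good = tables.setdefault(correct_lang, {})
        good[g] = good.get(g, 0) + step         # reward the human-set language
```

Run over every manual flag correction, this would gradually shift the n-gram tables toward the human-verified labels between weekly dumps.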

Excuse me, but it's not really clear to me where I can find it or what it is, even though I understood how it works. What is its purpose for the project?

The purpose is to automatically detect the language of the sentence you enter on Tatoeba, rather than your having to select the language yourself. It is very helpful for Tatoeba, because many people who contribute in multiple languages forget to change the flag of their sentences to the right one.
To use it, you select "automatic detection" in the language list for your sentence once (afterwards, that choice is automatically retained), and there you go, spoiled child...
Furthermore, sysko designed it as an API that can be used from anywhere to detect the language of any sentence, based on statistics from Tatoeba's sentences...
That means Tatoeba's corpus might serve as a base for language detection for other services.

That's great!!! I have been waiting for this feature for a long time. Sometimes the language you want to translate into is six miles down the list, and it's difficult to use the language list when you keep changing languages. Now I can switch from any language to any other without wasting much time ^v^ Thank you, Sysko :-)

>Sometimes, the language you want to translate into is 6 miles down the list
This one is actually a different issue: you can also set parameters in your profile to limit the languages that you see and use.
Go to http://tatoeba.org/fre/user/settings and enter the three-letter ISO codes of the languages you want to use, separated by commas: < ber, eng, ara, fre > for example...
This way, the list is limited...
Tatoeba is soooooo resourceful!

Thank you very much ^v^

Firefox allows you to "find as you type" when the drop-down list is active. Just type the first letters of the language.

I do that, and that's why I use the German interface:
German is 'D'
English is 'E'
Esperanto is 'EE'
Japanese is 'J'
and Spanish is 'SP'

Yep, but that doesn't filter the translations you don't want to see...
With now over 100 languages, and given all the variants, a sentence can sometimes have 100+ translations. The profile parameters let you filter those too, not just shorten the list...

Oh, now I see, thanks.


Bravo and thank you, sysko! It would be great if Tatoeba became a reference for language detection!

Thank you, Sysko. Works well.

I'm very glad to have recently become an advanced contributor on Tatoeba. I don't know what motivated each of you to contribute your time to this site; for me, editing on Tatoeba is a way to learn languages, though I'm not sure how much it really helps. Much of the time, adding and translating sentences is just a way to pass the time, like aimlessly browsing the web or checking what's new on Facebook and Twitter.
I'm still puzzled by the workflow for checking sentences on Tatoeba. Usually I only comment when I happen to see problematic sentences on the home page, pointing out what's wrong with them. A few days ago I tried checking the Chinese sentences starting from the very first one, but the workload turned out to be considerable, and many of the sentences had already been checked by other users; re-checking sentences that are already perfectly correct is a waste of time.
I do think that checking sentences per user is a good idea, though. Like watchlists on Wikipedia, each Tatoeba user would be responsible for certain other users and would keep track of their own progress in checking those users' sentences. I think this would better guarantee the correctness of the sentences.
For those of you learning Chinese: if you like, I can help check your Chinese. I will record my progress through your sentences in my profile, like this (below is my progress translating some users' Esperanto):
First, working on translating this user's English and Esperanto sentences:
http://tatoeba.org/epo/activiti...jameno/page:17
Next, this user:
http://tatoeba.org/chi/activiti...ofi/epo/page:2
http://tatoeba.org/chi/activiti...f/ludoviko/epo
http://tatoeba.org/chi/activiti...s_of/etala/epo
http://tatoeba.org/chi/activiti...f/Manfredo/epo

We're all very glad to have you too :)
About the checking workflow: advanced contributors can add tags, so if you think a Chinese sentence is correct, you can add an "OK" tag; that way other users will see that the sentence is correct. But I agree that we currently lack a wiki where we could say "Chinese sentences #1 to #10000 have all been checked", or "Sadhen is currently checking sentences #10000 to #11000", and so on.
I hope to add this kind of feature soon.
(For the other part, I'll take the initiative to translate it into English, as it may interest non-Chinese speakers/learners.)
(The previous part was about the checking process.)
Basically, in order to ease the checking process, Sadhen proposes that each native speaker take care of one or more users, checking their sentences and tagging them "OK"; that way it will be easier to avoid having several people check the same sentences while others are left aside.
To make this easier, I will try to set up a sort of wiki on Tatoeba soon (open for writing only to advanced contributors at first, so that we don't have yet another place to check for spammers), where we can centralize this kind of information.

By the way, as a Chinese learner I'd be glad if you checked mine :). There aren't that many, because those older than 6 months (and I haven't contributed much these last 6 months) have already been checked by a Chinese friend of mine (the user fucongcong here).

New spam message by http://tatoeba.org/ita/user/profile/rita207

I've received it too...

I have just received it.
Imir-a kan ay t-id-ḍḍfeɣ.

I also received a spam message from that spammer (but I found it interesting, as it contained text in Czech ☺).

Oh Rita, Rita, lovely Rita,why do you do all this? ☺

+1

I deleted it, thanks again :)

Spam message by http://tatoeba.org/ita/user/profile/rose102

I've deleted the messages and blocked the user; thank you for notifying me :)

@sysko, I have a suggestion regarding spam users: why don't you add captchas to new-user registration? I believe it would really help with this issue.

There's already more or less one, but the problem doesn't come at registration, as someone can register "humanly" and then let a script run. I did put in some code based on sessions, but it seems that's not enough. I can try to add a check before sending a private message: if the same message has already been sent 5 times, then also send me a private message so I can check.
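The duplicate-message check described here could be sketched as follows. The class name, the threshold, and the interface are all mine; the real implementation would live in the PHP messaging code.

```python
from collections import Counter

class SpamThrottle:
    """Flag private messages whose exact text has already been sent
    too many times, so an admin can be notified before more copies
    go out."""
    def __init__(self, limit=5):
        self.limit = limit
        self.seen = Counter()

    def allow(self, text):
        # Count this copy, then allow only while under the limit.
        self.seen[text] += 1
        return self.seen[text] < self.limit

t = SpamThrottle()
results = [t.allow("buy now!") for _ in range(6)]
# the 5th and 6th identical messages are flagged (allow() returns False)
```

A production version would also expire old counts and normalize whitespace, but this shows the basic idea.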

ROMANCE SCAM WARNING: if rose102, "a loving and caring lady", sends a message to you too, don't answer. It's just a trick to convince you, in several steps, to eventually send money.

+1

It's good to be a member among you.

Ihi imi llan Yiqbayliyen niḍen da, ad farseγ tagnit ad ken-sseqsiγ:
Amek ara d-nini "aslif" akked "tawellit" s tenglizit ?
- "aslif" d argaz n weltma-s n tmeṭṭut-iw
- "tawellit" d tameṭṭut yezzewjen d-yerezzfen mi tebγa γer wexxam n yimawlan-is.

Llan ula d Icelḥiyen a Djef :-) Ur ttecceḍ ;-)
Deg wayen yerzan awen ayɣef d-tesseqsaḍ:
Aslif (islifen): my sister-in-law's husband.
Yusef d aslif-inu - Yusef is my sister-in-law's husband.
Tawellit (tiwelliyin): married daughter/sister.
Nɛennu tiwelliyin-nneɣ 1- We visit our married daughters.
2- We visit our married sisters.

Azul fell-awen

Azul a Hocine, amek telliḍ?

svp tari s tifinagh

Nekk ed bahra n medden yaḍnin ur nẓadr ad nari s tfinaɣ. Aselkim-inu (حاسوب) ur yeẓdar ad yari s tfinaɣ. Ur ssineɣ maɣef.

nkkni nra ad ɣurnɣ tili yat tutlayt tamaziɣt tatrart

kcm ɣr wansa agi iɣ tbɣit ad tlmedt tifinaɣ
http://www.youtube.com/watch?v=...9&feature=plcp

A sel a argaz n lɛali: nekk ssneɣ tifinaɣ, maca ass-a llan bahra n medden ur ẓdarn ad tent-ẓren (s wallen-nsen) ɣ uselkim. Ar d-ttbanen ɣas imkuẓen (مربعات). Ta tga tamukrist d taxatart. Ixessa ad d-naf tifrat. Maca yezmer ad yili yiles dar-s sin n yigemmayen. Llan bahra n yilsawen ɣ umaḍal dar-sen sin n yigemmayen nɣed ugar: taserbit, takurdit, taswaḥilit, tahawsat, atg. Ari s ugemmay ay trid a gma. Tanemmirt ^v^

tanmmirt ik a gma amstan

ⵎⴰⵎⴻⴽ ⴰⵔⴰ ⵏⴰⵔⵉ: ?
ɣ (غ)
ɛ (ع)

ⵢⴰⵀ

F umya ;-)
Tanemmirt i keyy a gma :-)

Ah, dɣik awd nekk ẓdarɣ ad ẓreɣ isekkilen n tfinaɣ!!! Feṛḥeɣ bahra s waya!!! Ta d tikkelt tamezwarut mani rad zḍarɣ ad ten-ẓreɣ ɣ uselkim-inu ;-)

ⴰⴷ ⵏⴰⵔⵉ ⵙ ⵓⴳⴻⵎⵎⴰⵢ :-)

Ad kecmeɣ s azday-ad ameggaru, ad t-armeɣ.

Tifino, for writing in Tifinagh
Tifino is a program created by Ibrahim Bidi. This Tifinagh writing tool is different from the already well-known Tifinagh fonts, which only work in office documents, independently of any internet connection.
With "Tifino" you can now write in Tifinagh on the forums of your favourite sites, send e-mails, chat, etc., all in Tifinagh.
Here is the technique and how to use it; follow these instructions:
*) Download this small program, Tifino Lite:
code.google.com/p/tifino/
Direct download link: http://tifino.googlecode.com/fi...ino%20lite.exe
*) In the language bar, choose Afrikaans
*) Type something, and yes, it's Tifinagh

kcm ɣr wansa agi ad degs tarit s iskkiln n tifinaɣ
http://www.amazighnews.net/Clavier-Amazigh.html

Tanemmirt. Ssneɣ mani rad d-afeɣ anasiw (لوحة مفاتيح) n tfinaɣ, maca isekkilen-nni ur d-tteffɣen (ur d-ttbanen) ɣ uselkim-inu. Nekk ẓdarɣ ad ten-ariɣ, maca ur ẓdarɣ ad ten-ẓreɣ, ur d-ttbanen.

A sel: keyyin kemmel ari s tfinaɣ, nekkni ad nkemmel ad nari s ugemmay alatini as asmi ara d-naf mamek rad nesmun tira-nneɣ.

azul fllawn

Azul fell-ak a Abrid
Anṣuf yes-k ɣer Tatoeba
Tella yagi tmaziɣt deg Tatoeba.
Hello Abrid
Welcome to Tatoeba
Amazigh (Berber) already exists on Tatoeba.

Azul d ameqran fell-ak a gma Lḥusin,
Hellllllllo Hocine
Gran saludo Hocine
Hallo Hocine
أهلا بك يا حسين،
Anṣuf yes-k ɣer Tatoeba
Welcome to Tatoeba
Bienvenido a Tatoeba
Willkommen zu Tatoeba
مرحبا بك إلى تاتويبا
^v^
Hocine d Amaziɣ aneggaru ay d-ikecmen ɣer Tatoeba
Hocine is the latest Amazigh-speaker to join Tatoeba

Strange linking:
http://tatoeba.org/spa/sentences/show/1647335
0 direct links but 2 indirect links ¿?

Something strange with the links:
http://tatoeba.org/spa/sentences/show/1743412
But when clicking on the English sentence, the indirect link disappears:
http://tatoeba.org/spa/sentences/show/1633619

Maybe you haven't selected those languages in your settings?

Yes, that's it.
I was doing tests in the settings and didn't realize they had an effect.
Thanks :)