Thread #5941 - Tatoeba

At the end (when I post the message saying it's finished :P), to keep it simple, everything will be moved to the sentence that is kept.

So if you post a comment on a duplicate, it will be moved too; same for links. The history on the right will be shown as if all the links had been made to this one sentence.

FeuDRenais May 1, 2011 at 3:31:08 AM UTC

bug: the kept sentence appears twice in the link list

sysko May 1, 2011 at 6:09:02 AM UTC

Not a bug: the script is not finished yet.

sysko May 1, 2011 at 6:13:59 AM UTC

We will regenerate the link list/sentence list right afterwards :)

U2FS May 2, 2011 at 12:51:22 PM UTC

Awesome job so far. Just wondering, is there a way for someone to find out which of the duplicate sentences he/she owned?

Swift May 1, 2011 at 11:17:39 AM UTC

Awesome, indeed. Noticed a pretty deep dip. Do you know how many duplicates it merged?

sysko May 1, 2011 at 11:35:35 AM UTC

around 13 000.

Swift May 1, 2011 at 1:49:14 PM UTC

Great. Thanks for taking care of this!

xtofu80 May 1, 2011 at 7:28:32 PM UTC

Currently Tatoeba is awfully slow. Is this due to the duplication removal operation, or is there another issue?

sysko May 1, 2011 at 8:04:11 PM UTC

Yup, the script is currently updating the logs. Sorry for the inconvenience. Next time it should be a lot faster, since this time the sentences hadn't been deduplicated for months.

ludoviko May 1, 2011 at 11:47:44 PM UTC

Wonderful, just very wonderful :-(((

You, sysko, spend your time removing 13 000 duplicates, but it seems that so far you have not tackled the cause of all these 13 000 duplicates. You wrote to me some time ago that you wanted to write a program so that all translations linked via other languages would be linked, but I still can't see that. I have a look every week or so.

It is a pity that people translated 13 000 sentences - just to see them thrown away.

I am waiting. I would like to promote Tatoeba. But I won't write anything good about Tatoeba, when people translating sentences in Tatoeba will see them thrown away afterwards.

sysko May 2, 2011 at 12:01:53 AM UTC

Writing that new script actually took me 20 minutes of my own personal time; moreover, the duplicates were already there, so handling them didn't delay the new version.

The main reason for the delay is that my personal life is __really__ busy. I'm really sorry not to be able to work as much as I would like on Tatoeba, but I do have job obligations too :(

ludoviko May 2, 2011 at 10:49:37 AM UTC

So the question would be: How could we help you with your personal life? :-) Or with your job? Or, more realistically: it has now been at least six months since I began writing messages on the wall, and to you and to Trang, about the problem. If you don't have time to solve it, is there no other solution? Someone else who could do it?

Constantly throwing away users' contributions is just not very kind to them. Do you and Trang really want to keep doing that periodically? Do you want me to put a message on the wall about the subject every week or so? What solution do you have besides waiting for Godot?

ludoviko May 2, 2011 at 11:16:51 AM UTC

A gigantic waste of time

If it is two minutes per translation, then it's 26 000 minutes for 13 000 translations, or 433 hours of work by the Tatoeba contributors thrown away, more than two months' work.

How about thinking about a rapid solution to the problem?

FeuDRenais May 4, 2011 at 1:13:59 AM UTC

> A gigantic waste of time

Are you joking, or are you seriously criticizing sysko? Because I don't think this is a fair argument to criticize him with.

You could also say things like "if everyone gave a dollar a day to such and such charity, starvation in such and such country would not be a problem". And it'd be true (maybe), and you could make similar arguments for lots of situations. But sysko isn't a machine, and I don't think he thought of optimizing Tatoeba to within 1% global accuracy when he coded it.

In short: Thanks for all the work, sysko. I don't think 13,000 duplicates is as catastrophic as some people are making them out to be. We waste many more minutes each day on far less productive things. "Work thrown away"? Not at all.

FeuDRenais May 4, 2011 at 1:29:34 AM UTC

Also, (840000-13000)/840000 = 98.5% efficiency. I don't know what everyone's standards are, but that's pretty damn good, IMO.

(or is the total number of duplicates over all time >> 13000?)

sacredceltic May 4, 2011 at 1:35:33 AM UTC

>(or is the total number of duplicates over all time >> 13000?)

No, it's just the number since the last time a deduplication procedure was run, a few months ago...

FeuDRenais May 4, 2011 at 1:45:48 AM UTC

I wonder what the worst case number is.

Let's just go nuts, and say that it's, I dunno, 80,000. So, roughly 10% of your contributions are duplicates.

I would say that *even then*, if you told me that 1 out of every 10 of my translations was a link instead of a brand new thing, I wouldn't throw a fit and start complaining about mass inefficiency. It's really not that big of a deal...

sacredceltic May 4, 2011 at 1:33:11 AM UTC

I also think that if we have, as is once again the case, a good deduplication procedure, then creating duplicates is not so serious, and it creates new links. I create duplicates myself for that purpose.
If contributors are upset that their sentences disappear, it is their responsibility to check that their sentences do not already exist in the database before inserting them. The contributor's guide states this clearly.
If a later version makes it possible to detect a duplicate as soon as it is created, that will be a plus, but it won't stop the creator of a duplicate from having typed it for nothing anyway, since it has to exist in order to be detected... So the creation time would be lost in any case.

FeuDRenais May 4, 2011 at 1:50:50 AM UTC

ludoviko May 4, 2011 at 2:05:20 PM UTC

Checking whether a sentence has already been translated into a given language is rather difficult. I have done it from time to time, but it takes time and it is complicated. Has sacredceltic ever done it for some sentences?

For me the solution is simple: I am not translating for the moment. And I no longer send messages about Tatoeba to the Esperanto mailing lists.

Besides, a lot of Esperanto sentences are involved. I remember a list of more than 3000 duplicates in Esperanto, plus probably a large share of the 13 000 sentences thrown away a few days ago. FeuDRenais's proportion should take the languages into account. How many Esperanto sentences have been thrown away so far?

FeuDRenais May 4, 2011 at 3:00:54 PM UTC

> FeuDRenais's proportion should take the languages into account.

"All languages are equal on Tatoeba."

sacredceltic May 4, 2011 at 3:12:11 PM UTC

>Has sacredceltic ever done it for some sentences?

I am not accredited to link translations, but when I had that ability, I used it very extensively between English, French, German, Spanish and... Esperanto!
Now I often create duplicates, because it is my only way to link sentences so that they don't end up in my lists of untranslated sentences, which annoys me more than anything...
I would rather have duplicates that get merged later than keep finding the same stupid sentences in my to-translate lists forever...
Besides, to make a duplicate you only need to copy and paste... it doesn't take me much more time, it's just a bit weirder...

TRANG May 4, 2011 at 1:39:50 AM UTC

> A gigantic waste of time

I would like to suggest another point of view about this :)

Imagine we would do this instead: whenever a user adds a translation or new sentence, we check whether it already exists or not, and if it exists, instead of adding it, we just add the necessary links. Would you feel this is as much of a waste of time?
I'm going to guess not so much (at least that's how I would feel). But well, in both cases, people still have to spend time typing their translations... The duplicate removal script leads to the same kind of result as what I described, except it does it with some delay.
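
To make the comparison concrete, here is a minimal sketch of the "check on insert" idea described above. It is not Tatoeba's code: the `Corpus` class, its in-memory dictionaries and the method names are all hypothetical stand-ins for the real sentence and link tables.

```python
from collections import defaultdict


class Corpus:
    """Toy in-memory stand-in for the sentence and link tables (hypothetical)."""

    def __init__(self):
        self.sentences = {}            # sentence id -> (lang, text)
        self.by_text = {}              # (lang, text) -> sentence id
        self.links = defaultdict(set)  # sentence id -> ids of direct translations
        self._next_id = 1

    def add_sentence(self, lang, text):
        """Return the id of an identical existing sentence, or create a new one."""
        key = (lang, text)
        if key in self.by_text:
            return self.by_text[key]
        sid = self._next_id
        self._next_id += 1
        self.sentences[sid] = key
        self.by_text[key] = sid
        return sid

    def add_translation(self, source_id, lang, text):
        """Add a translation; if the text already exists, only the link is added."""
        target_id = self.add_sentence(lang, text)
        self.links[source_id].add(target_id)
        self.links[target_id].add(source_id)
        return target_id


corpus = Corpus()
fr = corpus.add_sentence("fra", "Je t'aime.")
de = corpus.add_sentence("deu", "Ich liebe dich.")
en1 = corpus.add_translation(fr, "eng", "I love you.")
en2 = corpus.add_translation(de, "eng", "I love you.")  # reuses en1, only adds a link
```

The end state is the same as adding the second "I love you." and merging it later: one English sentence linked to both the French and the German one. The contributor still had to type the sentence either way.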

You also have to take into account that many (...most? ...all?) members in Tatoeba translate for practice and/or for learning. I can't speak in the name of everyone, but I can say at least that as far as I'm concerned, I wouldn't feel I've wasted my time if 10% or more of my contributions happened to be duplicates and were "merged". Because when I contribute... or rather contributed, I also got to practice and to learn a lot, and it was much less boring than doing typical translation homework for my language classes.
Of course, if others don't feel that way, I can perfectly understand, and they are free to stop contributing (they even should), until we release the new version. I know it's been a while since we've talked about it, but it's a pretty big task and it's not surprising that it's taking that long.

Now, regarding your question of a quick solution. Well, the duplicate removal script was the quick solution.

Another "meantime" but less quick solution is to get more people to link sentences... but I think people have much more fun translating than linking.

Someone could also try making lists of "sentences that could potentially be linked but are not linked". All the information needed to make such a list is here:
http://tatoeba.org/eng/download...mple_sentences
And from there we could probably do something to display the lists in some way and get people to create links more quickly.
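
As a rough illustration of such a list, the sketch below assumes the downloaded data can be reduced to pairs of sentence ids, one pair per existing link. That layout is an assumption, and `unlinked_candidates` is a made-up name. It reports pairs of sentences that share a translation but are not linked to each other; a person would still have to confirm each pair.

```python
from collections import defaultdict
from itertools import combinations


def unlinked_candidates(links):
    """Return pairs of sentence ids that share a translation but are not linked.

    `links` is an iterable of (sentence_id, translation_id) pairs
    (assumed layout, not necessarily that of the real export).
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    candidates = set()
    for pivot, around in list(neighbours.items()):
        # Any two sentences linked to the same pivot are potential translations.
        for x, y in combinations(sorted(around), 2):
            if y not in neighbours[x]:
                candidates.add((x, y))
    return sorted(candidates)


example_links = [(1, 2), (2, 3), (3, 4), (1, 3)]
candidate_pairs = unlinked_candidates(example_links)
print(candidate_pairs)
```

A real list would probably also want to skip pairs in the same language and show the sentence text next to each id.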

And we still need a feature that allows multi-link translations, so that those speaking more than 2 languages can add a translation that links to sentences in more than one language.

Finally, regarding "How could we help you with your personal life?", we simply (or not so simply) need to get more people involved in the project as a whole. As Swift mentioned, I started writing about the matter this weekend: http://blog.tatoeba.org/2011/04...s-to-help.html
And I will be writing more...

Anyway, I hope that gives you a better picture :)

FeuDRenais May 4, 2011 at 1:49:40 AM UTC

Out of curiosity, how much time would it take to do something as brute as run a similar_text() comparison between a new translation and all the existing sentences already in the database? Really, really long, I'm guessing?
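
For a feel of what the brute-force approach involves, here is a small sketch. It uses Python's `difflib.SequenceMatcher` as a stand-in for PHP's `similar_text()`; the two do not compute the same score, and the 0.9 threshold and the function name `most_similar` are arbitrary choices for illustration.

```python
import difflib
import time


def most_similar(new_text, existing, threshold=0.9):
    """Compare new_text against every existing sentence, one by one."""
    matches = []
    for text in existing:
        ratio = difflib.SequenceMatcher(None, new_text, text).ratio()
        if ratio >= threshold:
            matches.append((ratio, text))
    return sorted(matches, reverse=True)


existing = ["I love you.", "I love you!", "I love him.", "Where is the station?"]

start = time.perf_counter()
result = most_similar("I love you.", existing)
elapsed = time.perf_counter() - start

print(result, f"{elapsed:.6f} s")
```

Every new sentence costs one comparison per stored sentence, and each comparison is far more expensive than an equality test, so the running time grows with the size of the corpus.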

sysko May 4, 2011 at 4:28:00 PM UTC

I did it for you: here is the raw query (so without even any processing of the result)

select 1 from sentences where text = "I love you."
1 row in set (1,24 sec)

So with 2000 contributions a day, you lose the better part of an hour a day (2000 × 1.24 s ≈ 41 minutes) on pure checking alone. Moreover, it does not solve the problem that you still need to type the sentence first before knowing that it already exists.

FeuDRenais May 4, 2011 at 6:39:00 PM UTC

That's good speed though. If you partitioned by language and string length it probably would go fast enough not to bother a regular user too much.

But I agree that it doesn't really make sense if the duplicate script is going to do the job anyway.

sysko May 4, 2011 at 6:43:22 PM UTC

The problem is that it affects not just you but everyone, and on a server running at more than 100% of its capacity, it means roughly one hour less for doing other stuff.

sysko May 4, 2011 at 6:45:45 PM UTC

Because it does not mean "I, and only I, am going to wait 1.24 seconds more" but "for 1.24 s the server will do nothing except look for my sentence".

Zifre May 4, 2011 at 7:46:35 PM UTC

Could you add a column to the table with an MD5 hash? It would take a long time to generate for all existing sentences, but it would greatly speed up searching for duplicates.
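
A minimal sketch of the hash idea, assuming the hash covers the language code plus the sentence text. The `HashIndex` class and `sentence_hash` function are hypothetical; a dictionary plays the role that an indexed hash column would play in the database.

```python
import hashlib


def sentence_hash(lang, text):
    """MD5 of language code plus text; a lookup key only, not a security measure."""
    return hashlib.md5(f"{lang}\t{text}".encode("utf-8")).hexdigest()


class HashIndex:
    """Dictionary keyed by hash, standing in for an indexed hash column."""

    def __init__(self):
        self._rows = {}  # hash -> list of (sentence id, lang, text)

    def add(self, sentence_id, lang, text):
        self._rows.setdefault(sentence_hash(lang, text), []).append(
            (sentence_id, lang, text)
        )

    def find_duplicate(self, lang, text):
        """Return the id of an identical sentence, or None if there is none."""
        for sid, row_lang, row_text in self._rows.get(sentence_hash(lang, text), []):
            # Compare the real values too, in case two texts share a hash.
            if row_lang == lang and row_text == text:
                return sid
        return None


index = HashIndex()
index.add(1, "eng", "I love you.")
index.add(2, "fra", "Je t'aime.")
found = index.find_duplicate("eng", "I love you.")
missing = index.find_duplicate("eng", "I love him.")
```

The lookup touches only the rows that share the hash instead of scanning every sentence, which is where the speed-up would come from.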

sysko May 4, 2011 at 8:06:03 PM UTC

Yup, we can. Actually MD4 will be enough, as we're not looking for a secure hash. It's just that I decided to use the little time I have to develop the new version, and I didn't realize a few months ago that there was an inherent, tricky bug in MySQL when the previous script was run under high load and concurrency.

Now, with this script, which will be run at least once a week and does not require any modification to the current PHP code, I think the result will be more or less the same as doing the check in real time with a hash (as I said to FeuDRenais, this kind of solution still requires the user to type and validate the sentence).

But if Trang or someone else wants to code it, sure.

Swift May 2, 2011 at 12:45:05 PM UTC

Well, there are two issues here: The scope of the problem and the solutions to it.

Regarding the scope, I don't think these 13 000 translations took an average of a minute to translate. Generally the duplicates are the sort of simple sentences that are likely to be added twice to the corpus. Something like "I'm waiting for Godot" rather than the longer sentences.

From the logs, I reckon it's closer to a few seconds. Still a considerable amount of time and I don't think anyone is terribly happy with the situation.

At the same time, as Muiriel and I have pointed out, should the same sentence be entered twice as A and A' and then subsequently translated to different languages B and C, then merging A and A' will leave a single sentence A, with links to both B and C where nothing is lost.

It's also possible that B and C are entered in different languages and each of these is translated into the same language with sentences A and A'. When A and A' are merged, the end result is again a sentence A with links to B and C, indirectly linking the two latter. The inefficiency only arises if it takes longer to translate a sentence than to find, verify and link two sentences that don't share a translation. I think that's rather uncommon.

The final case is when sentence A is translated into B, which is translated into C and then back into the original language with sentence A'. Again, there is value in that last translation, as it effectively provides a link between C and A. With the current Tatoeba system, I think making a quick duplicate is actually the simplest way to link such sentences (even with Zifre's awesome Greasemonkey script).
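
The cases above can be sketched with a toy merge. This is not the actual duplicate script: `merge_duplicates` and the dictionary of link sets are assumptions made for illustration.

```python
def merge_duplicates(links, kept, duplicate):
    """Return a new link map in which `duplicate` has been folded into `kept`.

    `links` maps each sentence to the set of sentences it is linked to.
    """
    merged = {}
    for sid, targets in links.items():
        new_sid = kept if sid == duplicate else sid
        new_targets = {kept if t == duplicate else t for t in targets}
        new_targets.discard(new_sid)  # a sentence is never linked to itself
        merged.setdefault(new_sid, set()).update(new_targets)
    return merged


# First case: A was translated into B, and the duplicate A' into C.
links = {"A": {"B"}, "A'": {"C"}, "B": {"A"}, "C": {"A'"}}
merged = merge_duplicates(links, kept="A", duplicate="A'")
print(merged)
```

After the merge, A carries links to both B and C, so the work that went into translating A' is not lost. Using sets for the links also means the kept sentence cannot show up twice in a link list.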

So, the problem isn't so great, in my mind, but it would still be nice to do something about it. The duplicate script currently takes a long time to run and slowed the server down considerably yesterday.

As Trang mentioned in her post from yesterday: http://blog.tatoeba.org/2011/04...s-to-help.html , there's still a while until collaborative work on the new version can begin. At the same time, anyone is free to take part in maintaining the current version or developing new features for it. One idea would be to create a "safe" contribution feature that would first check for related translations. This would perfectly fit translators who spend more time on translations.

But as Trang noted, there are loads of things related to Tatoeba that people can do, and the more people actively join the various "departments", the more pressure there will be on sysko to use his days off to code in some dingy little office rather than prance around the streets or hills in the fresh spring air. ;-)

Ough... I think I was going to pull these threads together into some nice conclusion, but this has been way too long and I need to get to my work so that I can enjoy some of the Icelandic spring (did I mention we had snow yesterday...?).
