I'm going to run some maintenance operations on the database, so Tatoeba will be down for a few minutes. It shouldn't be long.
If the number of sentences decreases over the next few hours, that's normal: the duplicates are being removed.
How does this handle comments, links, etc.? Will links made by the system show up in the logs as if they were made by whoever made the link on the duplicate sentence that is being removed?
At the end (when I post the message saying it's finished :P), to keep it simple, everything will be moved to the sentence that is kept.
So if you post a comment on a duplicate, it will be moved too; same for links. In the history on the right, it will be shown as if all the links had been made to this one sentence.
bug: the kept sentence appears twice in the link list
Not a bug; the script is not finished yet.
we will regenerate the link list/sentence list just after :)
Awesome job so far. Just wondering, is there a way one could know which duplicate sentences he/she owned?
Awesome, indeed. Noticed a pretty deep dip. Do you know how many duplicates it merged?
around 13 000.
Great. Thanks for taking care of this!
Currently Tatoeba is awfully slow. Is this due to the duplication removal operation, or is there another issue?
Yup, the script is currently updating the logs. Sorry for the inconvenience; next time it should be a lot faster, as this time the sentences hadn't been deduplicated for months.
Wonderful, just very wonderful :-(((
You, sysko, do spend your time to take away 13 000 duplicates - but it seems that up to now you did not attack the cause of all these 13 000 duplicates. You wrote to me, some time ago, you wanted to make a program so that all translations linked via other languages would be linked - but I still can't see that. I have a look every week or so.
It is a pity that people translated 13 000 sentences - just to see them thrown away.
I am waiting. I would like to promote Tatoeba. But I won't write anything good about Tatoeba, when people translating sentences in Tatoeba will see them thrown away afterwards.
Writing that new script actually took me 20 minutes of my own personal time. Moreover, the duplicates were already there, so handling them didn't delay the new version.
The main reason for the delay is that my personal life is __really__ busy. I'm really sorry I can't work on Tatoeba as much as I would like, but I do have job obligations too :(
So the question would be: How could we help you with your personal life? :-) Or with your job? Or, more realistically: it's been at least six months since I began writing messages on the wall, and to you and to Trang, about the problem. If you don't have time to solve it, is there no other solution? Someone else who could do it?
To constantly throw away users' contributions is just not very kind to them. Do you and Trang really want to keep doing that periodically? Do you want me to put a message on the wall about the subject every week or so? What solution do you have besides waiting for Godot?
A gigantic waste of time
If it is two minutes per translation, then it's 26 000 minutes for 13 000 translations, or 433 hours of work by the Tatoeba contributors thrown away, more than two months' work.
How about thinking about a rapid solution of the problem?
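The estimate above can be checked with a few lines of Python. The two minutes per translation is the poster's assumption; the 8-hour working day and 21 working days per month are additional assumptions made here only to express the total in months.

```python
# Figures from the post above.
minutes_per_translation = 2   # the poster's assumption
duplicates = 13_000

# Assumptions added for this sketch, not stated in the thread.
hours_per_working_day = 8
working_days_per_month = 21

total_minutes = minutes_per_translation * duplicates
total_hours = total_minutes / 60
working_months = total_hours / hours_per_working_day / working_days_per_month

print(f"{total_minutes} minutes = {total_hours:.0f} hours = {working_months:.1f} working months")
```

Under a smaller per-translation time, such as the few seconds suggested later in the thread, the total shrinks proportionally.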
> A gigantic waste of time
Are you joking, or are you seriously criticizing sysko? Because I don't think this is a fair argument to criticize by.
You could also say things like "if everyone gave a dollar a day to such and such charity, starvation in such and such country would not be a problem". And it'd be true (maybe), and you could make similar arguments for lots of situations. But sysko isn't a machine, and I don't think he thought of optimizing Tatoeba to within 1% global accuracy when he coded it.
In short: Thanks for all the work, sysko. I don't think 13,000 duplicates is as catastrophic as some people are making them out to be. We waste many more minutes each day on far less productive things. "Work thrown away"? Not at all.
Also, (840000-13000)/840000 = 98.5% efficiency. I don't know what everyone's standards are, but that's pretty damn good, IMO.
(or is the total number of duplicates over all time >> 13000?)
>(or is the total number of duplicates over all time >> 13000?)
No, it's just the number since the last time a deduplication procedure was run, a few months ago...
I wonder what the worst case number is.
Let's just go nuts, and say that it's, I dunno, 80,000. So, roughly 10% of your contributions are duplicates.
I would say that *even then*, if you told me that 1 out of every 10 of my translations was a link instead of a brand new thing, I wouldn't throw a fit and start complaining about mass inefficiency. It's really not that big of a deal...
I also think that if we have a good deduplication procedure, as is once again the case, then creating duplicates is not that serious, and it creates new links. I create duplicates myself for that purpose.
If contributors are upset that their sentences disappear, it is their responsibility to check that their sentences do not already exist in the database before adding them. The contributor's guide says so clearly.
If a later version makes it possible to detect a duplicate as soon as it is created, that will be a plus, but it won't prevent the duplicate's author from having typed it for nothing anyway, since it has to exist in order to be detected... So the time spent creating it would be lost either way.
Checking whether a sentence has already been translated into a given language is rather difficult. I have done it from time to time, but it takes a while and it is complicated. Has sacredceltic ever done it for particular sentences?
For me the solution is simple: I am not translating for now. And I no longer send messages about Tatoeba to the Esperanto mailing lists.
Besides, a lot of these are Esperanto sentences. I remember a list of more than 3,000 Esperanto duplicates. Plus, probably, a large share of the 13,000 sentences thrown away a few days ago. FeuDRenais's proportion should take the languages into account. How many Esperanto sentences have been thrown away so far?
> FeuDRenais's proportion should take the languages into account.
"All languages are equal on Tatoeba."
>Has sacredceltic ever done it for particular sentences?
I'm not authorized to link translations, but when I had that ability, I used it very extensively between English, French, German, Spanish and... Esperanto!
Now I often create duplicates, because it's my only way to link sentences and keep them from turning up in my lists of untranslated sentences, which annoys me more than anything...
I would rather have duplicates that get merged later than keep finding the same stupid sentences in my to-translate lists forever...
Besides, to make a duplicate, you just copy and paste... it doesn't take me much longer, it's just a bit odd...
> A gigantic waste of time
I would like to suggest another point of view about this :)
Imagine we would do this instead: whenever a user adds a translation or new sentence, we check whether it already exists or not, and if it exists, instead of adding it, we just add the necessary links. Would you feel this is as much of a waste of time?
I'm going to guess not so much (at least that's how I would feel). But in both cases, people still have to spend time typing their translations... The duplicate removal script leads to the same kind of result as what I described, except it does it with some delay.
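A minimal sketch of the check-on-insert idea described above, in Python with an in-memory SQLite database. The table names (`sentences`, `links`) and the functions `add_sentence` and `add_translation` are hypothetical and not taken from the Tatoeba code base; the sketch also assumes a duplicate means the same text in the same language. The point is only that a duplicate ends up as a link to the existing sentence rather than as a new row.

```python
import sqlite3

def make_db():
    """Create a throwaway in-memory database with a hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")
    db.execute("CREATE TABLE links (a INTEGER, b INTEGER, UNIQUE (a, b))")
    return db

def add_sentence(db, lang, text):
    """Return the id of an existing identical sentence, or insert a new one."""
    row = db.execute(
        "SELECT id FROM sentences WHERE lang = ? AND text = ?", (lang, text)
    ).fetchone()
    if row is not None:
        return row[0]  # duplicate: reuse the existing sentence
    cur = db.execute("INSERT INTO sentences (lang, text) VALUES (?, ?)", (lang, text))
    return cur.lastrowid

def add_translation(db, source_id, lang, text):
    """Add a translation of source_id, linking instead of duplicating."""
    target_id = add_sentence(db, lang, text)
    # Store links in both directions; OR IGNORE skips links that already exist.
    db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (source_id, target_id))
    db.execute("INSERT OR IGNORE INTO links VALUES (?, ?)", (target_id, source_id))
    return target_id

db = make_db()
fr = add_sentence(db, "fra", "Je t'aime.")
de = add_sentence(db, "deu", "Ich liebe dich.")
en1 = add_translation(db, fr, "eng", "I love you.")
en2 = add_translation(db, de, "eng", "I love you.")  # same text: reused, not duplicated
```

Here the second English translation does not create a new row; it only adds links, which is the same end state the duplicate removal script reaches after the fact.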
You also have to take into account that many (...most? ...all?) members in Tatoeba translate for practice and/or for learning. I can't speak in the name of everyone, but I can say at least that as far as I'm concerned, I wouldn't feel I've wasted my time if 10% or more of my contributions happened to be duplicates and were "merged". Because when I contribute... or rather contributed, I also got to practice and to learn a lot, and it was much less boring than doing typical translation homework for my language classes.
Of course, if others don't feel that way, I can perfectly understand, and they are free to stop contributing (they even should), until we release the new version. I know it's been a while since we've talked about it, but it's a pretty big task and it's not surprising that it's taking that long.
Now, regarding your question of a quick solution. Well, the duplicate removal script was the quick solution.
Another "meantime" but less quick solution is to get more people to link sentences... but I think people have much more fun translating than linking.
Someone could also try making lists of "sentences that could potentially be linked but are not linked". All the information needed to make such a list is here:
And from there we could probably do something to display the lists in some way and get people to create links more quickly.
And we still need a feature that allows multi-link translations, so that those speaking more than 2 languages can add a translation that links to more than one language.
Finally, regarding "How could we help you with your personal life?", we simply (or not so simply) need to get more people involved in the project as a whole. As Swift mentioned, I started writing about the matter this weekend: http://blog.tatoeba.org/2011/04...s-to-help.html
And I will be writing more...
Anyway, I hope that gives you a better picture :)
Out of curiosity, how much time would it take to do something as brute as run a similar_text() comparison between a new translation and all the existing sentences already in the database? Really, really long, I'm guessing?
I did it for you. Here is the raw query (so without even any processing of the result):
select 1 from sentences where text = "I love you."
1 row in set (1,24 sec)
So with 2,000 contributions a day, you'd lose close to an hour a day purely on checking. Moreover, it does not solve the problem that you still need to type the sentence first before knowing whether it already exists.
That's good speed though. If you partitioned by language and string length it probably would go fast enough not to bother a regular user too much.
But I agree that it doesn't really make sense if the duplicate script is going to do the job anyway.
The problem is that it doesn't affect only you, but everyone. And on a server running at more than 100% of its capacity, it means one hour less for doing other stuff.
Because it does not mean "I, and only I, am going to wait 1.24 seconds more" but "for 1.24 s the server will do nothing except look for my sentence".
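As a quick sanity check of the server-time figure discussed above, here is the arithmetic in Python. It assumes every contribution triggers exactly one query at the measured 1.24 s, and that the 2,000 contributions are per day; both are simplifications.

```python
# Figures quoted in the thread; the daily contribution count is a rough estimate.
seconds_per_check = 1.24       # measured time for one exact-match query
contributions_per_day = 2000

total_seconds = seconds_per_check * contributions_per_day
total_minutes = total_seconds / 60

print(f"{total_seconds:.0f} s per day, about {total_minutes:.0f} minutes")
```

That comes out somewhat under the hour quoted above, but it is the same order of magnitude, and it is time during which the server cannot do anything else.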
Could you add a column to the table with a MD5 hash? It would take a long time to generate for all existing sentences but it would greatly speed up searching for duplicates.
Yup, we can. Actually MD4 will be enough, as we're not looking for a secure hash. It's just that I decided to use the little time I have to develop the new version, and I didn't realize a few months ago that there was an inherent, tricky bug in MySQL when the previous script was run under high load and concurrency.
Now, with this script, which will be run at least once a week and does not require any modification to the current PHP code, I think the result will be more or less the same as doing the check in real time with a hash (as I said to FeuDRenais, this kind of solution still requires the user to type and validate the sentence).
but if Trang or someone wants to code it, sure.
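A rough sketch of the hash-column idea discussed above, using Python and an in-memory SQLite database rather than the MySQL/PHP setup mentioned in the thread. The schema, the `text_hash` column name and the helper functions are hypothetical, and the sketch assumes duplicates are only looked for within one language. MD5 is used because it is readily available in Python's `hashlib`; the hash only narrows the search, so the text itself is still compared to rule out collisions.

```python
import hashlib
import sqlite3

def text_hash(text):
    """Hex MD5 digest of the sentence text (not used for security here)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT, text_hash TEXT)"
)
# The index lets the lookup avoid scanning every row.
db.execute("CREATE INDEX idx_lang_hash ON sentences (lang, text_hash)")

def insert_sentence(lang, text):
    """Insert a sentence together with its precomputed hash; return its id."""
    cur = db.execute(
        "INSERT INTO sentences (lang, text, text_hash) VALUES (?, ?, ?)",
        (lang, text, text_hash(text)),
    )
    return cur.lastrowid

def find_duplicate(lang, text):
    """Return the id of an identical sentence in the same language, or None."""
    row = db.execute(
        "SELECT id FROM sentences WHERE lang = ? AND text_hash = ? AND text = ?",
        (lang, text_hash(text), text),
    ).fetchone()
    return row[0] if row else None

first = insert_sentence("eng", "I love you.")
```

Calling `find_duplicate("eng", "I love you.")` then returns the id of the stored sentence, while a different text or language returns `None`. As noted in the thread, this speeds up the check but the contributor still has to type the sentence before it can be detected.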
Well, there are two issues here: The scope of the problem and the solutions to it.
Regarding the scope, I don't think these 13 000 translations took an average of a minute to translate. Generally the duplicates are the sort of simple sentences that are likely to be added twice to the corpus. Something like "I'm waiting for Godot" rather than the longer sentences.
From the logs, I reckon it's closer to a few seconds. Still a considerable amount of time and I don't think anyone is terribly happy with the situation.
At the same time, as Muiriel and I have pointed out, should the same sentence be entered twice as A and A' and then subsequently translated to different languages B and C, then merging A and A' will leave a single sentence A, with links to both B and C where nothing is lost.
It's also possible that B and C are entered in different languages and each of these is translated into the same language with sentences A and A'. When A and A' are merged, the end result is again a sentence A with links to B and C, indirectly linking the two latter. The inefficiency only arises if it takes longer to translate a sentence than to find, verify and link two sentences that don't share a translation. I think that's rather uncommon.
The final case is when sentence A is translated into B, which is translated into C and then back into the original language with sentence A'. Again here, there is value in that last translation, as it effectively provides a link between C and A. With the current Tatoeba system, I think making a quick duplicate is actually the simplest way to link such sentences (even with Zifre's awesome Greasemonkey script).
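The cases above all come down to the same operation: redirect every link of the removed duplicate to the kept sentence. A minimal Python sketch, assuming links are undirected pairs of sentence identifiers; the function name `merge_duplicates` and the data representation are hypothetical, not the actual script.

```python
def merge_duplicates(links, keep, remove):
    """Return a new link set where every link to `remove` points to `keep`."""
    merged = set()
    for a, b in links:
        a = keep if a == remove else a
        b = keep if b == remove else b
        if a != b:  # drop a self-link if the two duplicates were linked together
            merged.add(frozenset((a, b)))
    return merged

# First case from the post: A is linked to B, and A' is linked to C.
links = {frozenset(("A", "B")), frozenset(("A'", "C"))}
result = merge_duplicates(links, keep="A", remove="A'")
```

After the merge, `result` holds the links A-B and A-C, so nothing is lost. Because links are stored in a set, a link that exists on both duplicates appears only once afterwards.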
So, the problem isn't so great, in my mind, but it would still be nice to do something about it. The duplicate script currently takes a long time to run and slowed the server down considerably yesterday.
As Trang mentioned in her post from yesterday: http://blog.tatoeba.org/2011/04...s-to-help.html , there's still a while until collaborative work on the new version can begin. At the same time, anyone is free to take part in maintaining the current version or developing new features for it. One idea would be to create a "safe" contribution feature that would first check for related translations. This would perfectly fit translators who spend more time on translations.
But as Trang noted, there are loads of things related to Tatoeba that people can do, and the more people actively join the various "departments", the more pressure there will be on sysko to use his days off to code in some dingy little office rather than prance around the streets or hills in the fresh spring air. ;-)
Ough... I think I was going to conclude this by pulling these threads together into some nice conclusion, but this has been way too long and I need to get to my work so that I can enjoy some of the Icelandic spring (did I mention we had snow yesterday...?).