** Sentence deduplication **
We ran the deduplication script on the dev server last weekend. We found some other issues because of sentences such as #613428, which has the duplicates #2505312, #2505313, #2509131, and #2509135.
The issues were fixed. The script completed yesterday. It seems to deduplicate properly.
You can go and check some of the sentences that were deduplicated by looking at Horus' comments.
You can also check the export files of the dev database: http://downloads.tatoeba.org/dev/
and compare them with the ones from prod: http://tatoeba.org/eng/downloads
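For anyone who wants to automate that comparison, here is a minimal sketch. It assumes the sentence exports are tab-separated with the columns id, language, text; the inline sample data and the function names are made up for illustration and are not part of the deduplication script.

```python
import csv
import io


def read_sentences(handle):
    """Return {sentence_id: (lang, text)} from a tab-separated export.

    Assumption: each row is "id<TAB>lang<TAB>text".
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(row[0]): (row[1], row[2]) for row in reader if len(row) >= 3}


def removed_ids(prod, dev):
    """Ids present in the prod export but absent from the dev export."""
    return sorted(set(prod) - set(dev))


# Hypothetical sample data standing in for the two downloaded exports.
prod_export = io.StringIO("1\teng\tHello.\n2\teng\tHello.\n3\tfra\tBonjour.\n")
dev_export = io.StringIO("1\teng\tHello.\n3\tfra\tBonjour.\n")

prod = read_sentences(prod_export)
dev = read_sentences(dev_export)
print(removed_ids(prod, dev))
```

With real files you would pass open file handles instead of the in-memory samples. Sentences deleted on dev for other reasons would also show up in the result, so this only narrows down where to look.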
There are 2 problems remaining.
1) The way things are logged can be confusing.
For instance: http://dev.tatoeba.org/eng/sentences/show/613428
Question: is it a problem for anyone if things are logged this way?
We could spend more time making the logs more user-friendly, or we could just leave them this way.
2) Comments are currently copied (instead of being moved) to the main sentence.
For instance: http://dev.tatoeba.org/eng/sentences/show/3550769.
If you go to the main sentence: http://dev.tatoeba.org/eng/sentences/show/1926402
you will see that there is a copy of the comment.
Question: do you prefer to have the comments on both the main and the duplicate sentence? Or only on the main sentence?
Thanks in advance for your feedback.
1) The logs look somewhat verbose but I'm sure I can live with it. The benefit of de-duplication greatly exceeds this inconvenience, I think. Besides, the subsequent runs of the script will generate far fewer log entries, I believe.
2) I think it's OK the way it is now.
I'm thrilled that we've gotten this far! Thanks to everyone who has made it happen.
I would definitely be in favor of running the script now and making improvements to the log later, if at all. It's already possible to get the information we need now, and we would all benefit from having those duplicates merged as soon as possible.
If and when we do make changes to the logging, is it possible to report the number of the sentence on which an operation was performed as well as the operation? For instance, instead of:
linked to #123456
could we write this?
#111111 linked to #123456
Optionally, that first number could be suppressed where it's identical to the sentence where the log is being displayed. But if it's displayed all the time, that's fine, too. It doesn't take up much horizontal space.
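To make the suggestion concrete, here is a rough sketch of the formatting rule I have in mind. The function and parameter names are hypothetical; I don't know how the real logging code is organized.

```python
def format_log_line(subject_id, action, target_id, page_id=None, always_show=True):
    """Build a log line such as "#111111 linked to #123456".

    subject_id: sentence the operation was performed on.
    page_id: sentence whose page is displaying the log (None elsewhere).
    always_show: if False, hide the subject when it matches page_id.
    """
    hide_subject = not always_show and subject_id == page_id
    prefix = "" if hide_subject else f"#{subject_id} "
    return f"{prefix}{action} #{target_id}"


print(format_log_line(111111, "linked to", 123456))
print(format_log_line(111111, "linked to", 123456, page_id=111111, always_show=False))
```

The first call prints the full form, and the second prints the shortened form for the case where the log is shown on the page of #111111 itself.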
Whether comments are copied or not doesn't matter to me, so I would say the current behavior is fine.
> is it possible to report the number of the sentence on which an operation
> was performed as well as the operation?
It's already displayed, just not displayed clearly.
If you look for instance at my logs:
linked to #3649128
When the logs are on the sentence's page, all the operations only concern the sentence itself, which is why the sentence number is not displayed.
If there is a place outside of the sentence's page where the sentence id is not displayed, let me know.
I think we should take either intelligibility or exactitude. Probably the latter. Current pages are neither intelligible nor exact.
I'm still not happy with sentences like http://dev.tatoeba.org/fre/sentences/show/2144030 (as I wrote earlier in http://tatoeba.org/eng/wall/sho...essage_21045).
I agree with Alan and think that sentence numbers should be displayed. That would solve most problems. Besides, every addition of a sentence should be logged.
Eckhardtgabriel - Aug 27th 2010, 19:49
Why you don't take off your coat?
CK - Jan 14th 2013, 15:57
unlinked #483087 from #2143848
May 9th 2014, 22:22
[Comment to #483087]
Word order: Why don't you take off our coat?
I know this doesn't look very nice, but each sentence number has its history. You can't just pretend they were the same.
> I agree with Alan and think that sentence numbers should be displayed.
> That would solve most problems.
As I've replied to Alan, the sentence numbers are displayed on all the pages other than the sentence's page. On the sentence's page, all the operations only concern the sentence itself so the number would be repeated everywhere. You won't see something like "#2 linked to #3" on sentence #1.
Regarding the comments, the question is whether it's necessary to fix this, which means the deduplication could be delayed by an extra week, or whether it's fine to have some data that is not completely coherent and have the script run earlier.
As far as I'm concerned, I don't think there is a hurry to run the script so we could take the extra time, or an extra month if needed. But I know that people have been waiting for the deduplication for years, literally, and may just be alright with not having the perfect script.
In order to solve your problem, the simplest solution I can think of would be:
1) We don't copy the logs of the duplicates, we leave them there. Once the deduplication is done, we add a comment on the main sentence to indicate which duplicates were found.
For instance: "This sentence had duplicates: #123, #456, #789. The duplicates have been deleted."
2) We do copy the comments of the duplicates, but we add a short message. "Comment copied from #123 because of duplicate merge".
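Roughly, the two steps could look like the sketch below. It is only an illustration on in-memory data: the function name and the dictionary of comments are hypothetical, the real script works on the database, and the message texts are the examples above, not final wording.

```python
def merge_comments(main_id, duplicate_ids, comments):
    """Return the comment list for the main sentence after a merge.

    comments: dict mapping sentence id -> list of comment texts.
    Copied comments get a short note naming the duplicate they came from,
    and a final summary comment lists every duplicate that was merged.
    """
    merged = list(comments.get(main_id, []))
    for dup_id in duplicate_ids:
        for text in comments.get(dup_id, []):
            merged.append(
                f"{text} [Comment copied from #{dup_id} because of duplicate merge]"
            )
    listed = ", ".join(f"#{dup_id}" for dup_id in duplicate_ids)
    merged.append(
        f"This sentence had duplicates: {listed}. The duplicates have been deleted."
    )
    return merged


comments = {100: ["Nice sentence."], 123: ["Typo?"], 456: []}
for line in merge_comments(100, [123, 456, 789], comments):
    print(line)
```

Because the summary comment lists every duplicate, including ones that were never linked to the main sentence, it is possible to go from the remaining sentence back to all the deleted ones.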
I basically agree, but is Horus going to speak only English?
For the moment yes. There are definitely solutions to translate Horus' comments but I see nothing easy enough that would be worth investing the time and delaying the deduplication.
Considering that it will be tedious to change the text of the messages added by Horus after the deduplication, I'd like to have your opinion on what should be written.
Keep in mind that the text will be only in English, it won't be translated like the rest of the interface. So we should try to make it simple enough for people who do not speak English.
I'll post some suggestions below. Feel free to suggest something else.
1) The message on the deleted duplicate.
2) The message on the remaining sentence
3) The extra info in the comments that were copied.
In the example you provide, you say you performed 4 deletions, but I only see traces of 3 duplicates in the log... Where does the 4th come from?
The last duplicate doesn't appear in the logs of the remaining sentence because it wasn't linked to that sentence, unlike the other 3 duplicates.
Only the duplicates that were linked are logged? Then how can we find out which duplicates were merged?
The script, in its current state, only adds a comment on the deleted duplicate. So you can always find the sentence that was kept from the deleted sentence, but not the other way around. However, this is one of the improvements we would like to implement before running the duplicate merge for real on Tatoeba.
Being able to find the deleted duplicates does indeed seem to me to be an essential prerequisite...