
** Update on http://dev.tatoeba.org/ **
We fixed some severe problems in the merging script and performed a new deduplication pass. Please have a look at how duplicates have been merged.
A few tips. You can use the following pages to pick up some merged duplicates:
• merging bot contribution log: http://dev.tatoeba.org/contributions/of_user/Horus
• comments from the merging bot: http://dev.tatoeba.org/sentence.../of_user/Horus
When on a sentence page, add or remove the “dev.” part of the URL to go back and forth between the non-merged and merged corpora for comparison.
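
The URL trick above can also be scripted when many sentence pages need comparing. This is only a minimal sketch: the helper name `toggle_dev` is made up for illustration, and it assumes the host is exactly `tatoeba.org` or `dev.tatoeba.org`.

```python
from urllib.parse import urlsplit, urlunsplit


def toggle_dev(url: str) -> str:
    """Switch a URL between the live host and its "dev." counterpart."""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith("dev."):
        host = host[len("dev."):]   # merged corpus -> non-merged corpus
    else:
        host = "dev." + host        # non-merged corpus -> merged corpus
    return urlunsplit(parts._replace(netloc=host))


print(toggle_dev("http://dev.tatoeba.org/eng/sentences/show/1179555"))
print(toggle_dev("http://tatoeba.org/eng/sentences/show/1179555"))
```

Only the host is rewritten, so the path (language code and sentence number) stays the same on both sides.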

This time it seems that the script worked properly, even for duplicates that were linked together.

This time I lose 111,703 - 110,755 = 948 sentences (0.85%),
while French loses 248,522 - 245,926 = 2,596 (1.04%).
That is a bit more consistent because, this time, I am not losing more sentences than the French corpus!
Sorry for having produced a little under 1% of duplicates, but given the number of contributors and the number of distinct variants (in English: it's, this is, that is, that's, ...), I remain fairly proud of my results.
Meanwhile, Esperanto goes from 362,338 to 353,058, a decrease of 2.56%.
Italian goes from 284,911 to 277,499, a decrease of 2.60%.
Russian goes from 222,448 to 215,185, a decrease of 3.26%.
This is in line with CK's weekly announcements.
Individual bad cases are still possible but, statistically, this looks sound to me, so I give it my stamp of approval!
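
For anyone who wants to recheck these percentages, here is a small sketch. The counts are the ones quoted in this post; the dictionary labels and the helper name `loss` are only illustrative.

```python
# Sentence counts (before, after) the merge, as quoted in the post above.
counts = {
    "poster's own sentences": (111_703, 110_755),
    "French": (248_522, 245_926),
    "Esperanto": (362_338, 353_058),
    "Italian": (284_911, 277_499),
    "Russian": (222_448, 215_185),
}


def loss(before: int, after: int) -> tuple[int, float]:
    """Return the number of sentences removed and the loss in percent."""
    removed = before - after
    return removed, 100 * removed / before


for name, (before, after) in counts.items():
    removed, pct = loss(before, after)
    print(f"{name}: -{removed} ({pct:.2f}%)")
```

With ordinary rounding to two decimals, the Russian figure prints as 3.27% rather than the 3.26% quoted, because the exact value is about 3.265%. The other figures match.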

I see some problems, mainly in the logs.
1. According to the log, many sentences are linked to themselves (when duplicates had been linked to each other). For example, http://dev.tatoeba.org/eng/sentences/show/1179555 is "linked to #1179555", which is obviously strange.
2. When deleting a sentence, Horus claims that he only deletes a sentence, while he actually deletes a sentence and its links. As a result, the number of "linked to" entries minus the number of "unlinked from" entries in the log doesn't match the number of links. Again, this seems to happen only when duplicates had been linked to each other. Maybe it's better to delete the log entries about these links between duplicates altogether.
3. Horus also doesn't tell us when he links sentences. So when we, for example, see the log of http://dev.tatoeba.org/eng/sentences/show/486749, it looks as if it's still linked to #3466390, even though in fact Horus deleted this link and linked the sentence to #2178443 instead. Of course #3466390 and #2178443 are the same thing, but I think it's a little problematic that we can't easily know who linked the sentence to #2178443 and which sentence marafon linked to (no sentence is highlighted when I click on "linked to #3466390".)
However, I'm not sure what the best thing to do is. I'm tempted to simply change the log to "marafon - Sep 2nd 2014, 12:52 linked to #2178443", but we shouldn't do this. When a sentence has been changed, falsification of logs can cause serious problems.
4. Do you save information about the sentences deleted by Horus somewhere? Information about their owners is especially important because some users use only the sentences owned by particular contributors. In the same way, information about tags and the users who added them is also important. I'm sure we're going to take advantage of this information in the future (when more than one user can add an OK tag, for example).
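
To make points 1 and 2 concrete, here is a rough sketch of how links could be rewritten after a merge so that a link between two duplicates is dropped (and reported) instead of becoming a self-link. This is not Horus's actual script. The function name `remap_links` and the sentence numbers 10, 20 and 30 are invented for illustration.

```python
def remap_links(links, merged_into):
    """Rewrite (sentence, translation) links after duplicates are merged.

    links: iterable of (sentence_id, translation_id) pairs.
    merged_into: dict mapping each deleted duplicate to the sentence kept.
    Returns the surviving links and the original pairs that were dropped
    because both ends now refer to the same sentence.
    """
    kept = set()
    dropped = []
    for a, b in links:
        a2 = merged_into.get(a, a)
        b2 = merged_into.get(b, b)
        if a2 == b2:
            # The two ends were duplicates of each other: log an explicit
            # "unlinked" entry here rather than creating "linked to itself".
            dropped.append((a, b))
            continue
        kept.add((a2, b2))
    return kept, dropped


# Hypothetical data: 10 and 20 are duplicates linked to each other,
# and both are linked to 30. Sentence 20 is merged into 10.
links = [(10, 20), (20, 10), (10, 30), (20, 30)]
kept, dropped = remap_links(links, {20: 10})
print(kept)     # links that survive the merge
print(dropped)  # links that need an "unlinked from" log entry
```

Counting the dropped pairs as "unlinked from" entries would make the log totals match the real number of links.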

Thank you for the detailed report. I’ll answer each of your points.
1. You’re right, we’ll have to fix this.
2. Yes.
3. Interesting question. How about having 3 log entries in the merged sentence:
• link #3466390 [by marafon]
• unlink #3466390 [by Horus]
• link #2178443 [by Horus]
This would keep logs consistent.
4. Good point.
We technically have the deleted sentence authorship information inside the deduplication log, but it’s not directly usable. This log is just a giant text file of 200MB+ with all the operations performed by Horus at a low level. I’m not sure what’s the best thing to do to ease the life of people who focus on sentences of particular contributors.
Tag authorship is not lost. While the website interface prevents a contributor from adding a tag if someone else already added it, it’s technically possible to have multiple users tagging the same sentence with the same tag. And the deduplication produces such sentences. See for instance:
http://dev.tatoeba.org/jpn/sentences/show/19549
http://dev.tatoeba.org/jpn/sentences/show/404156
http://dev.tatoeba.org/jpn/sentences/show/633544
http://dev.tatoeba.org/jpn/sentences/show/2094558
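
The three-entry log suggested in point 3 could be sketched as follows. This is an illustration only, not the site's real log schema: the `LogEntry` fields and the `relink_log` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    action: str  # "link" or "unlink"
    target: int  # sentence number the action refers to
    user: str    # who performed the action


def relink_log(log, old_target, new_target, bot="Horus"):
    """Append an unlink/link pair instead of rewriting the existing entry."""
    return log + [
        LogEntry("unlink", old_target, bot),
        LogEntry("link", new_target, bot),
    ]


log = [LogEntry("link", 3466390, "marafon")]
log = relink_log(log, old_target=3466390, new_target=2178443)
for entry in log:
    print(f"{entry.action} #{entry.target} [by {entry.user}]")
```

The original entry by marafon is left untouched, and the bot's two entries explain why the current link points somewhere else.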

> How about having 3 log entries in the merged sentence:
Looks nice. Honesty is the best policy. ☺
It's still not easy to find out who linked the sentence to #2178443, though. Well, of course it's Horus. That's clear. But most of us would be more interested to know that it's marafon who linked it to "Мне это не нужно." I wonder if there's a nice way to show it.
The other direction would be easier. It would be nice if #3466390 were highlighted when we click on "linked to #3466390". This way, we can know what marafon did without going to another page. (I'm talking about this feature: http://tatoeba.org/jpn/wall/sho...essage_20026.)
> I’m not sure what’s the best thing to do to ease the life of people who focus on sentences of particular contributors.
It would probably be very nice if there were a simple downloadable file that shows the owners of the sentences deleted by Horus.
If #2810855 (nancy) and #2990200 (hayastan) are deleted and #1209530 remains, the file would look like this:
1209530[tab]nancy
1209530[tab]hayastan
(I don't know Spanish, but I guess learners of languages like Spanish would find this especially useful.)
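
A file in the two-column format shown above could be produced along these lines. This is a sketch under assumptions: the shape of the `merges` dictionary and the function name `write_deleted_owners` are invented here, and the real deduplication log would first have to be parsed into such a structure.

```python
import csv
import io


def write_deleted_owners(merges, out):
    """Write one "kept_id<TAB>owner" row per deleted duplicate.

    merges: dict mapping the kept sentence number to a list of
    (deleted_sentence_number, owner) pairs.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for kept_id, deleted in merges.items():
        for _deleted_id, owner in deleted:
            writer.writerow([kept_id, owner])


# The example from this post: #2810855 and #2990200 merged into #1209530.
merges = {1209530: [(2810855, "nancy"), (2990200, "hayastan")]}
buf = io.StringIO()
write_deleted_owners(merges, buf)
print(buf.getvalue(), end="")
```

Writing to a real file instead of `io.StringIO` only requires passing an open file object as `out`.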

I also found out that logs and comments sometimes look chaotic when a sentence has been changed.
There are at least three mysteries on this page: http://dev.tatoeba.org/eng/sentences/show/2144030.
1. It looks as if Eckhardtgabriel and yayoi linked this sentence to others when the sentence didn't exist yet, and then CK added the sentence. (This is because the logs don't tell us that Eckhardtgabriel added #483087 "Why you don't take off your coat?")
2. It looks as if CK unlinked #2144030 "Why don't you take off your coat?" from #2143848 "どうしてコートを脱がないのですか." even though they match. (He actually unlinked #483087 "Why you don't take off your coat?" from #2143848.)
3. It looks as if raggione told CK that the word order was wrong and Eckhardtgabriel corrected CK's sentence, which was already correct.
Things get more complicated when both sentences have been changed. We can see the authentic logs of deleted sentences, but we cannot see what really happened to the surviving ones before the merger.
I wish I could make a constructive suggestion, but nothing comes to mind now. I'll post again when I have a concrete idea.
By the way, I think there should be a way to go from #2144030 to #483087.
I like loolmeh's idea here https://github.com/Tatoeba/tato...ment-64121681. You can add "#483087を併合" ("merged #483087") to the logs of #2144030 and "#2144030に併合" ("merged into #2144030") to the logs of #483087. If you do this, Horus wouldn't have to post (English) comments anymore.

>This is because the logs don't tell us that Eckhardtgabriel added #483087 "Why you don't take off your coat?"
I think this is because he didn't add it. It's probably from the Tanaka Corpus. Eckhardtgabriel merely adopted it, but it appears as if he added it, which, I agree, is confusing.