Tatoeba

TRANG, December 16, 2014 at 3:40:34 PM UTC

** Sentence deduplication **

We ran the deduplication script on the dev server last weekend. This exposed some further issues caused by sentences such as #613428, which has the duplicates #2505312, #2505313, #2509131, and #2509135.

The issues were fixed, and the script completed yesterday. It seems to deduplicate properly.
You can check some of the sentences that were deduplicated by looking at Horus' comments[1].
You can also check the export files of the dev database: http://downloads.tatoeba.org/dev/
and compare them with the ones from prod: http://tatoeba.org/eng/downloads


There are 2 problems remaining.

1) The way things are logged can be confusing.
For instance: http://dev.tatoeba.org/eng/sentences/show/613428

Question: is it a problem for anyone if things are logged this way?
We could spend more time making the logs more user-friendly, or we could just leave them this way.


2) Comments are currently copied (instead of being moved) to the main sentence.
For instance: http://dev.tatoeba.org/eng/sentences/show/3550769.
If you go to the main sentence: http://dev.tatoeba.org/eng/sentences/show/1926402
you will see that there is a copy of the comment.

Question: do you prefer to have the comments on both the main and the duplicate sentence? Or only on the main sentence?


Thanks in advance for your feedback.


-----

[1] http://dev.tatoeba.org/eng/sent.../of_user/Horus

sharptoothed, December 16, 2014 at 4:43:04 PM UTC

1) The logs look somewhat verbose, but I'm sure I can live with it. The benefit of de-duplication greatly exceeds this inconvenience, I think. Besides, subsequent runs of the script will generate far fewer log entries, I believe.
2) I think it's OK the way it is now.

AlanF_US, December 16, 2014 at 4:45:53 PM UTC

I'm thrilled that we've gotten this far! Thanks to everyone who has made it happen.

I would definitely be in favor of running the script now and making improvements to the log later, if at all. It's already possible to get the information we need now, and we would all benefit from having those duplicates merged as soon as possible.

If and when we do make changes to the logging, is it possible to report the number of the sentence on which an operation was performed as well as the operation? For instance, instead of:

linked to #123456

could we write this?

#111111 linked to #123456

Optionally, that first number could be suppressed where it's identical to the sentence where the log is being displayed. But if it's displayed all the time, that's fine, too. It doesn't take up much horizontal space.
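A minimal sketch of that log format in Python, for illustration only. It is not Tatoeba's actual code; `format_log_entry` and its parameters are made-up names.

```python
def format_log_entry(subject_id, action, target_id, page_sentence_id=None):
    """Render one log line such as '#111111 linked to #123456'.

    If the subject is the sentence whose page is being displayed, the
    leading number is dropped (the optional suppression described above).
    """
    text = f"{action} #{target_id}"
    if page_sentence_id is not None and subject_id == page_sentence_id:
        return text
    return f"#{subject_id} {text}"


# Always show the subject sentence number.
print(format_log_entry(111111, "linked to", 123456))
# Suppress it when the log is shown on the page of sentence #111111.
print(format_log_entry(111111, "linked to", 123456, page_sentence_id=111111))
```

Leaving `page_sentence_id` unset gives the "displayed all the time" variant.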

Whether comments are copied or not doesn't matter to me, so I would say the current behavior is fine.

TRANG, December 16, 2014 at 6:40:02 PM UTC

> is it possible to report the number of the sentence on which an operation
> was performed as well as the operation?

It's already displayed, just not displayed clearly.

If you look for instance at my logs:
http://tatoeba.org/eng/contributions/of_user/TRANG

You have:
Sentence #3649129
linked to #3649128

When the logs are on the sentence's page, all the operations concern only the sentence itself, which is why the sentence number is not displayed.

If there is a place outside of the sentence's page where the sentence id is not displayed, let me know.

tommy_san, December 16, 2014 at 5:47:16 PM UTC, edited December 16, 2014 at 5:47:59 PM UTC

I think we should choose either intelligibility or exactness. Probably the latter. The current pages are neither intelligible nor exact.

I'm still not happy with sentences like http://dev.tatoeba.org/fre/sentences/show/2144030 (as I wrote earlier in http://tatoeba.org/eng/wall/sho...essage_21045).

I agree with Alan and think that sentence numbers should be displayed. That would solve most problems. Besides, every addition of a sentence should be logged.

Eckhardtgabriel - Aug 27th 2010, 19:49
added #483087
Why you don't take off your coat?

CK - Jan 14th 2013, 15:57
unlinked #483087 from #2143848

raggione
May 9th 2014, 22:22
[Comment to #483087]
Word order: Why don't you take off our coat?

I know this doesn't look very nice, but each sentence number has its history. You can't just pretend they were the same.

TRANG, December 16, 2014 at 7:25:11 PM UTC

> I agree with Alan and think that sentence numbers should be displayed.
> That would solve most problems.

As I've replied to Alan, the sentence numbers are displayed on all the pages other than the sentence's page. On the sentence's page, all the operations only concern the sentence itself so the number would be repeated everywhere. You won't see something like "#2 linked to #3" on sentence #1.


Regarding the comments, the question is whether this needs to be fixed, which could delay the deduplication by an extra week, or whether it's fine to have some data that is not completely coherent and run the script earlier.

As far as I'm concerned, I don't think there is a hurry to run the script so we could take the extra time, or an extra month if needed. But I know that people have been waiting for the deduplication for years, literally, and may just be alright with not having the perfect script.

In order to solve your problem, the simplest solution I can think of would be:

1) We don't copy the logs of the duplicates; we leave them where they are. Once the deduplication is done, we add a comment on the main sentence to indicate which duplicates were found.
For instance: "This sentence had duplicates: #123, #456, #789. The duplicates have been deleted."

2) We do copy the comments of the duplicates, but we add a short message. "Comment copied from #123 because of duplicate merge".
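The two rules above could be sketched roughly as follows. This is a simplified Python illustration, not the actual deduplication script; the `Sentence` class, `merge_duplicates`, and the in-memory lists are assumptions standing in for the real database tables.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    id: int
    comments: list = field(default_factory=list)  # comment texts only
    logs: list = field(default_factory=list)      # log entries


def merge_duplicates(main, duplicates):
    """Merge `duplicates` into `main` following the two proposed rules."""
    for dup in duplicates:
        # Rule 2: copy each comment and say where it came from.
        for text in dup.comments:
            main.comments.append(
                f"{text}\n[Comment copied from #{dup.id} because of duplicate merge]"
            )
        # Rule 1: dup.logs is deliberately left where it is, not copied.
    # Rule 1, continued: one summary comment on the main sentence.
    ids = ", ".join(f"#{dup.id}" for dup in duplicates)
    main.comments.append(
        f"This sentence had duplicates: {ids}. The duplicates have been deleted."
    )
    return main


# Hypothetical ids, matching the example message above.
main = Sentence(1)
dups = [Sentence(123, comments=["Word order?"], logs=["added #123"]), Sentence(456)]
merge_duplicates(main, dups)
print("\n---\n".join(main.comments))
```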

tommy_san, December 16, 2014 at 8:25:04 PM UTC

I basically agree, but is Horus going to speak only English?

TRANG, December 16, 2014 at 10:17:14 PM UTC

For the moment yes. There are definitely solutions to translate Horus' comments but I see nothing easy enough that would be worth investing the time and delaying the deduplication.

TRANG, December 17, 2014 at 2:49:43 PM UTC

Considering that it will be tedious to change the text of the messages added by Horus after the deduplication, I'd like to have your opinion on what should be written.
Keep in mind that the text will be only in English, it won't be translated like the rest of the interface. So we should try to make it simple enough for people who do not speak English.

I'll post some suggestions below. Feel free to suggest something else.


1) The message on the deleted duplicate.

- http://dev.tatoeba.org/eng/sent...comment-487064
- http://dev.tatoeba.org/eng/sent...comment-487062
- http://dev.tatoeba.org/eng/sent...comment-487063
- http://dev.tatoeba.org/eng/sent...comment-487060


2) The message on the remaining sentence

- http://dev.tatoeba.org/eng/sent...comment-487067


3) The extra info in the comments that were copied.

- http://dev.tatoeba.org/eng/sent...comment-434496
- http://dev.tatoeba.org/eng/sent...comment-469692
- http://dev.tatoeba.org/eng/sent...comment-485258

sacredceltic, December 20, 2014 at 7:36:15 PM UTC

In the example you provide, you say you performed 4 deletions, but I only see traces of 3 duplicates in the log... Where does the 4th come from?

TRANG, December 21, 2014 at 7:18:06 AM UTC

The last duplicate does not appear in the logs of the remaining sentence because it was not linked to that sentence, unlike the other 3 duplicates.

sacredceltic, December 21, 2014 at 1:33:20 PM UTC

Only the duplicates that were linked are tracked? But then how can we tell which duplicates were merged?

TRANG, December 22, 2014 at 9:19:34 PM UTC

The script, in its current state, only adds a comment on the deleted duplicate. So you can always find the sentence that was kept starting from the deleted sentence, but not the other way around. However, this is one of the improvements we would like to implement before running the duplicate merge for real on Tatoeba.

https://github.com/Tatoeba/tatoeba2/issues/540
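To illustrate the missing direction: if the deleted-to-kept information is available as a mapping, the reverse lookup can be derived from it. This is only a sketch in Python with hypothetical ids; `build_reverse_index` is a made-up helper, and this is not necessarily how the issue above will be resolved.

```python
from collections import defaultdict


def build_reverse_index(kept_by_deleted):
    """Invert a {deleted_id: kept_id} mapping into {kept_id: [deleted_ids]}."""
    deleted_by_kept = defaultdict(list)
    for deleted_id, kept_id in kept_by_deleted.items():
        deleted_by_kept[kept_id].append(deleted_id)
    # Sort so the listing is stable regardless of input order.
    return {kept: sorted(ids) for kept, ids in deleted_by_kept.items()}


# Hypothetical data: three duplicates merged into #100, one into #200.
kept_by_deleted = {123: 100, 456: 100, 789: 100, 555: 200}
print(build_reverse_index(kept_by_deleted))
```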

sacredceltic, December 22, 2014 at 9:48:16 PM UTC

Being able to find the deleted duplicates does indeed seem to me to be an essential prerequisite...