Thread #21231

TRANG 2014-12-16 15:40:34 UTC

** Sentence deduplication **

We ran the deduplication script on the dev server last weekend. We found some other issues because of sentences such as #613428, which has the duplicates #2505312, #2505313, #2509131, and #2509135.

The issues were fixed. The script completed yesterday. It seems to deduplicate properly.
You can go and check some of the sentences that were deduplicated by looking at Horus' comments[1].
You can also check the export files of the dev database: http://downloads.tatoeba.org/dev/
and compare them with the ones from production: http://tatoeba.org/eng/downloads

There are 2 problems remaining.

1) The way things are logged can be confusing.
For instance: http://dev.tatoeba.org/eng/sentences/show/613428

Question: is it a problem for anyone if things are logged this way?
We could try and spend more time to make the logs more user friendly, or we could just leave it this way.

2) Comments are currently copied (instead of being moved) to the main sentence.
For instance: http://dev.tatoeba.org/eng/sentences/show/3550769.
If you go to the main sentence: http://dev.tatoeba.org/eng/sentences/show/1926402
you will see that there is a copy of the comment.

Question: do you prefer to have the comments on both the main and the duplicate sentence? Or only on the main sentence?

Thanks in advance for your feedback.

-----

[1] http://dev.tatoeba.org/eng/sent.../of_user/Horus

sharptoothed 2014-12-16 16:43:04 UTC

1) The logs look somewhat verbose, but I'm sure I can live with it. The benefit of de-duplication greatly outweighs this inconvenience, I think. Besides, I believe subsequent runs of the script will generate far fewer log entries.
2) I think it's OK the way it is now.

AlanF_US 2014-12-16 16:45:53 UTC

I'm thrilled that we've gotten this far! Thanks to everyone who has made it happen.

I would definitely be in favor of running the script now and making improvements to the log later, if at all. It's already possible to get the information we need now, and we would all benefit from having those duplicates merged as soon as possible.

If and when we do make changes to the logging, is it possible to report the number of the sentence on which an operation was performed as well as the operation? For instance, instead of:

linked to #123456

could we write this?

#111111 linked to #123456

Optionally, that first number could be suppressed where it's identical to the sentence where the log is being displayed. But if it's displayed all the time, that's fine, too. It doesn't take up much horizontal space.
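
The format suggested above can be sketched in a few lines of Python. This is only an illustration: `format_log_entry`, its parameters, and `page_sentence_id` are hypothetical names, not part of Tatoeba's code.

```python
from typing import Optional


def format_log_entry(sentence_id: int, operation: str, target_id: int,
                     page_sentence_id: Optional[int] = None) -> str:
    """Render one log line, e.g. '#111111 linked to #123456'.

    Hypothetical helper: when the entry concerns the sentence whose page is
    being displayed (page_sentence_id), the leading number is omitted.
    """
    line = f"{operation} #{target_id}"
    if sentence_id != page_sentence_id:
        # Outside the sentence's own page, say which sentence was affected.
        line = f"#{sentence_id} {line}"
    return line


# Full form, as it could appear in a user's contribution list.
print(format_log_entry(111111, "linked to", 123456))
# Short form, as it could appear on the page of sentence #111111 itself.
print(format_log_entry(111111, "linked to", 123456, page_sentence_id=111111))
```

Leaving `page_sentence_id` unset always prints the full form, which matches the "displayed all the time" option.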

Whether comments are copied or not doesn't matter to me, so I would say the current behavior is fine.

TRANG 2014-12-16 18:40:02 UTC

> is it possible to report the number of the sentence on which an operation
> was performed as well as the operation?

It's already displayed, just not displayed clearly.

If you look for instance at my logs:
http://tatoeba.org/eng/contributions/of_user/TRANG

You have:
Sentence #3649129
linked to #3649128

When the logs are on the sentence's page, all the operations concern only the sentence itself, which is why the sentence number is not displayed.

If there is a place outside of the sentence's page where the sentence id is not displayed, let me know.

tommy_san 2014-12-16 17:47:16 UTC, edited 2014-12-16 17:47:59 UTC

I think we should aim for either intelligibility or exactitude. Probably the latter. The current pages are neither intelligible nor exact.

I'm still not happy with sentences like http://dev.tatoeba.org/fre/sentences/show/2144030 (as I wrote earlier in http://tatoeba.org/eng/wall/sho...essage_21045).

I agree with Alan and think that sentence numbers should be displayed. That would solve most problems. Besides, every addition of a sentence should be logged.

Eckhardtgabriel - Aug 27th 2010, 19:49
added #483087
Why you don't take off your coat?

CK - Jan 14th 2013, 15:57
unlinked #483087 from #2143848

raggione
May 9th 2014, 22:22
[Comment to #483087]
Word order: Why don't you take off our coat?

I know this doesn't look very nice, but each sentence number has its history. You can't just pretend they were the same.

TRANG 2014-12-16 19:25:11 UTC

> I agree with Alan and think that sentence numbers should be displayed.
> That would solve most problems.

As I've replied to Alan, the sentence numbers are displayed on all the pages other than the sentence's page. On the sentence's page, all the operations only concern the sentence itself so the number would be repeated everywhere. You won't see something like "#2 linked to #3" on sentence #1.

Regarding the comments, the question is whether it's necessary to fix this, which means the deduplication could be delayed by an extra week, or whether it's fine to have some data that is not completely coherent and have the script run earlier.

As far as I'm concerned, I don't think there is a hurry to run the script so we could take the extra time, or an extra month if needed. But I know that people have been waiting for the deduplication for years, literally, and may just be alright with not having the perfect script.

In order to solve your problem, the simplest solution I can think of would be:

1) We don't copy the logs of the duplicates; we leave them where they are. Once the deduplication is done, we add a comment on the main sentence to indicate which duplicates were found.
For instance: "This sentence had duplicates: #123, #456, #789. The duplicates have been deleted."

2) We do copy the comments of the duplicates, but we add a short message. "Comment copied from #123 because of duplicate merge".
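
The two steps above could look roughly like this. It is a minimal sketch with an in-memory dictionary standing in for the database; `Comment`, `merge_duplicates`, and the `comments` mapping are hypothetical names, and "Horus" is used as the author of the summary comment only because Horus posts the script's comments in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Comment:
    author: str
    text: str


def merge_duplicates(main_id: int, duplicate_ids: List[int],
                     comments: Dict[int, List[Comment]]) -> None:
    """Copy comments from duplicates to the main sentence, with a note.

    Logs are deliberately not touched here (step 1); only comments are copied.
    """
    main_comments = comments.setdefault(main_id, [])
    for dup_id in duplicate_ids:
        # Step 2: copy each comment and say where it came from.
        for c in list(comments.get(dup_id, [])):
            note = f"Comment copied from #{dup_id} because of duplicate merge"
            main_comments.append(Comment(c.author, f"{c.text}\n[{note}]"))
    # Step 1: one summary comment listing the duplicates that were found.
    ids = ", ".join(f"#{d}" for d in duplicate_ids)
    main_comments.append(Comment(
        "Horus",
        f"This sentence had duplicates: {ids}. The duplicates have been deleted."))


# Sample data: sentence #456 has one comment, #789 has none.
comments = {456: [Comment("raggione", "Word order?")]}
merge_duplicates(123, [456, 789], comments)
for c in comments[123]:
    print(c.author, "-", c.text)
```

Because the comments are copied rather than moved, the duplicate keeps its own history, which is the point raised earlier in the thread.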

tommy_san 2014-12-16 20:25:04 UTC

I basically agree, but is Horus going to speak only English?

TRANG 2014-12-16 22:17:14 UTC

For the moment, yes. There are definitely ways to translate Horus' comments, but I see nothing easy enough to be worth investing the time and delaying the deduplication.

TRANG 2014-12-17 14:49:43 UTC

Considering that it will be tedious to change the text of the messages added by Horus after the deduplication, I'd like to have your opinion on what should be written.
Keep in mind that the text will be in English only; it won't be translated like the rest of the interface. So we should try to make it simple enough for people who do not speak English.

I'll post some suggestions below. Feel free to suggest something else.

1) The message on the deleted duplicate.

- http://dev.tatoeba.org/eng/sent...comment-487064
- http://dev.tatoeba.org/eng/sent...comment-487062
- http://dev.tatoeba.org/eng/sent...comment-487063
- http://dev.tatoeba.org/eng/sent...comment-487060

2) The message on the remaining sentence

- http://dev.tatoeba.org/eng/sent...comment-487067

3) The extra info in the comments that were copied.

- http://dev.tatoeba.org/eng/sent...comment-434496
- http://dev.tatoeba.org/eng/sent...comment-469692
- http://dev.tatoeba.org/eng/sent...comment-485258

sacredceltic 2014-12-20 19:36:15 UTC

In the example you provide, you say that you performed 4 deletions, but I only see traces of 3 duplicates in the log... Where does the 4th one come from?

TRANG 2014-12-21 7:18:06 UTC

The last duplicate does not appear in the logs of the remaining sentence because it was not linked to that sentence, unlike the other 3 duplicates.

sacredceltic 2014-12-21 13:33:20 UTC

Only the duplicates that were linked are tracked? But then how can we find out which duplicates were merged?

TRANG 2014-12-22 21:19:34 UTC

The script, in its current state, only adds a comment on the deleted duplicate. So you can always find the sentence that was kept starting from the deleted sentence, but not the other way around. However, this is one of the improvements we would like to implement before running the duplicate merge for real on Tatoeba.

https://github.com/Tatoeba/tatoeba2/issues/540
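
To illustrate the missing direction: if the "deleted sentence to kept sentence" pairs are available, the reverse lookup is a simple inversion. The sketch below is an assumption about how that could be done, not what the script or the linked issue actually does; `build_reverse_index` is a hypothetical name, and the IDs are borrowed from the first post of this thread purely as sample data (that #613428 is the sentence kept is assumed here).

```python
from collections import defaultdict
from typing import Dict, List


def build_reverse_index(kept_by_deleted: Dict[int, int]) -> Dict[int, List[int]]:
    """Invert a deleted -> kept mapping into kept -> sorted deleted IDs."""
    deleted_by_kept: Dict[int, List[int]] = defaultdict(list)
    for deleted_id, kept_id in kept_by_deleted.items():
        deleted_by_kept[kept_id].append(deleted_id)
    return {kept: sorted(ids) for kept, ids in deleted_by_kept.items()}


# Sample data only: each deleted duplicate points to the sentence kept.
kept_by_deleted = {
    2505312: 613428,
    2505313: 613428,
    2509131: 613428,
    2509135: 613428,
}
index = build_reverse_index(kept_by_deleted)
print(index[613428])
```

With such an index, the page of the remaining sentence could list every merged duplicate, whether or not it was linked to that sentence.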

sacredceltic 2014-12-22 21:48:16 UTC

Being able to find the deleted duplicates does indeed seem to me to be an essential prerequisite...
