Tatoeba
blay_paul, April 27, 2010 at 12:24:48 PM UTC

Duplicate removal script.

I don't know how the script works exactly, but I think it may be missing a step.

Suppose we have

100000 Hello.
100001 こんにちは。
100002 Hi.

100001 is linked to 100000
100001 has the meaning field of 100000

Now, suppose someone decides that 'Hello' and 'Hi' are close enough to not need both.

100000 Hello.
100001 こんにちは。
100002 Hi. ---> Hello.

Then suppose the script removes 100000.

100001 こんにちは。
100002 Hello.

Is 100001 still linked to 100000? It should be linked to the duplicate 100002 instead.

Does 100001 still have the meaning field of 100000? It should have the meaning field of 100002 instead.

In other words, if Sentence A is removed as a duplicate of Sentence B, then all the links that pointed to Sentence A should now point to Sentence B instead.
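
To make the point concrete, here is a minimal Python sketch of the relinking step. It is not the actual Tatoeba script. The function names `relink` and `repoint_meanings`, and the idea of storing links as a set of id pairs and the meaning field as a dictionary, are assumptions made only for illustration.

```python
def relink(links, removed_id, kept_id):
    """Return a new set of links with removed_id replaced by kept_id."""
    result = set()
    for a, b in links:
        a = kept_id if a == removed_id else a
        b = kept_id if b == removed_id else b
        if a != b:  # drop a link that would now point from a sentence to itself
            result.add((a, b))
    return result


def repoint_meanings(meaning_of, removed_id, kept_id):
    """Apply the same substitution to a (hypothetical) meaning-field mapping."""
    return {
        sentence_id: (kept_id if target == removed_id else target)
        for sentence_id, target in meaning_of.items()
    }


# The example from this post: 100000 is removed as a duplicate of 100002.
links = {(100001, 100000), (100000, 100001)}
meaning_of = {100001: 100000}

links = relink(links, removed_id=100000, kept_id=100002)
meaning_of = repoint_meanings(meaning_of, removed_id=100000, kept_id=100002)
print(sorted(links))
print(meaning_of)
```

After this step nothing refers to 100000 any more, so deleting it cannot leave a dangling link.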

sysko, April 27, 2010 at 1:08:05 PM UTC

The remove-duplicates script does the following:

It identifies all the sentences that have both the same language and the same text.
It then keeps the oldest sentence that is owned by someone (or the oldest one, if none of the duplicates belongs to anyone) and relinks everything that pointed to the other duplicates to this one
(comments, translations, lists, etc.).
Finally, it removes the duplicates and keeps only one.
So the script will not produce any broken references to a removed sentence.
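
The steps above can be sketched roughly as follows. This is only an illustration in Python, not the real script. It assumes that a lower id means an older sentence, that `owner` is `None` for unowned sentences, and that only translation links need relinking (comments and lists would need the same treatment). The `Sentence` class, the function names and the owner names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sentence:
    id: int
    lang: str
    text: str
    owner: Optional[str] = None  # None means nobody owns the sentence


def pick_survivor(group):
    """Oldest owned sentence, or the oldest overall if none is owned."""
    owned = [s for s in group if s.owner is not None]
    return min(owned or group, key=lambda s: s.id)  # assumes lower id = older


def remove_duplicates(sentences, links):
    # Step 1: group sentences that share both language and text.
    groups = defaultdict(list)
    for s in sentences:
        groups[(s.lang, s.text)].append(s)

    # Step 2: choose one survivor per group and map every other id to it.
    replace = {}
    for group in groups.values():
        keep = pick_survivor(group)
        for s in group:
            if s.id != keep.id:
                replace[s.id] = keep.id

    # Step 3: relink first, so that removal cannot break a reference.
    new_links = set()
    for a, b in links:
        a, b = replace.get(a, a), replace.get(b, b)
        if a != b:
            new_links.add((a, b))

    # Step 4: drop the duplicates.
    survivors = [s for s in sentences if s.id not in replace]
    return survivors, new_links


sentences = [
    Sentence(100000, "eng", "Hello."),
    Sentence(100001, "jpn", "こんにちは。", owner="alice"),
    Sentence(100002, "eng", "Hello.", owner="bob"),
]
links = {(100001, 100000), (100000, 100001)}
survivors, links = remove_duplicates(sentences, links)
print([s.id for s in survivors])
print(sorted(links))
```

In this made-up data 100002 survives even though 100000 is older, because 100002 is the only "Hello." that has an owner.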

blay_paul, April 27, 2010 at 1:22:14 PM UTC

> so the script will not produce any broken references to
> a removed sentence

There are, however, some broken references being produced. It's not clear how, though.

236727 あなたには姉妹がいますか。
was linked to 71123, which now no longer exists.
69566 Do you have any sisters?
does exist and was indirectly linked from 236727.

I don't know when 71123 was removed, why it was removed, or how it was removed, but something obviously went wrong somewhere. (It was one of the \N records last week, so it obviously isn't a recent deletion.)

Hopefully these broken links are left over from earlier times and won't recur.
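
For what it's worth, leftovers like this can be found with a simple check. The sketch below is a hypothetical Python helper, not something that exists in Tatoeba: it assumes you have the list of existing sentence ids and the list of links as id pairs, and it reports every link with an endpoint that no longer exists.

```python
def find_broken_links(sentence_ids, links):
    """Return the links whose source or target is not an existing sentence."""
    existing = set(sentence_ids)
    return sorted(
        (a, b) for a, b in links if a not in existing or b not in existing
    )


# The case from this post: 236727 points at 71123, which no longer exists.
sentence_ids = [69566, 236727]
links = [(236727, 71123)]
print(find_broken_links(sentence_ids, links))
```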

sysko, April 27, 2010 at 1:25:08 PM UTC

OK, at least the remove-duplicates script will not produce any more broken links.

Dorenda, April 27, 2010 at 2:10:55 PM UTC

> identify all the sentences that have both the same language and the same text

So it also merges duplicates that are not linked whatsoever?

sysko, April 27, 2010 at 2:15:01 PM UTC

Yep. That way, even if newcomers add
"I love you"
and translate it,

since "I love you" already exists, the script will delete the new "I love you" and link the translation to the old "I love you" (or remove the translation as well, if it already exists too).
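
A small sketch of those two cases, in Python and with made-up ids (1 is the old "I love you", 2 its existing translation, 3 the newcomer's "I love you", 4 the newcomer's translation). The function `merge_links` is hypothetical, and treating links as undirected pairs is an assumption for the sake of the example.

```python
def merge_links(links, replace):
    """Apply an id replacement map to undirected links, collapsing duplicates."""
    merged = set()
    for a, b in links:
        a, b = replace.get(a, a), replace.get(b, b)
        if a != b:
            merged.add(frozenset((a, b)))  # a set keeps each pair only once
    return merged


# Case 1: only the English sentence is a duplicate (3 is merged into 1).
# The newcomer's translation 4 ends up linked to the old sentence 1.
case_1 = merge_links([(3, 4)], {3: 1})

# Case 2: the translation already exists too (4 is merged into 2).
# The relinked pair becomes identical to the existing link, so one remains.
case_2 = merge_links([(1, 2), (3, 4)], {3: 1, 4: 2})

print(len(case_1), len(case_2))
```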