162883 and 83091 are two rather long sentences that differ only in the subject: "my mom" vs. "my dad".
Shouldn't we remove one sentence from each such pair and concentrate on the gist instead of wasting our efforts on translating countless variants?
> the gist instead of wasting our efforts on translating
> countless variants?
There is a constant effort to remove near-duplicates. At the current rate we're probably losing a couple of dozen a week, if not more.
However, removing duplicates does not produce _new_ content. And new content is what's needed to fill out Tatoeba and make it more appealing.
Currently I am thinking about how I could involve my Japanese language exchange partner to produce some content. At the very least, I will check some sentences I found dubious with her.
So what would be the best procedure if I come across such a sentence pair? Make a comment? Add it to the "mark for deletion" list?
Our position is: people can do whatever they like. If they want to add all the possible variations, they can. If they don't want to, they don't have to.
It doesn't hurt to have "near duplicates". It just makes Tatoeba a bit noisy. But it's our job, as engineers, to figure out how to filter and organize the data so that language learners can use it efficiently.
Meanwhile, as sysko said, variations of sentences can be very useful for language processing, so we shouldn't delete them.
It also gives people a chance to notice what sentences I'm excluding and ask why (or just complain ;-).
I just want to know how we should approach sentences we think should not be there (e.g. hiragana-kanji variants of exactly the same sentence).
* Examples of what may be considered good near duplicates:
- He studies English in the early morning.
- She studies French in the late afternoon.
* Examples of what may be considered just clutter:
- He studies English.
- She studies English.
- John studies English.
- Fred studies English.
- That man studies English.
- That woman studies English.
Perhaps sentences of the "clutter" type need to be dealt with.
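For what it's worth, the "clutter" type could probably be spotted automatically. Here is a rough sketch in Python, assuming "clutter" means sentences of the same length that differ in exactly one word. The names `tokenize` and `one_word_variants` are made up for illustration and are not part of Tatoeba's code, and the whitespace tokenizer would not work for Japanese.

```python
from collections import defaultdict


def tokenize(sentence):
    # Naive tokenizer (an assumption): lowercase, drop final punctuation,
    # split on whitespace.
    return sentence.lower().rstrip(".!?").split()


def one_word_variants(sentences):
    """Group sentences that share a template with one word wildcarded."""
    buckets = defaultdict(set)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens)):
            # Replace the i-th word with "*" to build a template key.
            template = tuple(tokens[:i] + ["*"] + tokens[i + 1:])
            buckets[template].add(sentence)
    # Keep only the templates matched by more than one sentence.
    return [sorted(group) for group in buckets.values() if len(group) > 1]
```

Run on the examples above, this would put the four "He/She/John/Fred studies English" sentences in one group and the "That man/That woman" pair in another, while leaving the two "good" near duplicates alone because they differ in several words. It only flags candidates; whether to do anything with them is still a human decision.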