2010-06-03 08:12
I am not sure whether this is worth discussing, but there are some sentences which are really redundant, e.g.
162883, 83091, two rather long sentences which only differ in the subject being "my mom" vs. "my dad".
Shouldn't we remove one of such pairs and concentrate on the gist instead of wasting our efforts on translating countless variants?
2010-06-12 03:31
Some near duplicates are good since you can compare them, some near duplicates are perhaps just clutter.

* Examples of what may be considered good near duplicates:
- He studies English in the early morning.
- She studies French in the late afternoon.

* Examples of what may be considered to just be clutter:
- He studies English.
- She studies English.
- John studies English.
- Fred studies English.
- That man studies English.
- That woman studies English.

Perhaps sentences of the "clutter" type need to be dealt with.
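
One way to make the "good near duplicate" vs. "clutter" distinction above concrete is to count how many word slots differ between two sentences. The sketch below is only an illustration, not anything Tatoeba actually runs: the function names and the "exactly one changed slot" rule are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokens(sentence):
    """Lowercase the sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())


def changed_slots(a, b):
    """Count the contiguous word spans that differ between two sentences."""
    matcher = SequenceMatcher(None, tokens(a), tokens(b))
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


def is_clutter_candidate(a, b):
    """Hypothetical rule: a pair differing in exactly one slot is a clutter candidate."""
    return changed_slots(a, b) == 1


print(changed_slots("He studies English.", "She studies English."))
print(changed_slots("He studies English in the early morning.",
                    "She studies French in the late afternoon."))
```

With this rule the "He/She studies English" pair is flagged, while the early morning / late afternoon pair is not, because it differs in three slots. A human would still need to review whatever gets flagged.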
2010-06-12 10:43
Hi CK,
I completely agree with your notion of near duplicates versus clutter.
I think that besides "dealing" with clutter that already exists, we should also put some effort into guidelines about creating new content.
2010-06-11 19:08
Okay, I haven't replied to this yet, so I will, to make our position on "variations" of sentences clear.

Our position is: people can do whatever they like. If they want to add all the possible variations, they can. If they don't want to, they don't have to.

It doesn't hurt to have "near duplicates". It just makes Tatoeba a bit noisy. But that's our job, as engineers, to figure out how to filter and organize data so that it can be used efficiently by language learners.

Meanwhile, as sysko said, variations of sentences can be very useful for language processing, so we shouldn't delete them.
2010-06-11 19:43
Just to clarify the clarification. Near duplicates will be removed from WWWJDIC - but not by deleting them from Tatoeba. So feel free to point out Japanese sentences and English sentences linked to Japanese sentences that are near duplicates.
2010-06-19 13:26
Hi Paul, I saw you always post a comment "Not for WWWJDIC" in each sentence. Shouldn't that be solved by using tags?
2010-06-19 13:33
I could, but I started doing that before tags existed.

It also gives people a chance to notice what sentences I'm excluding and ask why (or just complain ;-).
2010-06-19 13:53
So do you filter the sentences according to your comment, or do you mark them somewhere else AND put a comment in?
I just want to know how we should approach sentences we find should not appear there (e.g. hiragana-kanji variants of exactly the same sentence.)
2010-06-19 14:08
In the secret sentence annotation page, where the Japanese index can be entered / edited, I put -1 in the meaning field.

No one else can see that so the note is just to let people know what I'm doing (generally excluding near-duplicate sentences from WWWJDIC).
2010-06-03 10:34
On the other hand, I'm working with another guy on a machine-learning-based automated translator, and these kinds of "near" duplicate sentences are REALLY useful.
2010-06-03 10:37
In fact, as a learner I also sometimes like to find this kind of sentence where only one part changes; it's easier to see some grammar points this way. For example, in French, changing "my mom" to "my dad" could change the verbs, adjectives and so on in the sentence, and it's always interesting to see this variation on the same sentence.
2010-06-03 17:23
On this point, I've chosen to add these nuances in comments. There are otherwise just going to be way too many similar sentences.
2010-06-03 09:24
> Shouldn't we remove one of such pairs and concentrate on
> the gist instead of wasting our efforts on translating
> countless variants?

There is a constant effort to remove near-duplicates. At the current rate we're probably losing a couple of dozen a week, if not more.

However removing duplicates does not produce _new_ content. And new content is what's needed to fill out Tatoeba and make it more appealing.
2010-06-03 10:06
Yes, you are right, producing new content is also important, though I as a native German speaker am right now mostly busy with adding German translations to the already existing Jap-Eng. sentence pairs. And that's when I came across these near-duplicates.
Currently I am thinking about how I could involve my Japanese language exchange partner to produce some content. At least, I will check with her some sentences I found dubious.

So what would be the best procedure if I come across such a sentence pair? Make a comment? Add it to the "mark for deletion" list?
2010-06-03 10:44
Moreover, I think the problem here is not whether or not to have these countless variants (for the reasons below I would prefer to keep them), but rather "how to show contributors only 'useful' sentences".
2010-06-10 13:00
Please could we have the duplicate removal script run soon? (Before Saturday, anyway)
2010-06-10 17:32
I would also like to ask for a manual update of the Launchpad translations (in the Tatoeba > Launchpad direction) so that all the new stuff can be translated :) thanks^^
2010-06-11 00:55
2010-06-10 01:29
Re: × 君〔くん〕 ○ 君〔きみ〕

Perhaps dropping the MeCab "furigana" and encouraging people who need help with readings to use Firefox with the Rikaichan plugin would help solve this kind of problem (and also perhaps speed up Tatoeba a bit).

As a programmer, I understand the fun of trying to get all these things working correctly, but as a teacher, I would prefer things to be 100% accurate.

Rikaichan shows both readings and meanings for 君. In addition to that, students also get the EDICT English definitions which can often be useful.

Rikaichan's URL:

The following dictionaries are available for Rikaichan.
2010-06-10 11:36
But that would imply that everyone uses Firefox (or has to) ^^

Also, since Tatoeba has a broader public than students in a classroom, it wouldn't necessarily be a good thing to drop the furigana. A lot of people would rather have something that is 80-90% accurate than not having anything at all, because it saves them time. And in the end, it's people's own responsibility to decide whether they want something perfect or not.

Generating furigana is not what slows down Tatoeba the most. You wouldn't see a difference in speed if we took it out, so speed alone wouldn't justify removing it.

But as a teacher, you can (and actually *must*) educate your students not to rely on the furigana line, and use Rikaichan instead. It's not incompatible. I use it myself (and always did) whenever I want to figure out the reading of a Japanese sentence in Tatoeba, despite the fact that the reading is already displayed.
2010-06-09 17:05
On reading this comment, plus seeing sentences that are perhaps not "child friendly" or appropriate for use in school settings, I wonder if perhaps Tatoeba has matured to the point where something similar to Google's "safe search" needs to be considered.

I know that it's quite likely that no one wants to be responsible for deciding where to draw the line. I also have a feeling that some members will see no need to even consider that this is something that may be needed.

However, perhaps one possibility would be to have certain "potentially offensive" sentences only be displayed to members, and also only to members who "opt in" for seeing them.

Consider how a first time visitor may feel if he/she arrives on the main page and an "offensive" sentence is the one that is randomly displayed on the page.

This is something that should probably at least be added to Tatoeba's list of things to be considered.

2010-06-09 17:21
Once tags are added, we will be able to do the following:
use tags such as "unsuitable_for_children", etc., and maintain a list of tags that will not be accessible to non-users or to users who have not activated the "unsafe search" option.
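
The tag idea above could look roughly like the sketch below. It is only an illustration under assumptions: the `Sentence` and `Viewer` classes, the `unsafe_search` flag and the contents of `RESTRICTED_TAGS` are hypothetical names, not Tatoeba's actual data model.

```python
from dataclasses import dataclass, field

# Hypothetical list of tags that hide a sentence unless the viewer opted in.
RESTRICTED_TAGS = {"unsuitable_for_children"}


@dataclass
class Sentence:
    text: str
    tags: set = field(default_factory=set)


@dataclass
class Viewer:
    logged_in: bool = False      # False for a non-member / first-time visitor
    unsafe_search: bool = False  # opt-in setting, off by default


def is_visible(sentence, viewer):
    """Show untagged sentences to everyone; restricted ones only to opted-in members."""
    if not (sentence.tags & RESTRICTED_TAGS):
        return True
    return viewer.logged_in and viewer.unsafe_search


def filter_sentences(sentences, viewer):
    """Keep only the sentences this viewer is allowed to see."""
    return [s for s in sentences if is_visible(s, viewer)]
```

Because both flags default to off, a first-time visitor landing on the main page would never be shown a restricted sentence as the random one.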
2010-06-10 10:24
As for me, I am very, very small.
2010-06-10 09:21
Why not add a few more search hints?

For example, I tried the following in the search and it worked.
(For those who can't guess what it will get, try it.)


2010-06-10 09:33
Simply because we haven't found the time yet to rewrite the hint page :P
2010-06-09 17:58
If okurigana is incorrect, can we just file this as bugs in MeCab?
2010-06-09 18:57
In theory. However they are not really bugs in MeCab, but problems resulting from the dictionary used with MeCab. The dictionary used can both be selected (from a very short list ;-) and can be altered or aided by user-defined dictionaries.

So really what it needs is someone familiar with MeCab to find the best dictionary available and to add fixes for the problems noted.

However it probably is never going to be possible to be 100% accurate so to get the best results manual corrections will be needed at some point.
2010-06-10 08:44
It's not that simple.

Consider the sentence: 君たちの訳文と黒板の訳を比較しなさい。 MeCab suggests わけ as the reading of the solo 訳, whereas we all know it's やく. MeCab's usual dictionary (NAIST-JDIC) has both versions of 訳, and no amount of adding dictionaries is going to "fix" it. MeCab uses some very sophisticated AI to segment sentences, and the dictionaries have parameters derived from training on hand-segmented texts. The trouble is that ...の訳を... could be either, and you need the context of the whole sentence to decide which is which. In fact the weightings for 訳/わけ and 訳/やく as solo lexemes are the same. You could probably fiddle the weights on 訳 to make it produce やく, but most of the solo appearances of 訳 in Tatoeba are in fact わけ.
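
To illustrate the weight-fiddling point, here is a deliberately crude toy model. It is NOT how MeCab works internally: it just picks one reading per word from a cost table and ignores context. The cost values and the two extra example sentences are made up for illustration; only the first sentence comes from this thread.

```python
def pick_reading(costs):
    """Return the reading with the lowest cost; on a tie, the first entry wins."""
    return min(costs, key=costs.get)


def count_correct(costs, examples):
    """Count examples whose correct reading matches the single context-free pick."""
    guess = pick_reading(costs)
    return sum(1 for _, gold in examples if gold == guess)


# (sentence, correct reading of the solo 訳)
EXAMPLES = [
    ("君たちの訳文と黒板の訳を比較しなさい。", "やく"),  # the sentence from this thread
    ("そういう訳ではない。", "わけ"),                      # illustrative, made up
    ("遅れた訳を話してください。", "わけ"),                # illustrative, made up
]

equal_costs = {"わけ": 100, "やく": 100}  # hypothetical equal weights
tuned_costs = {"わけ": 100, "やく": 90}   # hypothetical tweak favouring やく

print(pick_reading(equal_costs), count_correct(equal_costs, EXAMPLES))
print(pick_reading(tuned_costs), count_correct(tuned_costs, EXAMPLES))
```

In this toy, tweaking the cost fixes the first sentence but breaks the other two, which is the trade-off described above: without whole-sentence context, one global weight cannot be right for both readings.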

There is a whole research field of Word Sense Disambiguation (WSD) working on problems related to this, but I don't think there are any packaged solutions for Japanese that can be plugged into Tatoeba. Just be grateful we have MeCab - 20 years ago automatic Japanese segmenters were thought to be impossible to build.
2010-06-09 18:43
2010-06-09 10:38
Help wanted!

I'm looking for someone to help with Tatoeba / WWWJDIC integration. Specifically I'd like someone with database / web experience to work on tools for validating / completing the index data needed to link WWWJDIC dictionary entries to Tatoeba example sentences. If you're interested post here for more details or PM me.
2010-06-08 04:53
revisiting sentence variation...

I came across a sentence in Arabic (thx to qahwa's comment on one of my sentences :D)... It shows a property of the Arabic script that I want to document with 3 or 4 variations:

هل استلمتَ الرسالة؟
Did you(male) receive the letter?

هل استلمتِ الرسالة؟
Did you(fem) receive the letter?

هل استلمَتْ الرسالة؟
Did she receive the letter?

Without the harakaat (vowel marks) they're all written the same.
Now if I do add them, I'm afraid they'll just get reported as similar sentences and get merged/deleted/etc... (I mean the English sentences ofc)

What's Tatoeba's 'official' statement on how to deal with this?
2010-06-08 15:01
Similar things happen in Portuguese

Esse é seu brinquedo.

can be

[Hey, you, ]
This is your toy.

[Bob likes to play.]
This is his toy.

[Mary likes to play.]
This is her toy.

[Tex the armadillo likes to play.]
This is its toy.

There are unambiguous ways to say these sentences in Portuguese, but they are not used that often.
2010-06-08 20:41
I don't think that this is the same problem as in Arabic.
It doesn't cause problems when Portuguese duplicates like your example are merged. But in Arabic it does. The example that saeb posted is the same *without* vowel marks, but with vowel marks it's no longer the same, and the pronunciation isn't the same either. So it *looks* like a duplicate, but it isn't.
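
The Arabic case can be checked mechanically. A minimal sketch, assuming the vowel marks are stored as separate Unicode combining characters (category "Mn"), which is the case for fatha, kasra and sukun; the function name is hypothetical and this is not Tatoeba's actual duplicate check.

```python
import unicodedata


def strip_marks(text):
    """Drop every nonspacing combining mark (Unicode category Mn), harakaat included."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


PREFIX = "هل استلم"
SUFFIX = " الرسالة؟"

variants = [
    PREFIX + "ت\u064e" + SUFFIX,           # fatha on the final letter: you (male)
    PREFIX + "ت\u0650" + SUFFIX,           # kasra on the final letter: you (fem)
    PREFIX + "\u064e" + "ت\u0652" + SUFFIX,  # fatha, then sukun on the final letter: she
]

print(len(set(variants)))                           # distinct as written
print(len({strip_marks(v) for v in variants}))      # distinct once marks are removed
```

So a duplicate detector that compares the raw strings keeps the three apart, while one that normalizes away the marks first would wrongly merge them.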
2010-06-09 16:07
I don’t know about Arabic, but I personally prefer to differentiate sentences that are different in speech.

E.g. in Russian, Belarusian and Ukrainian one doesn’t normally mark stress, but when it’s important, I do (as in the case with зáмок/замóк in sentences No. 385729 and 385728).
2010-06-08 14:31
New link for the typing game:

Now a little more user friendly and with longer texts.
2010-06-07 15:04
A general remark: Given the huge amount of Japanese data, I think we need more Japanese contributors to help us with corrections and to keep the Japanese sentences more consistent. Talking to my Japanese language partner recently, I came to the conclusion that with its current features, Tatoeba is rather unattractive for Japanese natives who want to learn another language, because most sentences are already translated into Japanese. For example, what would be the benefit for my partner, who is learning German?
2010-06-07 15:28
There are probably more sentences in German that aren't translated into Japanese than you realize.

What you can do is ask Trang or Sysko to add a list
"ger->jpn translations needed"

They can filter the data to see exactly what sentences in German aren't linked to a sentence in Japanese.
2010-06-07 16:14
Dear Trang, dear Sysko:
Would it be possible to do this on the contribution webpage, e.g. when one selects "ger->jpn", to show example sentences in German which have no Japanese equivalent, or is this too heavy a burden for the database? As the current system shows only 15 sentences, maybe this might also reduce the complexity of the search.
2010-06-07 20:36
It's in our plans :)
...and has been for a long long time. But to be honest, I cannot tell you for sure when we can implement it. Perhaps in two weeks, perhaps in one month, perhaps in two months.

In the meantime, I can generate a list of German sentences that have no *direct* translation in Japanese. Most of them will have an indirect translation though.
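
For anyone wondering what "direct" vs. "indirect" means in terms of data, here is a rough sketch. It assumes sentences are ids with a language code and that links are unordered pairs of ids; the function names and the sample ids are made up, and this is not the actual Tatoeba query.

```python
from collections import defaultdict


def build_links(pairs):
    """Turn unordered (id, id) link pairs into a symmetric adjacency map."""
    links = defaultdict(set)
    for a, b in pairs:
        links[a].add(b)
        links[b].add(a)
    return links


def missing_direct(lang_of, links, src, dst):
    """Map each src sentence with no direct dst translation to its indirect dst ones."""
    result = {}
    for sid, lang in lang_of.items():
        if lang != src:
            continue
        direct = links.get(sid, set())
        if any(lang_of[n] == dst for n in direct):
            continue  # already has a direct translation
        # Indirect: a dst sentence linked to one of this sentence's translations.
        indirect = {m for n in direct for m in links.get(n, set())
                    if m != sid and lang_of[m] == dst}
        result[sid] = sorted(indirect)
    return result


# Made-up sample data: 1 (deu) - 2 (eng) - 3 (jpn), 4 (deu) - 5 (jpn), 6 (deu) unlinked.
lang_of = {1: "deu", 2: "eng", 3: "jpn", 4: "deu", 5: "jpn", 6: "deu"}
links = build_links([(1, 2), (2, 3), (4, 5)])
print(missing_direct(lang_of, links, "deu", "jpn"))
```

In the sample, sentence 1 has no direct Japanese translation but has sentence 3 as an indirect one (via the English sentence 2), sentence 4 is skipped because it is linked directly, and sentence 6 has nothing at all.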
2010-06-08 00:17
...perhaps in two years :P
2010-06-07 16:44
I've done some checking and there are 9570 German sentences that are not translated into Japanese. Starting with 123 and ending with 399357.

I can send you them in an email if you want.
2010-06-07 20:21
Yes, please do so, though I think it would be preferable to have a list or even a search function for such sentences.
2010-07-19 12:06
Just to say it's now possible to view all "German sentences not translated into Japanese" here (it will display indirect translations in Japanese, since that makes it a handy way to view sentences that can be linked).
2010-06-07 20:34
Here are a few things...

1) If your partner has a good level in German, he/she can practice translating from Japanese into German, and you would be checking his/her sentences.

Note that you can view someone's sentences by going to their profile, then clicking on "See this user's contribution" (below the link to send a private message). Then on the user's contributions page, you go to "See all" next to "Latest sentences" (yea it's a bit complicated, but we'll improve the profile someday). And you can select "German" to only keep German sentences.

2) Your partner can enter Japanese sentences in Tatoeba that he/she would like to know how to say in German, and you can try to translate his/her sentences. It would be good of course to first search whether the sentence already exists.

Basically, you could be chatting together and, as the conversation goes, your partner could add Japanese sentences he/she can't figure out how to say in German. Your partner would then send you the link to the sentence once it's added, and you would translate right after that. This is pretty much what most tandems do anyway, when they chat together "How do you say...?" But instead of keeping that knowledge in your private chat logs, you can share it with the rest of the world :)

It would be a good idea to also make a list out of it, so that you can keep track of the various sentences you learned.
sysko had made such a list:
It resulted from him and Dorenda chatting together in IRC.

3) Your partner can practice translating from German to Japanese, while also contributing. Because as I said, most of the German sentences do not have a DIRECT translation in Japanese, but they do have an INDIRECT translation. So your partner could take on the task of linking them.

But not just by reading and saying "okay, those translations match". I can create a deu->jpn list for him/her. He/she would then look at the German sentences from the list, and try to think of what the translation is. Then he/she can click on the sentence to view its details, and most of the time there should be an indirect Japanese translation. If that indirect translation matches what he/she thought of, then he/she can link the two sentences.

The only problem with this is that not everyone can link or unlink sentences yet. Only "trusted users" can (and even then, they can't link or unlink just anything). The link/unlink feature requires you to understand the structure of Tatoeba very well, so it's not available to everyone, because it would be more confusing than useful to new users.
But I hope to make the link/unlink feature available to everyone through the concept I just explained.

So those are the ideas I have in mind. There are certainly more creative ones, but I'd have to think harder ^^