Wall (7,114 threads)

Duplicate removal script
Could this be run soonish? I see that sentences that were duplicates 9 days ago haven't been merged yet.

I've seen that the index data log is working better now - recent changes are showing up in the right order with the actual time changed.
Could deletions be noted in the log as well?

Yes, it is planned, but not for soon... I can't guarantee it will be done before another two months.

sentences.csv
I'm looking through this file output now. There are a number of spurious '\' symbols and line feeds in it. Possibly there are odd characters in the sentence text?
The suspect sentences are 208 and 5540. One of them is owned so I couldn't try editing it.
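A minimal sketch of how such damage could be located, assuming (this is an assumption, not something confirmed in the thread) that each record in the export is one line with three tab-separated fields: id, language code and text. The function name `find_suspect_lines` is hypothetical. A stray line feed splits a record in two, so the continuation line shows up with the wrong field count.

```python
def find_suspect_lines(lines, expected_fields=3):
    """Return (line_number, reason) pairs for lines that look damaged.

    Assumes one record per line with tab-separated fields
    (id, language, text); adjust expected_fields if the export differs.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # A literal backslash in the text is reported as suspicious.
        if "\\" in text:
            suspects.append((number, "backslash"))
        # A record broken by a stray line feed leaves a line with too few fields.
        if len(text.split("\t")) != expected_fields:
            suspects.append((number, "field count"))
    return suspects


# Made-up sample records, for illustration only.
sample = [
    "1\teng\tHello.\n",
    "2\teng\tBroken \\\n",
    "sentence.\n",
    "3\tjpn\tこんにちは。\n",
]
report = find_suspect_lines(sample)
```

For a real file, the lines could be read with `open(path, encoding="utf-8")` and passed in directly.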

I looked at just one as a test ...
> 288755 He is delighted at your success. 彼女はあなたの成功を喜んでいます。
As I suspected this was not present in the last version of the Tanaka Corpus _I_ was maintaining. It was either corrected or deleted as a near duplicate. I shudder to think of how many corrections may have been 'lost' in this way. >_<

Hmm, I'm not sure what you mean... I don't think there has been anything lost in Tatoeba.
For the sentence you indicated, it does not mismatch in Tatoeba.
He is delighted at your success.
彼はあなたの成功を喜んでいます。
(cf. http://tatoeba.org/eng/sentences/show/288755)
I also checked other lines from CK's file at random. All those I tried turned out to be fine in Tatoeba; they do match.

Well if it's not my stuff, and it's not your stuff then it must be CK's stuff where the problem is.
That's fine with me. :-P

[not needed anymore- removed by CK]

I just opened that file and it has the line
A: 彼はあなたの成功を喜んでいます。 He is delighted at your success.#ID=288755
B: 彼(かれ) は 貴方(あなた)[01]{あなた} 乃{の} 成功 を 喜ぶ{喜んでいます}

I forgot to do a download at the weekend. Downloading now....

[not needed anymore- removed by CK]

I suspect there was some damage done when the Tanaka Corpus moved over to Tatoeba. A little bit of history - Tatoeba first started by adding French and other translations to the Tanaka Corpus. However the version of the Corpus they used was horribly out of date and contained very many errors.
In theory that should have all been sorted out, but these he/she errors look like something that should have been noticed and fixed long ago.
Anyway, if you send me a .zipped file I can work from that (or I could use the same sort of filters to find the same set myself). See the PM I sent you for my email address.

@blay_paul,
> I suspect there was some damage done when the Tanaka
> Corpus moved over to Tatoeba.
As far as I know, there shouldn't have been any damage. You used to send me an export of your version of the Tanaka Corpus once in a while, and I would basically replace the Tanaka sentences in Tatoeba with your data.
This was our way to make sure the Tanaka sentences in Tatoeba stayed synchronised with those in WWWJDIC.
> A little bit of history - Tatoeba first started by
> adding French and other translations to the Tanaka
> Corpus. However the version of the Corpus they used
> was horribly out of date and contained very many
> errors.
It wasn't exactly Tatoeba. It was a project undertaken by a webmaster from a French website for Japanese learning. He then gave these sentences to Tatoeba.
But it is true that they were based on an old version of the Corpus. However, I did some filtering before importing the French sentences, to import only those that would match a sentence in the latest version of the Tanaka Corpus at that moment. Out of 25,000 translations, I could only keep 18,000.

> As far as I know, there shouldn't have been any damage.
Well the only thing I have noticed for certain so far (although I don't know which side the problem originated on) is that I had split all the [F], [M], [Proverb] tags into Japanese side and English side. For some reason they are all on the English side now.
I still have the data I last worked on so I can do something about that at some point.

It looks like I was working from false assumptions.

@CK could you please avoid posting such long messages on the wall ^^ If you have a big long list like this, you can send us an email at team [a t] tatoeba dot fr; that way it will be easier for us to deal with ;-)
(And by the way, afterwards, could you delete these "flood" messages? ^^)

About fixing those he/she problems: doing it the simplest way may produce problems, with second-hand mismatches generated where languages other than just English and Japanese were involved. I've got some ideas to deal with that, if you'll trust me with the job ;-)

> There were over 15,000 mismatches with just "He" vs. "彼女は".
I got just 17 for that search on the last PD version.
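The kind of filter being discussed can be sketched as follows. This is only an illustration under stated assumptions: the function name `find_gender_mismatches` and the `(id, english, japanese)` tuple layout are hypothetical, and the actual search used by either poster is not shown in the thread. It flags pairs where the English starts with "He" but the Japanese has 彼女は (and no 彼は), or the reverse for "She".

```python
def find_gender_mismatches(pairs):
    """Return ids of English/Japanese pairs whose pronouns disagree.

    pairs: iterable of (sentence_id, english, japanese) tuples
    (a hypothetical layout chosen for this sketch).
    """
    mismatched = []
    for sentence_id, english, japanese in pairs:
        has_kare = "彼は" in japanese      # "he" as topic
        has_kanojo = "彼女は" in japanese  # "she" as topic
        if english.startswith("He ") and has_kanojo and not has_kare:
            mismatched.append(sentence_id)
        elif english.startswith("She ") and has_kare and not has_kanojo:
            mismatched.append(sentence_id)
    return mismatched


# The two versions of sentence 288755 quoted earlier in this thread.
quoted = [
    (288755, "He is delighted at your success.", "彼女はあなたの成功を喜んでいます。"),
]
corrected = [
    (288755, "He is delighted at your success.", "彼はあなたの成功を喜んでいます。"),
]
```

A prefix check like this is deliberately crude; it misses sentences where the pronoun is not at the start, so a hit list would still need checking by hand.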

OK, I have now fixed the 56 or so ones I found from the sentence pairs used in WWWJDIC. I suspect most of the other 20,000 odd were sentences removed or corrected with the Tanaka Corpus used in WWWJDIC but inadvertently re-introduced via Tatoeba.
So, please do not do any automatic updates to any Japanese sentences with index data or to any English sentences pointed to by the 'meaning' fields.
All the rest are fair game - and probably most of them should be deleted. :-P

'Translations' from English to English.
I don't think I've had an official policy statement on this, so what is the position on 'translating' between the _same_ language?
I think it is inadvisable and would suggest that there be a warning put up about it and a database search done to create a list of problematic links.

The question has already been asked, and the answer is "yes we can",
because some sentences can have exactly the same meaning.
However, if the two sentences differ because one is "old" English and one is "youth" English, then they should not be linked.
Later on, we plan to have "qualified" links, which means you will be able to link sentences with a "translation" link, a "synonym" link, an "antonym" link, etc.
With this you will be able to link the "old" English and "youth" English versions of a sentence.
So for the moment, add an English translation to another English sentence ONLY if the 2 sentences can be translated by the same sentences (you see what I mean).

I think that should be noted prominently somewhere like
http://blog.tatoeba.org/2010/02...n-tatoeba.html

Yep you're right

I think it's a reasonable thing to do to explain archaic language and idiomatic expressions.
I've sometimes added alternative translations for such sentences in the English sentences needing confirmation list, though not by linking together the English sentences. I don't see any reason why two same-language sentences should not be linked together if they have the same meaning, though.

> I don't see any reason why two same-language
> sentences should not be linked together if
> they have the same meaning, though.
I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
On a more practical level it can lead to direct translations being shown as if they were indirect, and indirect translations not showing up at all.
Suppose you have three sentences:
A (Japanese), B (English), C (English)
Assume that B is the translation of A and B and C have the same meaning.
If that is stored as
A <----> B <----> C
then C will show up as an indirect translation of A.
If it is stored as
A <----> B
\----> C
Then B and C will both be visible as direct translations of A.
If you then had a French sentence, D, as a translation of C then in the first pattern it would not be visible from the Japanese sentence at all.
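The difference between the two ways of storing A, B and C can be sketched in a few lines. This assumes links are undirected, that "direct" means one link away and "indirect" means exactly two links away; the function names are hypothetical and this is not Tatoeba's actual implementation.

```python
from collections import defaultdict


def build_links(pairs):
    """Build an undirected link table from (sentence, sentence) pairs."""
    links = defaultdict(set)
    for first, second in pairs:
        links[first].add(second)
        links[second].add(first)
    return links


def translations(links, sentence):
    """Return (direct, indirect) translation sets for a sentence."""
    direct = set(links.get(sentence, set()))
    indirect = set()
    for neighbour in direct:
        indirect |= links.get(neighbour, set())
    # Anything already direct, and the sentence itself, is not indirect.
    indirect -= direct
    indirect.discard(sentence)
    return direct, indirect


# Pattern 1: A <-> B <-> C
chain = build_links([("A", "B"), ("B", "C")])
# Pattern 2: A <-> B and A <-> C
star = build_links([("A", "B"), ("A", "C")])

chain_result = translations(chain, "A")  # ({"B"}, {"C"})
star_result = translations(star, "A")    # ({"B", "C"}, set())
```

With the chain, C is only reachable from A through B, so it is reported as indirect; with both links stored, B and C are both direct.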

> I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
How about if it had been a "language learning resource"? :)
Both the cases in your example are already commonly seen for sentences in different languages, with the problems you state, and should be solved by linking together all three.

> and should be solved by linking together all three.
That just means that if someone gets it wrong first then somebody else has to go round fixing the problem later. Of course it won't be possible to guarantee that it is _always_ the right thing to do because it is quite possible to have an
A, B and C
where
B (English) is a translation of A (Japanese),
C (English) can have the same meaning as B (English),
but
C is not a translation of A
(Repeat for other languages as necessary).
I just think same-language linking is an unnecessary added layer of complexity that will come back to bite us (and particularly those* working on things at a database level) later.
* i.e. me.

Again, I don't see how that's any different when B and C are different languages.

[not needed anymore- removed by CK]

> It's hard enough to learn a foreign language without
> complicating it by an over exposure to archaic language.
I am certainly in favour of having the simplest English translation being the one displayed in WWWJDIC. I just don't want the difficult-but-valid stuff deleted. The problem with that is the question of what the users are trying to learn. Simply put, if you take out the difficult language then they will have no opportunity to learn the hard stuff.
If someone comes across a confusing word or phrase in a foreign language they're going to want to find that word/phrase - not be left hanging by an over-sanitized 'learners' resource.

On a personal note, if you happen to be widely read - particularly in works of historical and/or fantasy fiction - then you're likely to have a much richer vocabulary in this area than otherwise.

[not needed anymore- removed by CK]

I don't get this. The other languages _do_ show while you're typing in a translation. Are you sure you're not
http://content.pyzam.com/funnyp...ngItWrong6.jpg
?

I see the same behaviour. When you click on the "あ->a" image it turns the list of translations into a form to enter the new translation.
Personally, I don't mind this behaviour. The idea is to translate that one sentence, which may or may not correspond to the other translations. I do admit that I often read the other translations to better understand the context.

Oh, now I'm with you! Yes, when you're translating the only sentence you should really be paying attention to is the one you are translating from.
If you don't understand the sentence you're working from you shouldn't be translating it in the first place.

I don't think CK meant he didn't understand the question, but just that he wanted to make the translation better match the others.

In fact they were hidden on purpose, because previously people very often didn't understand that there is a "main sentence" and that the others are just translations, so most of the time they were translating one of the translations instead of the main one. So hiding them + a message in red was the simplest way we have found so far to make it more obvious.

Hello. I have a problem. I want to search for sentences in Japanese, but the search function seems to have changed. I used to type in 沸かす for example, to see exactly when it is used completely as "wakasu" - not wakashite, not the 沸 kanji itself - in other words I could do exact searches.
Right now, when I search for a number of characters together the engine seems to give me mixed results... Is this an issue or is this the way it is going to be from now on? I think it completely destroys the aim here, and I'd love to know if there is a way to make exact searches in Japanese.
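The behaviour being asked for, an exact search, amounts to a plain substring match on the whole query. A minimal sketch follows; the function name `exact_matches` is hypothetical, the sample sentences are made up for illustration, and nothing here describes how the site's search engine actually works.

```python
def exact_matches(sentences, query):
    """Return only the sentences containing the query as one unbroken string."""
    return [sentence for sentence in sentences if query in sentence]


# Made-up sample sentences, for illustration only.
samples = [
    "お湯を沸かす。",        # contains 沸かす exactly
    "お湯を沸かして下さい。",  # contains 沸かし, not 沸かす
    "水の沸点は百度だ。",     # contains only the kanji 沸
]
hits = exact_matches(samples, "沸かす")
```

Only the first sample is returned; an engine that matches on individual characters would return all three, which would explain the mixed results.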

This is an issue: we are switching from one engine to another. We will try to fix this asap.

Oh great - I thought it would be permanent. Great to hear this!

ENAMDICT ?
What about including names in the index information? I know Jim isn't going to want it in the output he uses - but I suppose it could be stripped from the WWWJDIC.csv file.
Possible advantages include
* Ability to link to ENAMDICT from sentences (when linking from words in sentences is developed).
* Might be useful in checking MeCab output.

New icon?
What does the blue circle with an 'i' in it do?

It's actually not so new. You can also see it from here:
http://tatoeba.org/sentences/my_sentences
We used to display it in the sentences menu, so that people could go to the standard display (in Browse). Then we took it out and made the text of the sentence a link instead.
But then there was a problem: when you adopted a sentence, you couldn't browse to it, because it would be editable (clicking on it would display the input field to edit the sentence).
So we added back this little icon, so that you can browse to the sentence even when it belongs to you.