Wall (7,175 threads)
Tips
Before asking a question, be sure to read the Frequently Asked Questions page.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

I looked at just one as a test ...
> 288755 He is delighted at your success. 彼女はあなたの成功を喜んでいます。
As I suspected this was not present in the last version of the Tanaka Corpus _I_ was maintaining. It was either corrected or deleted as a near duplicate. I shudder to think of how many corrections may have been 'lost' in this way. >_<

Hmm, I'm not sure what you mean... I don't think there has been anything lost in Tatoeba.
For the sentence you indicated, it does not mismatch in Tatoeba.
He is delighted at your success.
彼はあなたの成功を喜んでいます。
(cf. http://tatoeba.org/eng/sentences/show/288755)
I also checked other lines from CK's file at random. All those I tried turned out to be safe in Tatoeba; they do match.

Well if it's not my stuff, and it's not your stuff then it must be CK's stuff where the problem is.
That's fine with me. :-P

[not needed anymore- removed by CK]

I just opened that file and it has the line
A: 彼はあなたの成功を喜んでいます。 He is delighted at your success.#ID=288755
B: 彼(かれ) は 貴方(あなた)[01]{あなた} 乃{の} 成功 を 喜ぶ{喜んでいます}

I forgot to do a download at the weekend. Downloading now....

[not needed anymore- removed by CK]

I suspect there was some damage done when the Tanaka Corpus moved over to Tatoeba. A little bit of history - Tatoeba first started by adding French and other translations to the Tanaka Corpus. However the version of the Corpus they used was horribly out of date and contained very many errors.
In theory that should have all been sorted out, but these he/she errors look like something that should have been noticed and fixed long ago.
Anyway, if you send me a .zipped file I can work from that (or I could use the same sort of filters to find the same set myself). See the PM I sent you for my email address.

@blay_paul,
> I suspect there was some damage done when the Tanaka
> Corpus moved over to Tatoeba.
As far as I know, there shouldn't have been any damage. You used to send me an export of your version of the Tanaka Corpus once in a while, and I would basically replace the Tanaka sentences in Tatoeba with your data.
This was our way to make sure the Tanaka sentences in Tatoeba stayed synchronised with those in WWWJDIC.
> A little bit of history - Tatoeba first started by
> adding French and other translations to the Tanaka
> Corpus. However the version of the Corpus they used
> was horribly out of date and contained very many
> errors.
It wasn't exactly Tatoeba. It was a project undertaken by a webmaster from a French website for Japanese learning. He then gave these sentences to Tatoeba.
But it is true that they were based on an old version of the Corpus. However, I did some filtering before importing the French sentences, to import only those that would match a sentence in the latest version of the Tanaka Corpus at that moment. Out of 25,000 translations, I could only keep 18,000.

> As far as I know, there shouldn't have been any damage.
Well the only thing I have noticed for certain so far (although I don't know which side the problem originated on) is that I had split all the [F], [M], [Proverb] tags into Japanese side and English side. For some reason they are all on the English side now.
I still have the data I last worked on so I can do something about that at some point.

It looks like I was working from false assumptions.

@CK, could you please avoid posting such long messages on the Wall? ^^ If you have a long list like this, you can send us an email at team [a t] tatoeba dot fr; that way it will be easier for us to deal with it ;-)
(And by the way, could you delete these "flood" messages afterwards? ^^)

About fixing those he/she problems: doing it the simplest way may produce problems with second-hand mismatches generated where languages other than just English and Japanese were involved. I've got some ideas to deal with that, if you'll trust me with the job ;-)

> There were over 15,000 mismatches with just "He" vs. "彼女は".
I got just 17 for that search on the last PD version.
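A check of this kind can be sketched roughly as follows. This is only an illustration, not the filter either of us actually ran: the function name `find_pronoun_mismatches` and the in-memory list of pairs are hypothetical, and the sample data is the two versions of sentence 288755 quoted earlier in this thread.

```python
def find_pronoun_mismatches(pairs):
    """Return IDs of pairs whose English side starts with "He " while the
    Japanese side starts with 彼女は ("she"), or "She " with 彼は ("he")."""
    mismatches = []
    for sentence_id, english, japanese in pairs:
        he_vs_she = english.startswith("He ") and japanese.startswith("彼女は")
        she_vs_he = english.startswith("She ") and japanese.startswith("彼は")
        if he_vs_she or she_vs_he:
            mismatches.append(sentence_id)
    return mismatches


# The mismatched version and the matching version of the same sentence.
pairs = [
    (288755, "He is delighted at your success.", "彼女はあなたの成功を喜んでいます。"),
    (288755, "He is delighted at your success.", "彼はあなたの成功を喜んでいます。"),
]
print(find_pronoun_mismatches(pairs))
```

Only the first pair is reported. A prefix test like this misses sentences where the pronoun is not the first word, so the counts it gives are a lower bound.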

OK, I have now fixed the 56 or so ones I found from the sentence pairs used in WWWJDIC. I suspect most of the other 20,000 odd were sentences removed or corrected with the Tanaka Corpus used in WWWJDIC but inadvertently re-introduced via Tatoeba.
So, please do not do any automatic updates to any Japanese sentences with index data or to any English sentences pointed to by the 'meaning' fields.
All the rest are fair game - and probably most of them should be deleted. :-P

'Translations' from English to English.
I don't think I've had an official policy statement on this, so what is the position on 'translating' between the _same_ language?
I think it is inadvisable and would suggest that there be a warning put up about it and a database search done to create a list of problematic links.

The question has already been asked, and the answer is "yes we can",
because some sentences can have exactly the same meaning.
However, if the two sentences differ because one is "old" English and the other is "youth" English, then they should not be linked.
In the future, we plan to have "qualified" links, which means you will be able to link sentences with a "translation" link, a "synonym" link, an "antonym" link, etc.
With this you will be able to link the "old" English and "youth" English versions of a sentence.
So for the moment, add an English translation to another English sentence ONLY if the two sentences can be translated by the same sentences (you see what I mean).

I think that should be noted prominently somewhere like
http://blog.tatoeba.org/2010/02...n-tatoeba.html

Yep you're right

I think it's a reasonable thing to do to explain archaic language and idiomatic expressions.
I've sometimes added alternative translations for such sentences in the English sentences needing confirmation list, though not by linking together the English sentences. I don't see any reason why two same-language sentences should not be linked together if they have the same meaning, though.

> I don't see any reason why two same-language
> sentences should not be linked together if
> they have the same meaning, though.
I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
On a more practical level it can lead to direct translations being shown as if they were indirect, and indirect translations not showing up at all.
Suppose you have three sentences:
A (Japanese), B (English), C (English)
Assume that B is the translation of A and B and C have the same meaning.
If that is stored as
A <----> B <----> C
then C will show up as an indirect translation of A.
If it is stored as
A <----> B
\----> C
Then B and C will both be visible as direct translations of A.
If you then had a French sentence, D, as a translation of C then in the second pattern it would not be visible from the Japanese sentence at all.
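The two storage patterns for A, B and C can be sketched as a small link graph. This is an illustrative model only, not Tatoeba's actual schema: it assumes links are symmetric, that a direct translation is one link away, and that an indirect translation is exactly two links away. The helper names are hypothetical.

```python
from collections import defaultdict


def build_links(edges):
    """Build a symmetric adjacency map from (sentence, sentence) links."""
    links = defaultdict(set)
    for a, b in edges:
        links[a].add(b)
        links[b].add(a)
    return links


def direct(links, start):
    """Sentences linked directly to `start`."""
    return set(links[start])


def indirect(links, start):
    """Sentences exactly two links away from `start`."""
    first = direct(links, start)
    second = set()
    for sentence in first:
        second |= links[sentence]
    return second - first - {start}


# First pattern: A <----> B <----> C
chain = build_links([("A", "B"), ("B", "C")])
# Second pattern: A linked to both B and C
star = build_links([("A", "B"), ("A", "C")])

print(sorted(direct(chain, "A")), sorted(indirect(chain, "A")))
print(sorted(direct(star, "A")), sorted(indirect(star, "A")))
```

Under these assumptions the first pattern shows C only as an indirect translation of A, while the second shows both B and C as direct.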

> I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
How about if it had been a "language learning resource"? :)
Both the cases in your example are already commonly seen for sentences in different languages, with the problems you state, and should be solved by linking together all three.

> and should be solved by linking together all three.
That just means that if someone gets it wrong first then somebody else has to go round fixing the problem later. Of course it won't be possible to guarantee that it is _always_ the right thing to do because it is quite possible to have an
A, B and C
where
B (English) is a translation of A (Japanese),
C (English) can have the same meaning as B (English),
but
C is not a translation of A
(Repeat for other languages as necessary).
I just think same-language linking is an unnecessary added layer of complexity that will come back to bite us (and particularly those* working on things at a database level) later.
* i.e. me.

Again, I don't see how that's any different when B and C are different languages.

[not needed anymore- removed by CK]

> It's hard enough to learn a foreign language without
> complicating it by an over exposure to archaic language.
I am certainly in favour of having the simplest English translation being the one displayed in WWWJDIC. I just don't want the difficult-but-valid stuff deleted. The problem with that is the question of what the users are trying to learn. Simply put, if you take out the difficult language then they will have no opportunity to learn the hard stuff.
If someone comes across a confusing word or phrase in a foreign language they're going to want to find that word/phrase - not be left hanging by an over-sanitized 'learners' resource.

On a personal note, if you happen to be widely read - particularly in works of historical and/or fantasy fiction - then you're likely to have a much richer vocabulary in this area than otherwise.

[not needed anymore- removed by CK]

I don't get this. The other languages _do_ show while you're typing in a translation. Are you sure you're not
http://content.pyzam.com/funnyp...ngItWrong6.jpg
?

I see the same behaviour. When you click on the "あ->a" image it turns the list of translations into a form to enter the new translation.
Personally, I don't mind this behaviour. The idea is to translate that one sentence, which may or may not correspond to the other translations. I do admit that I often read the other translations to better understand the context.

Oh, now I'm with you! Yes, when you're translating the only sentence you should really be paying attention to is the one you are translating from.
If you don't understand the sentence you're working from you shouldn't be translating it in the first place.

I don't think CK meant he didn't understand the question, but just that he wanted to make the translation better match the others.

In fact it was hidden on purpose: before, people very often didn't understand that there is a "main sentence" and that the others are just translations, so most of the time they were translating one of the translations instead of the main one. Hiding the translations and adding the message in red was the simplest way we have found so far to make this more obvious.

Hello. I have a problem. I want to search for sentences in Japanese, but the search function seems to have changed. I used to type in 沸かす for example, to see exactly when it is used completely as "wakasu" - not wakashite, not the 沸 kanji itself - in other words I could do exact searches.
Right now, when I search for a number of characters together the engine seems to give me mixed results... Is this an issue or is this the way it is going to be from now on? I think it completely destroys the aim here, and I'd love to know if there is a way to make exact searches in Japanese.

This is an issue: we're switching from one search engine to another. We will try to fix it ASAP.

Oh great - I thought it would be permanent. Great to hear this!

ENAMDICT ?
What about including names in the index information? I know Jim isn't going to want it in the output he uses - but I suppose it could be stripped from the WWWJDIC.csv file.
Possible advantages include
* Ability to link to ENAMDICT from sentences (when linking from words in sentences is developed).
* Might be useful in checking MeCab output.

New icon?
What's the blue circle with an 'i' in it do?

It's actually not so new. You can also see it from here:
http://tatoeba.org/sentences/my_sentences
We used to display it in the sentences menu, so that people could go to the standard display (in Browse). Then we took it out and made the text of the sentence a link instead.
But then there was a problem when you adopted a sentence, you couldn't browse to the sentence because it would be editable (clicking on it would display the input field to edit the sentence).
So we added back this little icon, so that you can browse to the sentence even when it belongs to you.

Suggestion box
Just thought of a quick idea for comments / wall postings.
Some sort of BBCode or tag system to mark sentence numbers in comment text so they can be turned automatically into links.
e.g. "See also ##250165"
would turn into
"See also <a href=tatoeba.org/eng/sentences/show/250165>250165</a>"

Yes, it's something I've been thinking about as well... But the question is, what format to use?
As far as I'm concerned, I'd tend to write it with only one #.
"See also #250165"
But perhaps it's too simple and there can be issues with this, I don't know.

How about using the "nº"? It will hardly be used for anything else and is available on every page by the sentence number, ready to be copied along with it.

Except that there is no "nº" displayed on 'the Wall' so it would only really work for comments there, not posts here as well.

Why would there need to be one displayed on the Wall? It's where you look up the sentence number.
... well, unless you have it memorised.

Sentence numbers are also included in the csv export files. (Which I often work from)
OK, I'll admit it - I just don't like nº's.

Actually, perhaps it's just easiest to go with something dead-easy. The #<number> is nicer than the ##<number> unless the latter markup were trimmed somehow, but that would make the usage unclear to new users.
The occasional false-positive wouldn't be obtrusive and well balanced out by getting people to drop the use of the hash character in their writing.
Oh, and if we're building a bike-shed, I want it red and will fight anyone who thinks otherwise tooth and nail!

I think ## is better

I also recommend ## rather than #, because you will be less likely to get false positives.

A format I've seen other places is sentence #123, which is convenient because you can generalize it to list #123, wall #123, etc. Sentence links may be common enough to warrant a shorter special syntax, though.
I also think the syntax should be kept in the final post, e.g. if you use ##, it should turn into <a href=tatoeba.org/eng/sentences/show/123>##123</a>, so new users can immediately tell how it's done. Using the format other places, like in the message logs, also helps discoverability.
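The markup discussed in this thread could be sketched along these lines. It is only a rough illustration, not how the site implements (or will implement) it: the `##` marker and the choice to keep it in the link text follow the suggestions above, the URL shape is taken from the sentence links posted on the Wall, and the names `SENTENCE_REF` and `link_sentence_refs` are made up.

```python
import re

# Hypothetical pattern: exactly two "#" followed by a run of digits.
SENTENCE_REF = re.compile(r"(?<!#)##(\d+)\b")


def link_sentence_refs(text):
    """Turn each ##<number> into a link, keeping the ## in the link text."""
    return SENTENCE_REF.sub(
        r'<a href="http://tatoeba.org/eng/sentences/show/\1">##\1</a>', text
    )


print(link_sentence_refs("See also ##250165"))
# A single "#" is left alone, which avoids most false positives.
print(link_sentence_refs("Item #3 in C# is unchanged"))
```

Requiring two hash characters means ordinary uses such as "#3" or "C#" pass through untouched, which is the false-positive argument made above for `##` over `#`.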

I was just about to suggest the same thing. This would be immensely useful.

I've been wondering how best to make notes of genders when translating from languages such as English that don't always differentiate between the different genders.
So far I've just been using the comments, but I figure this might eventually go into metadata. There are several interesting ways to tackle that, but in case it might help me pick what to put in the comments and how to form it, has it been decided how issues like these will be tackled? Should I perhaps just be adding an extra sentence for each variation?

I think this is one of a number of things that are 'on the backburner'. I don't think there's likely to be much done about it for some time.
I wouldn't go the route of adding extra sentences as that would just produce needless duplication of content.
I think what we need is
* Meta-data that is not included as annotations, but in a separate field.
* A method of showing / hiding meta data associated with a sentence.
* A format for entering the metadata, and translations for them (at least the most common ones).

Extra sentences wouldn't actually be needless duplication. They're essentially equally valid (though partially identical) sentences that happen to translate to the same sentence in at least one language. The main reason I haven't created them is that I'm hoping for a much more elegant solution later on.

Look, there are 151,909 Japanese sentences. Each could have a feminine and a masculine variant. Each could be plain form or (-masu) form, and many would also have one or more extra polite versions. That would take you up to over 600,000 sentences (and probably give Trang a heart attack). If that isn't needless duplication I don't know what you'd call it.

I'd call it unhelpful.
Genders in Japanese are simple in that they don't affect conjugation. In other languages they do.
Take for example the Icelandic phrases:
o Hann var keyptur/seldur/gefinn. Þeir voru keyptir/seldir/gefnir.
o Hún var keypt/seld/gefin. Þær voru keyptar/seldar/gefnar.
o Það var keypt/selt/gefið. Þau voru keypt/seld/gefin.
that translate to English as:
o {{He, she, it} was, they were} bought/sold/given.
As you see they are highly irregular and therefore a student of the language would be well served with examples of each.
The only problem I see with adding all these sentences lies in storing their relationships to each other. Cluttering the search results or translations overview with multiple translations for every Finnish pronoun is just silly. Such translations wouldn't be unnecessary, but presented that way they wouldn't ultimately be helpful, which is why metadata would really come in handy.
As for proliferation of sentences, I'd imagine they'd give Trang great joy -- provided that they're useful, of course. That's the point of the project, after all.
At 500 sentences a week, I'll hit 600,000 in 23 years. :-)
The community at large has, however, contributed an average of 324 sentences a day since the end of the late March slump, which would add that many sentences in a little over five years, bringing the total number to 974,851 sentences.
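For anyone who wants to check the arithmetic, here is a quick sketch. The figures are the ones quoted in this post; the variable names, and the use of 52 weeks and 365 days per year, are my own simplifications.

```python
TARGET = 600_000        # extra sentences under discussion
PER_WEEK = 500          # one contributor's pace
PER_DAY = 324           # community average quoted above
FINAL_TOTAL = 974_851   # total quoted above

years_solo = TARGET / PER_WEEK / 52
years_community = TARGET / PER_DAY / 365
implied_current_total = FINAL_TOTAL - TARGET

print(round(years_solo, 1))       # 23.1
print(round(years_community, 1))  # 5.1
print(implied_current_total)      # 374851
```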

ugh...I've got the same issue in Arabic, sure other languages do have it to. What I did was keep adding different conjugations to different sentences until I felt like I covered them all, and if I did add another conjugation to the same base sentence, I make sure to use different synonyms and wordings to say the same idea.
Nonetheless, I do agree that some kind of metadata to indicate conjugation (or any other nuances) in the future would be nice.

*too

As for the metadata... I believe all that metadata should be attached to the original sentence, not to the translated one. I.e., [M] and [F] should be attached not to the Finnish sentence (as they currently are), but to the English sentence that has the gender distinction (he/she). The Finnish should go unmarked.
Speaking of metadata, I believe we also need metadata for transcription, to fix it when it cannot be correctly generated automatically.
And also author and origin information for some sentences would be nice to have.
And a simple possibility to add IPA to sentences in any language would also be nice, though I'm not sure if this is really necessary.

I'm not quite sure what you mean by translated versus original sentences. Surely the idea is that they're equivalent. I do agree with you that the metadata should be associated with the more specific language.
It would be nice if one could define, for example, the genders of pronouns and contribute variations on these. That way, if one came across a Finnish sentence such as nº354807:
He väittivät, että hän tappoi hänet.
one could choose from the various Icelandic versions:
{Þeir, Þær, Þau} kváðu {hann, hana} hafa drepið {hann, hana}.
though implementing that could be a bit tricky.
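As a rough idea of the expansion step only, a template written in the brace notation above could be turned into its concrete versions like this. It is a sketch under assumptions: the function name `expand_variants` is hypothetical, it handles only flat (non-nested) groups, and it does nothing about recording which variant matches which gender.

```python
import itertools
import re


def expand_variants(template):
    """Expand every {a, b, c} group in `template` into all combinations."""
    # With one capture group, re.split alternates literal text (even indexes)
    # and the contents of each {...} group (odd indexes).
    parts = re.split(r"\{([^{}]*)\}", template)
    choices = [
        [option.strip() for option in part.split(",")] if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = expand_variants(
    "{Þeir, Þær, Þau} kváðu {hann, hana} hafa drepið {hann, hana}."
)
print(len(variants))  # 3 * 2 * 2 = 12
print(variants[0])
```

Twelve Icelandic sentences for one Finnish sentence shows how quickly the variants multiply, which is where the storage question comes in.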
I reckon it might be wiser to focus on improving the automatic generation algorithms and encourage people to report errors than adding comments on the readings.
As more people are able to record their voices than transcribe into IPA, these are probably largely unnecessary here on this project.
It might however be useful to get readings for symbols such as numbers. I tend to write these out rather than use numerals.

> Surely the idea is that they're equivalent.
Hmm... I thought that metadata is to be added by a translator to explain a sentence containing information that was lost or added during translation; that’s why I spoke of the original sentence.
It shouldn’t be obvious from a sentence that it’s the original/a translation, of course.
> I reckon it might be wiser to focus on improving the automatic
> generation algorithms and encourage people to report errors than
> adding comments on the readings.
I do agree that improving algorithms is important. But there are situations when it’s impossible to generate transcription automatically because it requires complex grammar analysis or even understanding of the situation in which the sentence can be said.
Moreover, adding metadata for transcription will help to easily find problems with our current transcription algorithm. If people write this in comments, they’ll get lost. If a special field is designated for transcription, these can easily be found with a DB search by those who improve the algorithm.
> As more people are able to record their voices than transcribe
> into IPA, these are probably largely unnecessary here on this project.
I’m not sure about this. Most people have at least some understanding of transcription and IPA because it is used at school.
The problem with voice files is the high quality required. As for me, it’s certainly easier to transcribe something than to buy a microphone and learn how to use it.

OK, now I understand what you mean by "original" sentence. Personally I'm not quite sure where the information would best be stored. I've been sort of leaning towards only storing information about added information (i.e. more specific sentences). One of the problems with the information is that a sentence may be connected to several others and indicating what metadata refers to which "original" sentence might be difficult.
There are several solutions to this problem, but the ones I've been thinking of may well be too complicated. Frankly, the current situation doesn't really bother me all that much.
While we're calling the solution different things, I completely agree with you on the transcription issue.
On the IPA, I'm not convinced of its usefulness, but if you are, there certainly will be others who'd agree and benefit from it.

> I'm not quite sure what you mean by translated
> versus original sentences. Surely the idea is that
> they're equivalent.
Yeah, that's the _theory_, but it often isn't the practice. To take one obvious example, if one of the sentences is a quote then the original source was only in _one_ of the languages - all the rest must be translations.

Japanese index update
I've sent in an update to 716 records for the Japanese index data to the team@tatoeba.fr address - hope you can sneak them in. ;-)

I have been amending the indices in situ in a few places, so I hope we
don't overlap. Perhaps we need an RCS of some sort.

If there was a 'last changed' date field (to the nearest day) the SQL on the update could be made to avoid changing entries last changed past a certain day.
The risk of overlap isn't going to be huge though, so anything too complicated or tricky to implement may actually reduce work efficiency.
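Something like the following is what I have in mind, sketched with an in-memory SQLite database. The table name `jpn_index`, its columns, and the dates are all invented for the example; the real schema will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jpn_index (sentence_id INTEGER PRIMARY KEY, "
    "index_text TEXT, last_changed TEXT)"
)
conn.executemany(
    "INSERT INTO jpn_index VALUES (?, ?, ?)",
    [
        (1, "old index 1", "2010-03-01"),     # untouched since the export
        (2, "edited in situ", "2010-04-20"),  # changed on the site afterwards
    ],
)

export_date = "2010-04-01"  # day the offline copy was taken (made-up value)
updates = [("new index 1", 1), ("new index 2", 2)]

# Apply an update only if the row has not been changed after the export date.
# ISO dates stored as text compare correctly as plain strings.
conn.executemany(
    "UPDATE jpn_index SET index_text = ?, last_changed = date('now') "
    "WHERE sentence_id = ? AND last_changed <= ?",
    [(text, sentence_id, export_date) for text, sentence_id in updates],
)

rows = conn.execute(
    "SELECT sentence_id, index_text FROM jpn_index ORDER BY sentence_id"
).fetchall()
print(rows)
```

Row 1 takes the update, while row 2 keeps the change made in situ because its date is later than the export.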