Wall (7,273 threads)
'Translations' from English to English.
I don't think I've had an official policy statement on this, so what is the position on 'translating' between the _same_ language?
I think it is inadvisable and would suggest that there be a warning put up about it and a database search done to create a list of problematic links.
The question has already been asked, and the answer is "yes, we can",
because some sentences can have exactly the same meaning.
However, if the two sentences differ because one is "old" English and the other is "youth" English, then they should not be linked.
In the future, we plan to have "qualified" links, which means you will be able to link sentences with a "translation" link, a "synonym" link, an "antonym" link, etc.,
and with this you will be able to link an "old" English and a "youth" English version of a sentence.
So for the moment, add an English translation to another English sentence ONLY if the two sentences can be translated by the same sentences (you see what I mean).
I think that should be noted prominently somewhere like
http://blog.tatoeba.org/2010/02...n-tatoeba.html
Yep you're right
I think it's a reasonable thing to do to explain archaic language and idiomatic expressions.
I've sometimes added alternative translations for such sentences in the English sentences needing confirmation list, though not by linking together the English sentences. I don't see any reason why two same-language sentences should not be linked together if they have the same meaning, though.
> I don't see any reason why two same-language
> sentences should not be linked together if
> they have the same meaning, though.
I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
On a more practical level it can lead to direct translations being shown as if they were indirect, and indirect translations not showing up at all.
Suppose you have three sentences:
A (Japanese), B (English), C (English)
Assume that B is the translation of A and B and C have the same meaning.
If that is stored as
A <----> B <----> C
then C will show up as an indirect translation of A.
If it is stored as
A <----> B
\----> C
Then B and C will both be visible as direct translations of A.
If you then had a French sentence, D, as a translation of C then in the second pattern it would not be visible from the Japanese sentence at all.
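The two storage patterns for A, B and C can be sketched in a few lines of Python. This is only an illustration, not how Tatoeba stores its links: the `build_links` and `translations` helpers are hypothetical, and it assumes "direct" means one link away and "indirect" means exactly two links away.

```python
from collections import defaultdict


def build_links(pairs):
    """Store each link in both directions, as the <----> arrows suggest."""
    links = defaultdict(set)
    for a, b in pairs:
        links[a].add(b)
        links[b].add(a)
    return links


def translations(links, sentence):
    """Return (direct, indirect) translations of `sentence`.

    Assumption: direct = one link away, indirect = two links away.
    """
    direct = set(links[sentence])
    indirect = set()
    for neighbour in direct:
        indirect |= links[neighbour]
    # A sentence is never its own translation, and direct links win.
    indirect -= direct | {sentence}
    return direct, indirect


# Pattern 1: A <----> B <----> C
chain = build_links([("A", "B"), ("B", "C")])
# Pattern 2: A <----> B and A <----> C
star = build_links([("A", "B"), ("A", "C")])

print(translations(chain, "A"))  # B is direct, C is only indirect
print(translations(star, "A"))   # B and C are both direct
```

Under these assumptions, the same three sentences are displayed differently depending only on which pair happened to be linked first.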
> I dislike that on principle because Tatoeba is a collection of translations, not a collection of explanations.
How about if it had been a "language learning resource"? :)
Both the cases in your example are already commonly seen for sentences in different languages, with the problems you state, and should be solved by linking together all three.
> and should be solved by linking together all three.
That just means that if someone gets it wrong first then somebody else has to go round fixing the problem later. Of course it won't be possible to guarantee that it is _always_ the right thing to do because it is quite possible to have an
A, B and C
where
B (English) is a translation of A (Japanese),
C (English) can have the same meaning as B (English),
but
C is not a translation of A
(Repeat for other languages as necessary).
I just think same-language linking is an unnecessary added layer of complexity that will come back to bite us (and particularly those* working on things at a database level) later.
* i.e. me.
Again, I don't see how that's any different when B and C are different languages.
[not needed anymore- removed by CK]
> It's hard enough to learn a foreign language without
> complicating it by an over exposure to archaic language.
I am certainly in favour of having the simplest English translation being the one displayed in WWWJDIC. I just don't want the difficult-but-valid stuff deleted. The problem with that is the question of what the users are trying to learn. Simply put, if you take out the difficult language then they will have no opportunity to learn the hard stuff.
If someone comes across a confusing word or phrase in a foreign language they're going to want to find that word/phrase - not be left hanging by an over-sanitized 'learners' resource.
On a personal note, if you happen to be widely read - particularly in works of historical and/or fantasy fiction - then you're likely to have a much richer vocabulary in this area than otherwise.
[not needed anymore- removed by CK]
I don't get this. The other languages _do_ show while you're typing in a translation. Are you sure you're not
http://content.pyzam.com/funnyp...ngItWrong6.jpg
?
I see the same behaviour. When you click on the "あ->a" image it turns the list of translations into a form to enter the new translation.
Personally, I don't mind this behaviour. The idea is to translate that one sentence, which may or may not correspond to the other translations. I do admit that I often read the other translations to better understand the context.
Oh, now I'm with you! Yes, when you're translating the only sentence you should really be paying attention to is the one you are translating from.
If you don't understand the sentence you're working from you shouldn't be translating it in the first place.
I don't think CK meant he didn't understand the question, but just that he wanted to make the translation better match the others.
In fact it was hidden on purpose, because people often didn't understand that there is a "main sentence" and that the others are just translations, so most of the time they were translating one of the translations instead of the main one. Hiding them and showing a message in red was the simplest way we have found so far to make this more obvious.
Hello. I have a problem. I want to search for sentences in Japanese, but the search function seems to have changed. I used to type in 沸かす for example, to see exactly when it is used completely as "wakasu" - not wakashite, not the 沸 kanji itself - in other words I could do exact searches.
Right now, when I search for a number of characters together the engine seems to give me mixed results... Is this an issue or is this the way it is going to be from now on? I think it completely destroys the aim here, and I'd love to know if there is a way to make exact searches in Japanese.
This is an issue: we are switching from one search engine to another. We will try to fix this ASAP.
Oh great - I thought it would be permanent. Great to hear this!
ENAMDICT ?
What about including names in the index information? I know Jim isn't going to want it in the output he uses - but I suppose it could be stripped from the WWWJDIC.csv file.
Possible advantages include
* Ability to link to ENAMDICT from sentences (when linking from words in sentences is developed).
* Might be useful in checking MeCab output.
New icon?
What's the blue circle with an 'i' in it do?
It's actually not so new. You can also see it from here:
http://tatoeba.org/sentences/my_sentences
We used to display it in the sentences menu, so that people could go to the standard display (in Browse). Then we took it out and made the text of the sentence as a link instead.
But then there was a problem when you adopted a sentence, you couldn't browse to the sentence because it would be editable (clicking on it would display the input field to edit the sentence).
So we added back this little icon, so that you can browse to the sentence even when it belongs to you.
Suggestion box
Just thought of a quick idea for comments / wall postings.
Some sort of BBcode or tag system to mark sentence numbers in comment text so they can be turned automatically into links.
e.g. "See also ##250165"
would turn into
"See also <a href=tatoeba.org/eng/sentences/show/250165>250165</a>"
Yes, it's something I've been thinking about as well... But the question is, what format to use?
As far as I'm concerned, I'd tend to write it with only one #.
"See also #250165"
But perhaps it's too simple and there can be issues with this, I don't know.
How about using the "nº"? It will hardly be used for anything else and is available on every page by the sentence number, ready to be copied along with it.
Except that there is no "nº" displayed on 'the Wall' so it would only really work for comments there, not posts here as well.
Why would there need to be one displayed on the Wall? It's where you look up the sentence number.
... well, unless you have it memorised.
Sentence numbers are also included in the csv export files. (Which I often work from)
OK, I'll admit it - I just don't like nº's.
Actually, perhaps it's just easiest to go with something dead-easy. The #<number> is nicer than the ##<number> unless the latter markup were trimmed somehow, but that would make the usage unclear to new users.
The occasional false-positive wouldn't be obtrusive and well balanced out by getting people to drop the use of the hash character in their writing.
Oh, and if we're building a bike-shed, I want it red and will fight anyone who thinks otherwise tooth and nail!
I think ## is better
I also recommend ## rather than #, because you will be less likely to get false positives.
A format I've seen other places is sentence #123, which is convenient because you can generalize it to list #123, wall #123, etc. Sentence links may be common enough to warrant a shorter special syntax, though.
I also think the syntax should be kept in the final post, e.g. if you use ##, it should turn into <a href=tatoeba.org/eng/sentences/show/123>##123</a>, so new users can immediately tell how it's done. Using the format other places, like in the message logs, also helps discoverability.
I was just about to suggest the same thing. This would be immensely useful.
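As a rough illustration of the `##<number>` proposal, including the suggestion to keep the markup visible in the rendered link, here is a minimal Python sketch. The function name `link_sentences` is made up for this example, the link format is copied from the posts above, and nothing here reflects what Tatoeba actually implemented.

```python
import re

# Two hashes followed by digits, e.g. "##250165".
SENTENCE_REF = re.compile(r"##(\d+)")


def link_sentences(text):
    """Turn every ##<number> into a link, keeping the ## markup visible."""
    return SENTENCE_REF.sub(
        r"<a href=tatoeba.org/eng/sentences/show/\1>##\1</a>", text
    )


print(link_sentences("See also ##250165"))
# A single hash is left alone, which avoids false positives like this one:
print(link_sentences("This is my #1 favourite sentence"))
```

Switching to the single-hash format would only require changing the pattern to `#(\d+)`, at the cost of also linking text such as "#1 favourite".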
I've been wondering how best to make notes of genders when translating from languages such as English that don't always differentiate between the different genders.
So far I've just been using the comments, but figured this might eventually go into metadata. There are several interesting ways to tackle that, but in case it might help me pick what to put in the comments and how to form it, has it been decided how issues like these will be tackled? Should I perhaps just be adding an extra sentence for each variation?
I think this is one of a number of things that are 'on the backburner'. I don't think there's likely to be much done about it for some time.
I wouldn't go the route of adding extra sentences as that would just produce needless duplication of content.
I think what we need is
* Meta-data that is not included as annotations, but in a separate field.
* A method of showing / hiding meta data associated with a sentence.
* A format for entering the metadata, and translations for them (at least the most common ones).
Extra sentences wouldn't actually be needless duplication. They're essentially equally valid (though partially identical) sentences that happen to translate to the same sentence in at least one language. The main reason I haven't created them is that I'm hoping for a much more elegant solution later on.
Look, there are 151,909 Japanese sentences. Each could have a feminine and a masculine variant. Each could be plain form or (-masu) form, and many would also have one or more extra polite versions. That would take you up to over 600,000 sentences (and probably give Trang a heart attack). If that isn't needless duplication I don't know what you'd call it.
I'd call it unhelpful.
Genders in Japanese are simple in that they don't affect conjugation. In other languages they do.
Take for example the Icelandic phrases:
o Hann var keyptur/seldur/gefinn. Þeir voru keyptir/seldir/gefnir.
o Hún var keypt/seld/gefin. Þær voru keyptar/seldar/gefnar.
o Það var keypt/selt/gefið. Þau voru keypt/seld/gefin.
that translate to English as:
o {{He, she, it} was, they were} bought/sold/given.
As you see they are highly irregular and therefore a student of the language would be well served with examples of each.
The only problem I see with adding all these sentences lies in storing their relationships to each other. Cluttering the search results or translations overview with multiple translations for every Finnish pronoun is just silly. Such translations wouldn't be unnecessary, but they ultimately wouldn't be helpful either, which is why metadata would really come in handy.
As for proliferation of sentences, I'd imagine they'd give Trang great joy -- provided that they're useful, of course. That's the point of the project, after all.
At 500 sentences a week, I'll hit 600,000 in 23 years. :-)
The community at large has, however, contributed an average of 324 sentences a day since the end of the late March slump, which would add that many sentences in a little over five years, bringing the total number to 974,851 sentences.
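For anyone who wants to check the arithmetic in the two posts above, a quick Python sketch follows. The current total of 374,851 sentences is not stated anywhere in the thread; it is only inferred here by subtracting 600,000 from the quoted 974,851.

```python
TARGET = 600_000  # extra sentences discussed above

# One contributor at 500 sentences a week.
years_alone = TARGET / 500 / 52

# The whole community at 324 sentences a day.
years_community = TARGET / 324 / 365

# Inferred, not quoted: the total before the extra sentences are added.
implied_current_total = 974_851 - TARGET

print(f"{years_alone:.1f} years alone")
print(f"{years_community:.1f} years for the community")
print(f"{implied_current_total:,} sentences at present (inferred)")
```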
ugh...I've got the same issue in Arabic, sure other languages do have it to. What I did was keep adding different conjugations to different sentences until I felt like I covered them all, and if I did add another conjugation to the same base sentence, I make sure to use different synonyms and wordings to say the same idea.
Nonetheless, I do agree that some kind of metadata to indicate conjugation (or any other nuances) in the future would be nice.
*too
As for the metadata... I believe all that metadata should be attached to the original sentence, not to the translated one. I.e., [M] and [F] should not be attached to the Finnish sentence (as they currently are), but to the English sentence that has the gender distinction (he/she). And the Finnish should go unmarked.
Speaking of metadata, I believe we also need metadata for transcription, to fix it when it cannot be correctly generated automatically.
And also author and origin information for some sentences would be nice to have.
And a simple possibility to add IPA to sentences in any language would also be nice, though I'm not sure if this is really necessary.
I'm not quite sure what you mean by translated versus original sentences. Surely the idea is that they're equivalent. I do agree with you that the metadata should be associated with the more specific language.
It would be nice if one could define, for example, the genders of pronouns and contribute variations on these. That way, if one came across a Finnish sentence such as nº354807:
He väittivät, että hän tappoi hänet.
one could choose from the various Icelandic versions:
{Þeir, Þær, Þau} kváðu {hann, hana} hafa drepið {hann, hana}.
though implementing that could be a bit tricky.
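To give a feel for what choosing "from the various Icelandic versions" could involve, here is a small Python sketch that expands the brace notation used above into every concrete sentence. The `expand` helper is purely hypothetical, it only handles flat (non-nested) braces, and it says nothing about how such variants would be stored or linked.

```python
import itertools
import re

# A slot is a flat {a, b, c} group with no nested braces.
SLOT = re.compile(r"\{([^{}]*)\}")


def expand(template):
    """Return every sentence obtained by picking one option per slot."""
    options = [
        [option.strip() for option in group.split(",")]
        for group in SLOT.findall(template)
    ]
    frame = SLOT.sub("{}", template)
    return [frame.format(*choice) for choice in itertools.product(*options)]


variants = expand("{Þeir, Þær, Þau} kváðu {hann, hana} hafa drepið {hann, hana}.")
print(len(variants))
for sentence in variants:
    print(sentence)
```

Three slots with 3, 2 and 2 options already give 12 sentences for one Finnish original, which is the clutter problem raised earlier in the thread.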
I reckon it might be wiser to focus on improving the automatic generation algorithms and encourage people to report errors than adding comments on the readings.
As more people are able to record their voices than transcribe into IPA, these are probably largely unnecessary here on this project.
It might however be useful to get readings for symbols such as numbers. I tend to write these out rather than use numerals.
> Surely the idea is that they're equivalent.
Hmm... I thought that metadata is to be added by a translator to explain a sentence that contains information that was lost or added during his translation; that’s why I mentioned the original sentence.
It shouldn’t be obvious from a sentence that it’s the original/a translation, of course.
> I reckon it might be wiser to focus on improving the automatic
> generation algorithms and encourage people to report errors than
> adding comments on the readings.
I do agree that improving algorithms is important. But there are situations when it’s impossible to generate transcription automatically because it requires complex grammar analysis or even understanding of the situation in which the sentence can be said.
Moreover, adding metadata for transcription will help to easily find problems with our current transcription algorithm. If people write this in comments, they’ll get lost. If a special field is designated for a transcription, these can be easily found by a DB search by those who improve the algorithm.
> As more people are able to record their voices than transcribe
> into IPA, these are probably largely unnecessary here on this project.
I’m not sure about this. Most people have at least some understanding of transcription and IPA because it is used at school.
The problem with voice files is the high quality required. As for me, it’s certainly easier to transcribe something than to buy a microphone and learn how to use it.
OK, now I understand what you mean by "original" sentence. Personally I'm not quite sure where the information would best be stored. I've been sort of leaning towards only storing information about added information (i.e. more specific sentences). One of the problems with the information is that a sentence may be connected to several others and indicating what metadata refers to which "original" sentence might be difficult.
There are several solutions to this problem, but the ones I've been thinking may well be too complicated. Frankly, the current situation doesn't really bother me all that much.
While we're calling the solution different things, I completely agree with you on the transcription issue.
On the IPA, I'm not convinced of its usefulness, but if you are, there certainly will be others who'd agree and benefit from it.
> I'm not quite sure what you mean by translated
> versus original sentences. Surely the idea is that
> they're equivalent.
Yeah, that's the _theory_, but it often isn't the practice. To take one obvious example, if one of the sentences is a quote then the original source was only in _one_ of the languages - all the rest must be translations.
Japanese index update
I've sent in an update to 716 records for the Japanese index data to the team@tatoeba.fr address - hope you can sneak them in. ;-)
I have been amending the indices in situ in a few places, so I hope we
don't overlap. Perhaps we need an RCS of some sort.
If there was a 'last changed' date field (to the nearest day) the SQL on the update could be made to avoid changing entries last changed past a certain day.
The risk of overlap isn't going to be huge though, so anything too complicated or tricky to implement may actually reduce work efficiency.
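The 'last changed' idea could look roughly like the Python/SQLite sketch below. Everything in it is an assumption made for illustration: the table `japanese_index`, its columns, the dates and the `apply_update` helper are all hypothetical and are not taken from the real Tatoeba database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE japanese_index ("
    "sentence_id INTEGER PRIMARY KEY, index_text TEXT, last_changed TEXT)"
)
conn.executemany(
    "INSERT INTO japanese_index VALUES (?, ?, ?)",
    [
        (1, "old index 1", "2010-03-01"),     # untouched since the export
        (2, "edited in situ", "2010-04-20"),  # amended on the site afterwards
    ],
)


def apply_update(conn, sentence_id, new_text, cutoff, today):
    """Update a row only if it was not changed after `cutoff`.

    Dates are ISO strings (YYYY-MM-DD), so plain string comparison works.
    """
    cursor = conn.execute(
        "UPDATE japanese_index SET index_text = ?, last_changed = ? "
        "WHERE sentence_id = ? AND last_changed <= ?",
        (new_text, today, sentence_id, cutoff),
    )
    return cursor.rowcount == 1


batch = [(1, "batch index 1"), (2, "batch index 2")]
results = {
    sid: apply_update(conn, sid, text, cutoff="2010-04-01", today="2010-05-01")
    for sid, text in batch
}
print(results)  # rows edited in situ after the cutoff are skipped
```

Rows that come back as skipped would still need a manual look, so this only avoids silent overwrites rather than resolving the overlap.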
I'm planning to add some Cantonese sentences. This might be a stupid question, but should I include the jyutping in the sentences?
As far as I can see, currently transcription isn't generated for Cantonese (eg. sent. No. 382502). But it's generated for Mandarin and Shanghainese, so I believe it'll be implemented in the future.
Yep, you're right Demetrius, it will be added soon, also as beta support for Jyutping, but if you know of good free software for Cantonese romanization, tell us :)
@nickyeow thanks for contributing (also) in Cantonese :)
I don't know any of these, unfortunately.
The only Cantonese wordlist with transcriptions I’ve seen is here: http://e-guidedog.sourceforge.net/cantonese.php , but it is inaccurate according to its creators.
No, you shouldn’t. All transcription is generated automatically. Maybe it isn't generated for Cantonese (I don't know), but it should be generated in future.
But if you're inclined to, you can add transcription in comments. :)
It's good we'll have Cantonese sentences! :)
Btw, is Jyutping really employed more often than Yale? In all books I read and the course I attended Cantonese Yale was used.
In Hong Kong we usually use Jyutping, but you can also see Yale employed in some dictionaries.
At HKUST they teach Yale :)
Thanks for your answer! :)
Member status
I think it would be a good idea to be able to tell who has 'trusted user' status and who has 'admin access'. I suggest:
* Icon next to name in comments posted and next to posts on the wall.
* Full title given in the user's profile "%s is a Trusted-User" or something. Could make Trusted-User link to an explanation of what that means and how you get it.
There is a way to tell, but it's not obvious. From the "Members" section, if you re-organize by status, you can easily see who are the current trusted users:
http://tatoeba.org/eng/users/al.../direction:asc
But I agree it would be nice to be able to tell, right from the comments, what is the status of a user.