Here you can ask general questions, such as how to use Tatoeba or how to report bugs and strange behavior, or simply socialize with the rest of the community.
Before asking a question, make sure you have read the FAQ.
Latest messages
Wall (3749 threads)
from switching between normal font and the smaller furigana. It appears - at least to me - that hiragana is a better option. So a Japanese sentence could be displayed in original form with kanji and in hiragana form with furigana turned off in the same font size. Further the original kanji part and its hiragana form could be slightly highlighted for easier recognition.
Because I always run into a problem whenever I search for an Arabic sentence. For example, when I search for the word ذهب (to go / he went), the results contain only that exact form. There are no results for related forms such as ذهبت (you went), نذهب (we go), يذهبون (they go), أذهب (I go), etc.
And don't you think it would be very useful if we could search for example sentences in a language by word part or grammatical feature? For example, I want to search for all Turkish sentences containing the suffix -iyor/-ıyor/-uyor/-üyor, regardless of the verb itself...
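As a rough illustration of the two kinds of search asked about above, here is a minimal Python sketch. It is not how Tatoeba's search works; the function names are made up for the example, and the regular expression is a crude approximation that can also match non-verbs that happen to contain the same letters.

```python
import re

def contains_form(sentence: str, stem: str) -> bool:
    """Substring match: also finds the stem inside longer inflected forms."""
    return stem in sentence

# A word containing -iyor/-ıyor/-uyor/-üyor, optionally followed by
# further endings (for example the person ending in "gidiyorum").
PROGRESSIVE = re.compile(r"\w+[iıuü]yor\w*")

def find_progressive(sentence: str) -> list:
    """Return every word in the sentence that matches the suffix pattern."""
    return PROGRESSIVE.findall(sentence)
```

With this, `contains_form("ذهبت إلى المدرسة", "ذهب")` is true even though the sentence only contains the inflected form, and `find_progressive("Ben okula gidiyorum.")` picks out `gidiyorum` without naming the verb.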
I’m currently working on improving transcriptions in Tatoeba. Currently, transcriptions are automatically generated by a piece of software that sometimes fails to provide correct transcriptions. A few of these failures are tagged with “incorrect transcription” or “furigana mistake” for Japanese.
I plan to address this problem in two ways:
(1) improve the software that generates transcriptions based on feedback from users;
(2) allow users to edit transcriptions so that they can fix problems themselves.
Both of these approaches have limits, depending on the accuracy of the current transcriptions, the type of errors and the language. For instance, if a given transcription error is very widespread, it’s better to fix the software rather than to fix the same error by hand on a large number of sentences.
In Japanese, I am knowledgeable enough to know that no software is capable of providing 100% accurate transcriptions, so I will use mostly approach (2), and only a little bit of (1). However, I don’t know if it’s a good idea to use (2) in other languages we provide transcription for, namely Chinese (both traditional/simplified conversion and Pinyin), Shanghainese, Cantonese, Uzbek and Georgian.
So my question is, for each of these languages:
• To what extent are the current transcriptions accurate? Try to give a percentage.
• If this percentage is lower than 100%:
• Do you think it’s a good idea to systematically display autogenerated transcriptions (like we do now) although some are incorrect?
• What type of errors can you see in the transcriptions? Try to categorize. If you know software development, how easily can they be detected and fixed?
Feel free to translate this post into the languages concerned.
to answer your questions, I would say that
* the percentage of correctly transliterated sentences without any problems should be around 80 to 90 percent. I regularly notice errors, but when I pay active attention, I see that most sentences I come across are in fact transliterated correctly. however, my skills in both languages are only mediocre, so this might be somewhat off from the actual situation.
* yes, autogenerated transcriptions with flaws are better than none. just think of our contributors manually adding the transliteration to every single japanese sentence they own. I wouldn't want to have them do that, not to mention the fact that the majority of japanese sentences on tatoeba are orphans.
* the errors mainly concern characters with multiple readings (duh). the wrong choice of transliteration is then sometimes chosen by the tool for both morphological and syntactic reasons. e.g., in mandarin compounds, the problem is apparently that the compound is saved in the tool's database as one word, but with the wrong pinyin. but often single characters (i.e. where a character doesn't have any direct context with other characters that form a word with it) are wrongly transliterated if the tool cannot correctly analyze the syntax of a sentence. I have no idea about computing of any kind, so this is where my part ends.
I generally think that giving sentence owners the possibility to manually change a sentence's transliteration would be a very good first step. we could also add some kind of marking whether a sentence's transliteration is automatically generated or hand-made.
stoked to see this being improved! I'll be happy to offer all the help I can.
About the Japanese furigana/transcription
> * yes, autogenerated transcriptions with flaws are better than none.
I strongly disagree. To me, there is no point in showing furigana if it’s not trustworthy. It’s no good for Japanese learners because they may think it’s trustworthy when it’s not, and because they eventually need to know the readings themselves to be able to judge whether it’s correct or not, which defeats the original purpose. It’s no good for Japanese speakers either because they may think Tatoeba is not a serious project. See also http://en.wiki.tatoeba.org/articles/show/furigana.
Because of this, I plan to hide furigana by default unless it has been reviewed, and to provide an option for showing unreviewed transcriptions by default for every sentence, as is done now.
> just think of our contributors manually adding the transliteration to every single japanese sentence they own.
Yes, the task of manually adding transcriptions for every sentence is huge but I think it’s the only way. I plan to ease that process by using autogenerated transcriptions as a base one could review by editing the wrong parts only. Something like this:
1. Show autogenerated transcription by clicking a button http://prntscr.com/71kvcc
2. Edit and send it http://prntscr.com/71kvvy
3. It’s reviewed http://prntscr.com/71kwho
In addition, I plan to allow reviewing transcriptions of other contributors’ sentences unless the sentence owner has already reviewed them.
What do you think?
> Which ones?
the two I mentioned, japanese and mandarin.
>> * yes, autogenerated transcriptions with flaws are better than none.
> I strongly disagree.
haha wow, alright.
I see your point, though. for me personally, not having furigana on most japanese sentences would be a bummer because I use tatoeba all the time to check the readings of sentences I have in anki, but I can just go back to using jisho.org for this purpose (which has issues similar to those on tatoeba, but maybe a tad less? I'm not sure). if there is a furigana mistake in a sentence, I usually remember it, so I can maneuver around those when studying in anki. but yeah, I see this is a relatively specific need, so I can't assume the majority of tatoeba users will have it as well. I won't be seeing japanese sentences on tatoeba as much as before, thus having fewer opportunities to notice any problems, but, oh well.
HOWEVER. I really like your idea of introducing a "show autogenerated transcription with a warning label" button. it sounds like a very good compromise between getting rid of wrong transliterations and yet maintaining the possibility to get a transcription for any sentence right away at your own risk. it would also raise the awareness that there are errors by a mile. so:
> What do you think?
I think you should go for it. try the same thing for mandarin, we'll see how it works out.
> I see your point, though. for me personally, not having furigana on most japanese sentences would be a bummer because I use tatoeba all the time to check the readings of sentences
That’s why I thought about having an option to display transcriptions by default like it is now. Members would need to opt-in by going to their settings, and this would be a good chance to warn them about untrustworthy transcriptions. I will probably implement this in a later update though, it’s not essential.
Will the romanisation be editable as well? There are some strange word separations such as "ni hiki" and "i masu" instead of "nihiki" and "imasu".
You can, but not directly. I don't want to make it fully editable because I feel it's going to become too inconsistent, and because it is based on the furigana. Instead, if you feel like editing the romaji word separation, edit the furigana (remove or insert spaces) and the romaji will update accordingly.
I already dealt with the problem you're mentioning (verbs like "i masu" separated) in a recent update, but only for the furigana. They now display います instead of い ます, and so will the romaji once I'm done with this.
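To make the idea of deriving romaji from the furigana concrete, here is a minimal Python sketch. It is only an illustration, not the code running on the site: the function name is made up, and the kana table covers just the syllables needed for the examples in this thread.

```python
# Tiny illustrative kana table (an assumption for this example). A real
# converter needs the full syllabary plus rules for small kana, long
# vowels and the sokuon.
KANA_TO_ROMAJI = {
    "い": "i", "ま": "ma", "す": "su",
    "に": "ni", "ひ": "hi", "き": "ki",
}

def furigana_to_romaji(furigana: str) -> str:
    """Convert space-separated kana words to romaji, keeping the word breaks."""
    words = furigana.split()
    return " ".join(
        "".join(KANA_TO_ROMAJI[kana] for kana in word) for word in words
    )
```

Because the word breaks come straight from the furigana, `furigana_to_romaji("い ます")` gives `i masu`, while removing the space and converting `います` gives `imasu`.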
As a Cantonese learner, I'm using the Cantonese transcription quite a lot. It does have flaws (hard for me to estimate the percentage, will try to take several pages and count the number), but it is useful for me because I have problems memorising the tones, and I usually can spot most transcription errors. Generating a completely correct transcription is very complex programmatically, probably even more complex than for Standard Written Chinese, because SWC is more codified.
Your comment about Cantonese transcriptions and pullnosemans’s about Mandarin transcriptions suggest that although they are inaccurate, these transcriptions are still very useful to you, so I will definitely need to implement the “display everything by default” option before the first release.
Speaking of which, I’d like to have your opinions (especially Pfirsichbaeumchen’s since you’re admin) about how to manage transcription editing permissions. I tried to find a balance between keeping it open and preventing editing problems. There are so many transcriptions that it doesn’t make much sense to me to only allow sentence authors to submit transcriptions of their sentences.
For a given sentence of a language in which we allow transcription editing:
• If nobody submitted a transcription, anybody may submit one.
• If you’re the owner of the sentence and someone submitted a transcription, you may overwrite it with your own.
• Once the sentence owner submitted a transcription to his/her own sentence, only he/she, corpus maintainers and admins may further modify it.
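The three rules above can be sketched as a small permission check in Python. This is a hedged illustration, not the actual implementation: the `User` and `Sentence` classes and the role names are hypothetical, and the case the rules leave open (a non-owner editing a transcription submitted by another non-owner) is assumed to be allowed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    name: str
    role: str = "contributor"  # hypothetical roles: contributor, corpus_maintainer, admin

@dataclass
class Sentence:
    owner: Optional[str]                    # None for an orphan sentence
    transcription_by: Optional[str] = None  # who submitted the current transcription

def can_edit_transcription(user: User, sentence: Sentence) -> bool:
    # Corpus maintainers and admins may always modify a transcription.
    if user.role in ("corpus_maintainer", "admin"):
        return True
    # Rule 1: nobody has submitted a transcription yet, so anybody may.
    if sentence.transcription_by is None:
        return True
    # Rule 2: the sentence owner may overwrite anyone's transcription.
    if sentence.owner is not None and user.name == sentence.owner:
        return True
    # Rule 3: once the owner has submitted, it is locked for everyone else.
    if sentence.transcription_by == sentence.owner:
        return False
    # Assumption (not stated in the rules): a transcription submitted by a
    # non-owner stays open to other contributors.
    return True
```

For example, if `alice` owns a sentence and `bob` submitted its transcription, `alice` may still overwrite it; once `alice` has submitted her own, `bob` is refused while an admin is not.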
It sounds about right for transcriptions that are well-known by natives such as in Uzbek Cyrillic/Latin or Japanese furigana, but what about Pinyin or Jyutping? To what extent are Mandarin and Cantonese speakers confident with these? And for Shanghainese, the current transcription is based on IPA, but I believe most learners can’t read it and most natives can’t write it… What do we do with this?
In my experience, it's easiest to talk about pronunciation with native speakers by using other characters with the same pronunciation. But I'm not sure how this can be implemented in an intuitive way. Probably, have an on-hover hint for all the syllables?.. :?
The situation with Pinyin seems better. If I'm not mistaken, it is taught at school in Mainland China (but I believe Taiwan uses Zhuyin instead).
1. A non-native speaker provides a wrong transcription for a bad sentence.
2. Native speakers don't want to correct the transcription because that might give the wrong impression that it was a good sentence, nor do they want to correct the sentence because that would prompt non-native speakers to add even more bad sentences. As a result, the wrong transcription remains uncorrected.
3. People "learn" from it, believing it to be trustworthy because the transcription isn't machine-generated.
I don’t think there is an easy solution to this, it’s just like trying to deal with people adding incorrect sentences to Tatoeba.
Of course, this should ideally be a selectable option (probably selectable in the settings) because most Chinese learners do want to read it with Mandarin pronunciation, not with Cantonese.
On a side note, it would be nice if pronunciations of compounds could be displayed with spaces in between the individual syllables. 'zung6 jiu3' looks much better than 'zung6jiu3' in my opinion.
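As a small illustration of the spacing suggested here, the following Python sketch inserts a space wherever a Jyutping tone digit is immediately followed by the next syllable. The function name is made up for the example, and it assumes lowercase Jyutping with tone numbers 1 to 6.

```python
import re

def space_jyutping(text: str) -> str:
    """Insert a space between a tone digit (1-6) and a following letter."""
    return re.sub(r"(?<=[1-6])(?=[a-z])", " ", text)
```

So `space_jyutping("zung6jiu3")` gives `zung6 jiu3`, and text that is already spaced is left unchanged.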
I think the autogenerated transcriptions for Cantonese are doing okay. There are mistakes here and there, but most of them are caused by a small set of characters with multiple pronunciations. Many of these can be solved by adding more pronunciations for compounds. I'd say the percentage of sentences with completely correct transliterations is around 90% to 95%.
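To show why adding compound readings helps with characters that have several pronunciations, here is a minimal Python sketch of a longest-match lookup. It is not the tool Tatoeba uses: the function name and both tables are assumptions for illustration, and the readings are a handful of sample entries rather than real dictionary data.

```python
# Illustrative readings only; not taken from Tatoeba's actual data.
COMPOUNDS = {"重要": "zung6 jiu3"}
SINGLE = {"重": "cung5", "要": "jiu3", "好": "hou2"}

def transcribe(text: str, max_len: int = 4) -> str:
    """Transcribe text, preferring the longest known compound at each position."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate compound first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in COMPOUNDS:
                out.append(COMPOUNDS[chunk])
                i += n
                break
        else:
            # No compound starts here: fall back to the single-character
            # default reading, or keep the character if it is unknown.
            ch = text[i]
            out.append(SINGLE.get(ch, ch))
            i += 1
    return " ".join(out)
```

With these tables, 重 gets its default reading in `transcribe("好重")` but the compound reading in `transcribe("好重要")`. A lookup like this cannot help with cases that depend on meaning rather than on neighbouring characters.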
Some characters can be quite tricky though. For instance, the final particle 喎 is pronounced wo3 when it indicates a casual remark, wo4 when it indicates a sort of playful scolding, and wo5 when it is used to quote something undesirable. Apparently, it can also be pronounced kwaa1, waa1, and wo1, although these are extremely rare. To be fair, I only found out about these (I blush to confess) when I looked them up in a dictionary. :p
I think the review system you suggested would be the best way to solve the problem. Perhaps autogenerated transcriptions could still be displayed—they are correct most of the time after all—but it would be nice if reviewed transcriptions could be given a little green tick or something.
Perform a regular search, and then you’ll see additional criteria on the right: sentence owner and orphan sentences for the moment. I made orphan sentences hidden by default. This way, they are hidden from top bar searches, but can be displayed by checking the additional criterion, lowering their visibility to newcomers.
What do you think?
It becomes this query when I search for my sentences corresponding to that query: https://dev.tatoeba.org/ita/sen...ser=Guybrush88
As you can see, no results are shown because the accent is changed by the query, while sentences I own are shown without specifying my username
Would it be possible to add more than one username in a comma-delimited list, similar to how members can limit languages in their settings?
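A comma-delimited username filter of the kind asked about here could look roughly like the Python sketch below. It is only an illustration: the function names and the dictionary shape of a sentence record are assumptions, and usernames are compared exactly as typed.

```python
def parse_usernames(raw: str) -> set:
    """Split a comma-delimited list, ignoring blanks and surrounding spaces."""
    return {name.strip() for name in raw.split(",") if name.strip()}

def filter_by_owner(sentences, raw_usernames: str) -> list:
    """Keep only sentences owned by one of the listed users.

    An empty list of usernames means no restriction.
    """
    wanted = parse_usernames(raw_usernames)
    if not wanted:
        return list(sentences)
    return [s for s in sentences if s["owner"] in wanted]
```

Orphan sentences (owner `None` in this sketch) never match a username, so they drop out whenever a list is given.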
For example, here is a list of the Japanese native speakers who have contributed the most sentences.
This would allow members to search for sentences by members that they feel can be trusted.
Would it be possible to allow us to also limit searches to only sentences with audio?
I'd suggest this change in wording.
Orphan sentences are likely to be incorrect.
Orphan sentences are less likely to be correct.
an automatic 'native speakers' filter would probably be cool, too, but I also very much agree with sacredceltic's caveat below; you just never know who claims to be native. having an individual list as in ck's suggestion #1 would be a good way to cope with this problem.
I don't think, however, that hiding orphans should be the default in the way that you have to check "show orphans" every single time you submit a search query. I think this would lead to a decrease in orphans being adopted and amended. let's rather have it so that you can check "show orphans" and it stays like that until you manually uncheck it again.
it's great seeing this site improving constantly!
I don’t really like the idea of providing a comma-separated list instead of filtering by self-proclaimed natives. First, because it’s rather impractical to use as the list grows. Second, because it restricts the ability to filter by native speakers to a handful of long-time contributors who have their own idea on that matter. I’m worried about newcomers (who obviously won’t express themselves in this thread) being unable to use the search as efficiently as you guys would. That would be unfair. The current lack of native speaker identification and of a proper review mechanism to sort out “bad” sentences should be solved first, rather than worked around by that kind of “feature”. I can already see members providing ready-to-use search links in their profiles that filter users from their list. That said, filtering by multiple users itself (regardless of the motivation) seems legit, and is easy to implement.
I agree with what you said about orphan visibility. I initially wanted to limit the visibility of orphans because they are a major problem in some languages like Japanese, where more than half of the corpus consists of orphans that are mostly wrong. But that’s another problem.
If it's difficult to program this capability, I can understand.
However, being able to search for sentences by more than one username would be useful.
For example, ...
1. People could limit searches to sentences owned by Brazilian Portuguese speakers, or Mexican Spanish speakers if they knew which members spoke which dialect.
2. People could use all the native speakers listed on http://bit.ly/nativespeakers rather than just the few that are listed using the new system on tatoeba.org. We have a lot of sentences written by native speakers that are never likely to come back and change the setting in their profiles.
3. People could choose to exclude certain self-proclaimed native speakers that they didn't trust.
4. People could choose to also include a few non-native speakers that they feel they can trust.
5. Some researchers may want to study typical English errors made by native Russian speakers, so they could browse through search results limited to English sentences written by Russian speakers.
It would probably also be a good idea to have the “limit to sentences by self-proclaimed natives” that you are suggesting.
** Added 6 hours later **
Here is the number of members claiming "native speaker level" in more than one language:
1 member claims 4 languages at native level.
7 members claim 3 languages at native level.
54 members claim 2 languages at native level.
This is based on the exported data of May 23, 2015.
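For anyone who wants to reproduce this kind of count, here is a minimal Python sketch. The input format is an assumption (plain pairs of username and language for each native-level claim); the actual layout of the exported data is not described in this thread, and the function name is made up.

```python
from collections import Counter

def native_claim_histogram(claims) -> Counter:
    """Map "number of native languages claimed" to "number of members".

    claims: iterable of (username, language) pairs, one per native-level
    claim. Only members claiming more than one language are counted.
    """
    # Deduplicate first so a repeated pair is not counted twice.
    per_member = Counter(user for user, _lang in set(claims))
    return Counter(n for n in per_member.values() if n > 1)
```

The result maps each number of claimed languages to the number of members making that many claims, which is the shape of the figures listed above.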
I wonder if other members are as skeptical as I am about these claims.
This is one reason I'd like the option to search with results limited to usernames of my own choosing.
If you want to see the usernames, go to http://goo.gl/K8vGKl.
There are perhaps a few on this list that I might trust as being true native speakers of two languages.
I updated http://bit.ly/nativespeakers so now you can easily copy a comma-delimited set of usernames for each language.
If searching by multiple usernames is enabled, you can easily go here and copy the usernames, and then edit out members you don't trust (if there are any).
> 2. People could use all the native speakers listed on http://bit.ly/nativespeakers rather than just the few that are listed using the new system on tatoeba.org. We have a lot of sentences written by native speakers that are never likely to come back and change the setting in their profiles.
How about incorporating the information on this page into the official system? Would anyone object to it?
Yes. I’ll implement this.
> I would also find it useful if i could see all the sentences with a given expression/word that are translated in a given language. for example: i search for "apple pie" and i want to see only the sentences containing "apple pie" that have translations in Italian
You mean https://tatoeba.org/sentences/s...eng&to=ita ?
Many online-dictionaries I use have a drop-down list where you can choose what kind of search you want to make. For example, this Japanese dictionary http://dictionary.goo.ne.jp/ has options "begins with", "exact match" and "ends with" and you can specify your search with those.
I would also like to see something like that in Tatoeba. So there would be next to the search field another drop-down list with options to choose, eg.
- vague matches (eg. "live in boston" or "live") <-- this would be the default. I'm assuming the quotation marks don't do anything if you are searching with only one word, eg. the search "live" returns the same results as plain live, right?
- exact matches (eg. "=live =in =boston" or "=live") (though this wouldn't work when searching phrases in languages without spaces, I guess)
- begins with (eg. "^live in boston" or "^live")
- ends with (eg. "live in boston$" or "live$")
+ maybe something else, like "begins and ends with" (eg. "^live in boston$" or "^live$".)
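The drop-down options listed above could be turned into query strings along these lines. This Python sketch only reproduces the operator examples given in this post (`=`, `^`, `$`); the function name and mode names are made up, and whether the search engine accepts every combination is not confirmed here.

```python
def build_query(text: str, mode: str = "vague") -> str:
    """Rewrite a plain search phrase according to the selected match mode."""
    words = text.split()
    if not words:
        return ""
    if mode == "exact":
        # Exact-form operator on every word, e.g. "=live =in =boston".
        return " ".join("=" + word for word in words)
    phrase = " ".join(words)
    if mode == "vague":
        return phrase
    if mode == "begins":
        return "^" + phrase
    if mode == "ends":
        return phrase + "$"
    if mode == "begins_and_ends":
        return "^" + phrase + "$"
    raise ValueError("unknown search mode: " + mode)
```

For instance, `build_query("live in boston", "exact")` gives `=live =in =boston`, matching the example in the list.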
Now, if someone chooses to "unown" a sentence, the OK tag disappears, so we lose important information.
In the past, when a non-native English speaker chose to release all their English sentences, I could easily find all of their sentences that I had tagged OK and adopt those sentences.
Definitely, sign language may have problems with being compatible with the current state of Tatoeba.org but what about smileys, capitalizing words for emphasis, sarcasm, irony etc.?
Which parts of languages are allowed to be added to the database and which are not?
problem with using things like *this* or -this- for emphasis is that there is no consistent code how to use them, so they might be interpreted differently from what you wanted to express. then again, I think people will have enough empathic intuition to figure it out in the majority of cases.
(seeing as CK is an english and japanese bilingual, this was originally directed to him in a pm, but then I thought it couldn't hurt to just make it public to the community and include mandarin.)
I have a question about something that has been growing more and more odd to me here on tatoeba.
if an english sentence contains a constituent with the definite article "the" (or one of the german equivalents die/den/das/etc.), a lot of japanese and mandarin sentences are translated using the demonstrative "その" in japanese or "这/那" in mandarin. I used to think it's weird, but assumed that I simply did not know the two languages well enough to be able to judge.
japanese example: http://tatoeba.org/fra/sentences/show/208196
chinese example with "这": http://tatoeba.org/fra/sentences/show/793355
however, lately I've been noticing that my chinese tandem partner uses the demonstrative "diese/r/s" (the german equivalent of "that") where she should be using the definite article, so I've been wondering: maybe speakers of languages without articles are explained the usage of the english or german definite article by means of pointing to something, or explaining to them that unlike the indefinite article (english "a", german "ein/e"), the definite article refers to something specific. because of this, I've been wondering whether this could actually be the cause of all the translations of "the" by means of "その"/"这/那".
so, to finally get to the point: what do you think about translating something like
"The dog ate a carrot."
instead of something like "犬はニンジンを食べた。"
instead of something like "狗吃了胡萝卜。"
Do you think these translations are okay and "その"/"这/那" can be used in japanese and chinese respectively in this way, or do you think it's actually an unfitting translation and that only german/english demonstratives ("diese/r/s", "this/that") should be translated using demonstratives in mandarin and japanese? if yes, do you think the mistranslations might be caused by a frequent misunderstanding of the nature of english/german articles by native speakers of languages without articles?
very curious to hear what you guys have to say.
2. Obviously その in Japanese is used far less often than a definite article in Western languages.
3. At school, we learn to "translate" English sentences into a weird and clumsy Japanese. I once "translated" the English translations of some of my sentences into "Japanese" for fun.
"Is that Tom calling again?" "Yes. He calls every evening these days. I shouldn't have given him my number."
"Tom, your dinner's getting cold." "Just a minute. I'll be right there."
When you ask Japanese to translate something into Japanese, there's a high probability that their "translation" looks like this. At least that was the case for most of the students at Hyogo University and contributors on Tatoeba (including myself in my earlier days).
Some words are used markedly more often than the real Japanese, such as personal and demonstrative pronouns, and there are also many other differences.
4. Most Japanese sentences here are not wrong, but there are so many sentences that can only be used in limited situations. Some of them are so stilted that you can use them only when you write, even if they look like an example of spoken language. Some of them sound so impolite or vulgar that you should use them only when you're talking with close friends. So, if you're seriously interested in learning Japanese and not yet good enough at it to tell the nuance just by reading a sentence out of context, you'd better ignore Japanese sentences on Tatoeba.
tommy, I think your very first remark is actually quite interesting. definite articles in english and german are indeed used to refer to something that is already known in discourse. however, on tatoeba, we of course don't have discourse, so that might be a problem when translating them into languages without direct equivalents of the articles.
I'm right now guessing that you could say the translations with demonstratives for definite articles are acceptable if we interpret them all in this way (at least for me personally, because I use tatoeba mainly to boost vocabulary anyway).
still curious to hear more opinions.
> I use tatoeba mainly to boost vocabulary anyway
Many Japanese sentences here sound somewhat like "Er nahm ein Foto von dem Hunde." You can learn many words from it: nehmen = take, Foto = picture, etc. but the problem is not many German speakers "nehmen" pictures or say "dem Hunde" nowadays. If you don't mind it, just go ahead. If you're more serious about learning Japanese, you may want to take a look at http://yourei.jp/. It lists tons of real Japanese sentences.
I'm right now curious how accurate my sense of style in japanese has grown so far, and whether or not I can roughly rely on it in my selection of japanese phrases on this site, so just a very quick test for myself: am I right assuming that "熱をお計りになりましたか。", an orphan sentence from this site, would be an example of a less-than-natural sentence? it appears pretty weird to me. do you think that staying away from sentences with syntactic structures that appear overly cumbersome to me would be a good way to filter out unnatural sentences?
also, it would be important for me to know: can I dodge stylistically outdated sentences on here by avoiding orphans, or sticking to certain contributors?
I'd write 測る instead of 計る and I'd say (熱は or お熱は) instead of 熱を.
お測りになりましたか is perfectly fine (it's by no means outdated), though it might be more common to say 測られましたか.
> can I dodge stylistically outdated sentences on here by avoiding orphans, or sticking to certain contributors?
Sentences added by native speakers NOT as translations are usually good.
Sentences added by non-native speakers are often bad.
All the other sentences are sometimes good, sometimes bad.
Take a look at this thread if you haven't read it yet.
it seems that I'm still far from knowing idiomatic japanese well enough to be able to roughly judge whether a sentence is natural or not. I guess I will simply be staying away from orphans wherever I can now.
maybe I'll talk to my japanese tandem partner about this again. if I can get him interested in the tatoeba project, I'm sure he could contribute to making the japanese corpus more reliable.
what is the status quo in this respect anyway? are there any concrete plans how to get rid of the huge amount of japanese orphans? if the japanese corpus is that unsafe right now, hiding the orphans like you suggested back then might not be a bad idea. maybe having the number of japanese sentences indicated to be 60k instead of 180k would also lead to a greater motivation among japanese contributors to increase the number of good sentences.