gillux, 2015-05-16 07:22:45 UTC (edited 2015-05-16 07:26:54 UTC)

*** Attention to our members knowledgeable in Chinese, Shanghainese, Cantonese, Uzbek or Georgian. ***

I’m currently working on improving transcriptions in Tatoeba. At the moment, transcriptions are generated automatically by a piece of software that sometimes fails to provide correct ones. A few of these failures are tagged with “incorrect transcription” [1], or “furigana mistake” [2] for Japanese.

[1] https://tatoeba.org/tags/show_s..._with_tag/1673
[2] https://tatoeba.org/tags/show_s..._with_tag/1172

I plan to address this problem in two ways:
(1) improve the software that generates transcriptions based on feedback from users;
(2) allow users to edit transcriptions so that they can fix problems themselves.

Both of these approaches have limits, depending on the accuracy of the current transcriptions, the type of errors and the language. For instance, if a given transcription error is very widespread, it’s better to fix the software rather than to fix the same error by hand on a large number of sentences.

In Japanese, I am knowledgeable enough to know that no software is capable of providing 100% accurate transcriptions, so I will use mostly approach (2), and only a little bit of (1). However, I don’t know if it’s a good idea to use (2) in other languages we provide transcription for, namely Chinese (both traditional/simplified conversion and Pinyin), Shanghainese, Cantonese, Uzbek and Georgian.

So my question is, for each of these languages:
• To what extent are the current transcriptions accurate? Try to give a percentage.
• If this percentage is lower than 100%:
   • Do you think it’s a good idea to systematically display autogenerated transcriptions (like we do now) although some are incorrect?
   • What type of errors can you see in the transcriptions? Try to categorize. If you know software development, how easily can they be detected and fixed?

Feel free to translate this post into the languages concerned.

pullnosemans, 2015-05-16 10:38:01 UTC (edited 2015-05-16 15:58:15 UTC)

very glad that you're addressing this issue; I was going to mention it on tatoeba day, but the sooner, the better. note that there is also the "wrong transliteration" tag that I've recently been using for japanese and mandarin.

to answer your questions, I would say that
* the percentage of correctly transliterated sentences without any problems should be around 80 to 90 percent. I regularly notice errors, but when I pay active attention, I see that most sentences I come across are in fact transliterated correctly. however, my skills in both languages are only mediocre, so this might be somewhat off from the actual situation.
* yes, autogenerated transcriptions with flaws are better than none. just think of our contributors manually adding the transliteration to every single japanese sentence they own. I wouldn't want to have them do that, not to mention the fact that the majority of japanese sentences on tatoeba are orphans.
* the errors mainly concern characters with multiple readings (duh). the wrong choice of transliteration is then sometimes chosen by the tool for both morphological and syntactic reasons. e.g., in mandarin compounds, the problem is apparently that the compound is saved in the tool's database as one word, but with the wrong pinyin. but often single characters (i.e. where a character doesn't have any direct context with other characters that form a word with it) are wrongly transliterated if the tool cannot correctly analyze the syntax of a sentence. I have no idea about computing of any kind, so this is where my part ends.

I generally think that giving sentence owners the possibility to manually change a sentence's transliteration would be a very good first step. we could also add some kind of marking whether a sentence's transliteration is automatically generated or hand-made.

stoked to see this being improved! I'll be happy to offer all the help I can.

gillux, 2015-05-17 00:34:23 UTC (edited 2015-05-17 00:37:39 UTC)

> however, my skills in both languages are only mediocre

Which ones?

About the Japanese furigana/transcription

> * yes, autogenerated transcriptions with flaws are better than none.
I strongly disagree. To me, there is no point in showing furigana if it’s not trustworthy. It’s no good for Japanese learners, because they may think it’s trustworthy when it’s not, and because they would need to know the readings themselves to judge whether it’s correct, which defeats the original purpose. It’s no good for Japanese speakers either, because they may think Tatoeba is not a serious project. See also http://en.wiki.tatoeba.org/articles/show/furigana.

Because of this, I plan not to show furigana by default unless it has been reviewed, and to provide an option to show untrustworthy transcriptions by default for every sentence, as happens now.

> just think of our contributors manually adding the transliteration to every single japanese sentence they own.

Yes, the task of manually adding transcriptions for every sentence is huge, but I think it’s the only way. I plan to ease that process by using autogenerated transcriptions as a base that one could review by editing only the wrong parts. Something like this:
1. Show autogenerated transcription by clicking a button http://prntscr.com/71kvcc
2. Edit and send it http://prntscr.com/71kvvy
3. It’s reviewed http://prntscr.com/71kwho
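
The three steps above, together with the plan to hide unreviewed furigana by default, could be modelled roughly as follows. This is only a sketch: the `Transcription` class, the `reviewed` flag and the `show_unreviewed` option are hypothetical names chosen for illustration, not Tatoeba's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcription:
    text: str               # e.g. the furigana or Pinyin for one sentence
    reviewed: bool = False  # False: machine-generated and not yet checked


def review(auto: Transcription, corrected_text: Optional[str] = None) -> Transcription:
    """Steps 2 and 3: a member fixes the wrong parts (or confirms as-is) and submits."""
    text = auto.text if corrected_text is None else corrected_text
    return Transcription(text=text, reviewed=True)


def visible_by_default(t: Transcription, show_unreviewed: bool = False) -> bool:
    """Reviewed transcriptions are always shown; others only if the member opted in."""
    return t.reviewed or show_unreviewed


auto = Transcription("い ます")   # hypothetical generator output
fixed = review(auto, "います")    # the member edits only the wrong part
```

With these definitions, `visible_by_default(auto)` is false until the member either reviews the transcription or turns on the (assumed) `show_unreviewed` setting.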

In addition, I plan to allow reviewing transcriptions of other contributors’ sentences unless the sentence owner has already reviewed them.

What do you think?

pullnosemans, 2015-05-17 11:39:42 UTC (edited 2015-05-17 11:40:23 UTC)

>> however, my skills in both languages are only mediocre
> Which ones?
the two I mentioned, japanese and mandarin.

>> * yes, autogenerated transcriptions with flaws are better than none.
> I strongly disagree.
haha wow, alright.
I see your point, though. for me personally, not having furigana on most japanese sentences would be a bummer because I use tatoeba all the time to check the readings of sentences I have in anki, but I can just go back to using jisho.org for this purpose (which has issues similar to those on tatoeba, but maybe a tad less? I'm not sure). if there is a furigana mistake in a sentence, I usually remember it, so I can maneuver around those when studying in anki. but yeah, I see this is a relatively specific need, so I can't assume the majority of tatoeba users will have it as well. I won't be seeing japanese sentences on tatoeba as much as before, thus having less opportunities to notice any problems, but, oh well.

HOWEVER. I really like your idea of introducing a "show autogenerated transcription with a warning label" button. it sounds like a very good compromise between getting rid of wrong transliterations and yet maintaining the possibility to get a transcription for any sentence right away at your own risk. it would also raise the awareness that there are errors by a mile. so:

> What do you think?
I think you should go for it. try the same thing for mandarin, we'll see how it works out.

gillux, 2015-05-17 12:25:35 UTC

Thank you for your positive feedback.

> I see your point, though. for me personally, not having furigana on most japanese sentences would be a bummer because I use tatoeba all the time to check the readings of sentences

That’s why I thought about having an option to display transcriptions by default like it is now. Members would need to opt-in by going to their settings, and this would be a good chance to warn them about untrustworthy transcriptions. I will probably implement this in a later update though, it’s not essential.

Pfirsichbaeumchen, 2015-05-17 12:00:43 UTC (edited 2015-05-17 12:04:45 UTC)

The furigana section takes up a lot of space. A spontaneous suggestion would be to use a smaller font size and perhaps put the furigana in brackets after the kanji, similar to what is shown here: http://prntscr.com/71kvvy.

Will the romanisation be editable as well? There are some strange word separations such as "ni hiki" and "i masu" instead of "nihiki" and "imasu".

gillux, 2015-05-17 12:13:36 UTC

The furigana section takes as much space as now (see the reference #187012), modulo the romaji. This is a somewhat provisional display though, I still don’t really know what to do with the romaji. I like having it a bit hidden like now (only displayed when hovering the mouse), but it’s rather impractical.

Pfirsichbaeumchen, 2015-05-17 12:29:10 UTC (edited 2015-05-18 13:38:45 UTC)

I have the furigana turned off, but I realise it's always been that size. I just thought this would be a good opportunity to suggest using a smaller font size. ☺

gillux, 2015-05-18 04:30:55 UTC (edited 2015-05-18 15:49:55 UTC)

> Will the romanisation be editable as well? There are some strange word separations such as "ni hiki" and "i masu" instead of "nihiki" and "imasu".

You can, but not directly. I don't want to make it fully editable because I feel it would become too inconsistent, and because it is based on the furigana. Instead, if you want to change the romaji word separation, edit the furigana (remove or insert spaces) and the romaji will update accordingly.
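
To illustrate how the romaji spacing could follow the furigana spacing, here is a minimal sketch. The kana table is deliberately tiny and the function name is hypothetical; it is not the converter Tatoeba uses, only a demonstration of the idea that spaces in the kana become spaces in the romaji.

```python
# Deliberately tiny kana table, just enough for the examples in this thread.
KANA_TO_ROMAJI = {
    "い": "i", "ま": "ma", "す": "su",
    "に": "ni", "ひ": "hi", "き": "ki",
}


def romaji_from_furigana(furigana: str) -> str:
    """Romanize each space-separated chunk; kana spacing decides romaji spacing."""
    words = furigana.split()
    return " ".join(
        # Kana missing from the table are passed through unchanged.
        "".join(KANA_TO_ROMAJI.get(kana, kana) for kana in word)
        for word in words
    )


print(romaji_from_furigana("い ます"))  # i masu
print(romaji_from_furigana("います"))   # imasu
print(romaji_from_furigana("にひき"))   # nihiki
```

Because the romaji is derived rather than stored, removing a space in the furigana is enough to turn "i masu" into "imasu".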

I already dealt with the problem you're mentioning (verbs like "i masu" being split) in a recent update, but only for the furigana. It now displays います instead of い ます, and so will the romaji once I'm done with this.

orcrist, 2015-05-31 06:13:52 UTC

Auto-generated phonetic readings of Japanese sentences are dangerous. They will regularly get things wrong and learners won't find out for months or years, if ever.

What do you do then? An honest warning note would be so scary that I think most users would immediately say "No thanks!". Or not, but anyway, the warning note would need to be very clear. "Auto-generated kana is very often incorrect. Consult a fluent Japanese speaker to confirm the correctness of this information." And then people would naturally think, "Yeah, that's why I came to this website in the first place. Because it's supposed to have that kind of user-generated data. Right?"

gillux, 2015-05-31 07:49:57 UTC

> Or not, but anyway, the warning note would need to be very clear. "Auto-generated kana is very often incorrect

It’s not “very often incorrect”. I’d say it’s 70% accurate, or maybe 90% if you ignore obviously incorrect numbers.

What do you think of the current warning note:
> The following transcription has been automatically generated and may contain errors. If you can, you are welcome to review by clicking it.

User55521, 2015-05-17 19:45:13 UTC (edited 2015-05-17 19:45:50 UTC)

I was the one who provided the initial Uzbek code. Please note this is not a transcription, but a transliteration/script conversion. It is quite correct, I've browsed through 20 pages and found only 1 mistake: https://tatoeba.org/eng/sentences/show/3837742 (obviously, WҳацАпп should be WhatsApp). My Uzbek is very limited, but the Latin script was basically created as a one-to-one mapping for Cyrillic, so most problems are with Russian loanwords (and even these are quite predictable).
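
A character-by-character conversion of this kind, and the way it trips over a foreign brand name, can be sketched as below. Everything here is an assumption made for illustration: the direction (Latin to Cyrillic, as the WҳацАпп example suggests), the partial mapping tables and the function name. It is not the code that was actually contributed.

```python
# Partial, illustrative tables; a real converter covers the whole alphabet.
DIGRAPHS = {"sh": "ш", "ch": "ч", "ts": "ц"}   # must be tried before single letters
SINGLE = {"a": "а", "h": "ҳ", "p": "п", "r": "р", "s": "с", "t": "т"}


def to_cyrillic(text: str) -> str:
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            converted, step = DIGRAPHS[pair.lower()], 2
        else:
            # Letters without an entry (such as "W") are left untouched.
            converted, step = SINGLE.get(text[i].lower(), text[i]), 1
        # Carry the case of the first source letter over to the result.
        out.append(converted.upper() if text[i].isupper() else converted)
        i += step
    return "".join(out)


print(to_cyrillic("shahar"))    # шаҳар
print(to_cyrillic("WhatsApp"))  # WҳацАпп
```

The second call reproduces the reported mistake: the mapping has no way of knowing that "WhatsApp" is a name that should be left alone, so a fix would need something like an exception list rather than a change to the letter table.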

As a Cantonese learner, I'm using the Cantonese transcription quite a lot. It does have flaws (hard for me to estimate the percentage, will try to take several pages and count the number), but it is useful for me because I have problems memorising the tones, and I usually can spot most transcription errors. Generating a completely correct transcription is very complex programmatically, probably even more complex than for Standard Written Chinese, because SWC is more codified.

gillux, 2015-05-17 20:35:22 UTC

Thank you for your feedback about Uzbek. The conversion algorithm is indeed very simple and nearly 100% accurate. I don’t think there is any problem keeping all the Uzbek transliterations displayed, like now. I tend to use the word “transcription” for “transliteration” because they are handled very similarly on Tatoeba, but I totally understand the difference.

Your comment about Cantonese transcriptions and pullnosemans’s about Mandarin transcriptions suggest that, despite being inaccurate, these transcriptions are still very useful to you, so I will definitely need to implement the “display everything by default” option before the first release.

Speaking of which, I’d like to have your opinions (especially Pfirsichbaeumchen’s, since you’re an admin) about how to manage transcription editing permissions. I tried to find a balance between keeping it open and preventing editing problems. There are so many transcriptions that it doesn’t make much sense to me to only allow sentence authors to submit transcriptions of their sentences.

For a given sentence in a language for which we allow transcription editing:
• If nobody submitted a transcription, anybody may submit one.
• If you’re the owner of the sentence and someone submitted a transcription, you may overwrite it with your own.
• Once the sentence owner has submitted a transcription for his/her own sentence, only he/she, corpus maintainers and admins may further modify it.
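
The three rules above can be written down as a single check. This is a sketch under stated assumptions: the `Sentence` fields and the function name are hypothetical, corpus maintainers and admins are assumed to be allowed in every case, and since the rules do not say whether a third member may overwrite a non-owner's submission, the sketch conservatively denies it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    owner: str
    # Who submitted the current transcription; None means only the
    # machine-generated one exists.
    transcription_author: Optional[str] = None


def may_edit_transcription(user: str, sentence: Sentence,
                           is_maintainer_or_admin: bool = False) -> bool:
    if is_maintainer_or_admin:
        return True   # assumption: privileged members may always edit
    if sentence.transcription_author is None:
        return True   # rule 1: nobody submitted yet, anybody may
    if user == sentence.owner:
        return True   # rules 2 and 3: the owner may overwrite or re-edit
    return False      # assumption: other members may not overwrite


untouched = Sentence(owner="alice")
by_other = Sentence(owner="alice", transcription_author="bob")
by_owner = Sentence(owner="alice", transcription_author="alice")
```

For example, `may_edit_transcription("carol", untouched)` is true, while `may_edit_transcription("carol", by_owner)` is false unless carol is a corpus maintainer or admin.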

It sounds about right for transcriptions that are well known to natives, such as Uzbek Cyrillic/Latin or Japanese furigana, but what about Pinyin or Jyutping? To what extent are Mandarin and Cantonese speakers comfortable with these? And for Shanghainese, the current transcription is based on IPA, but I believe most learners can’t read it and most natives can’t write it… What do we do with this?

User55521, 2015-05-17 21:03:32 UTC (edited 2015-05-17 21:25:25 UTC)

Most Cantonese native speakers I've talked to can’t use either Jyutping or any other transcription.

In my experience, it's easiest to talk about pronunciation with native speakers by using other characters with the same pronunciation. But I'm not sure how this can be implemented in an intuitive way. Probably, have an on-hover hint for all the syllables?.. :?

The situation with Pinyin seems better. If I'm not mistaken, it is taught at school in Mainland China (but I believe Taiwan uses Zhuyin instead).

tommy_san, 2015-05-18 04:15:26 UTC (edited 2015-05-18 04:15:52 UTC)

A possible scenario:
1. A non-native speaker provides a wrong transcription for a bad sentence.
2. Native speakers don't want to correct the transcription because that might give the wrong impression that it is a good sentence, nor do they want to correct the sentence because that would prompt non-native speakers to add even more bad sentences. As a result, the wrong transcription remains uncorrected.
3. People "learn" from it, believing it to be trustworthy because the transcription isn't machine-generated.

gillux, 2015-05-29 02:53:14 UTC

I added a function to reset a human-submitted transcription to its initial machine-generated state. This way, wrong transcriptions can be deleted without needing to provide a correct transcription.

tommy_san, 2015-05-29 12:27:37 UTC

That doesn't sound very constructive... I wonder if there's not a better way to deal with it, but I can't think of any right now. I hope I won't have to use that function too often.

gillux, 2015-05-29 18:00:50 UTC

Note that I didn’t initially add this function for that purpose. I just needed a way to “remove” a transcription, which is an essential part of the whole thing.

I don’t think there is an easy solution to this, it’s just like trying to deal with people adding incorrect sentences to Tatoeba.

User55521, 2015-05-29 14:41:57 UTC

Will changes to the transcription be logged?

gillux, 2015-05-29 18:01:05 UTC

No, that’s not planned.

User55521, 2015-05-17 19:57:11 UTC

By the way, about the 'Chinese' language. Standard Written Chinese can be read not just with Mandarin readings (which are auto-generated now), but also with Cantonese readings. For example, the sentence #4194512 can be read 'Deoi3 ngo5 ji4jin4 ze3 si6 zung6jiu3 dik1'. The usage of this form of the language is limited (it is used for reading written texts aloud and singing songs, but usually not for day-to-day conversation), but I would find it useful if there was an option to display 'Chinese' texts with Cantonese pronunciation, because it's the pronunciation I'm trying to learn.

Of course, this should ideally be a selectable option (probably selectable in the settings) because most Chinese learners do want to read it with Mandarin pronunciation, not with Cantonese.

nickyeow, 2015-05-29 20:31:25 UTC

Agreed!

On a side note, it would be nice if pronunciations of compounds could be displayed with spaces in between the individual syllables. 'zung6 jiu3' looks much better than 'zung6jiu3' in my opinion.
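
Since every Jyutping syllable ends in a tone digit from 1 to 6, this spacing could be applied purely at display time. A minimal sketch, with a hypothetical function name:

```python
import re


def space_syllables(jyutping: str) -> str:
    """Insert a space wherever a tone digit is directly followed by a letter."""
    return re.sub(r"(?<=[1-6])(?=[A-Za-z])", " ", jyutping)


print(space_syllables("zung6jiu3"))
# zung6 jiu3
print(space_syllables("Deoi3 ngo5 ji4jin4 ze3 si6 zung6jiu3 dik1"))
# Deoi3 ngo5 ji4 jin4 ze3 si6 zung6 jiu3 dik1
```

Doing this as a formatting step would let the stored transcription keep its compound boundaries, so either style could be offered without regenerating anything.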

gillux, 2015-05-30 03:56:04 UTC

I know absolutely nothing about Cantonese, but I thought displaying compounds glued would help reading, just like we usually display Japanese romaji with spaces between each “word”.

nickyeow, 2015-05-29 20:23:38 UTC

Wow, thank you so much for taking on this issue!

I think the autogenerated transcriptions for Cantonese are doing okay. There are mistakes here and there, but most of them are caused by a small set of characters with multiple pronunciations. Many of these can be solved by adding more pronunciations for compounds. I'd say the percentage of sentences with completely correct transliterations is around 90% to 95%.
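
Why extra compound entries help can be seen in a longest-match lookup, sketched below. The dictionaries are tiny and illustrative: only the reading zung6jiu3 for 重要 comes from this thread, the single-character defaults are assumptions, and the function is not the generator Tatoeba actually runs.

```python
# Illustrative entries only; a real dictionary would be far larger.
SINGLE_READINGS = {"重": "cung5", "要": "jiu3", "是": "si6"}
COMPOUND_READINGS = {"重要": "zung6jiu3"}
MAX_COMPOUND_LEN = max(len(word) for word in COMPOUND_READINGS)


def transcribe(text: str) -> str:
    syllables = []
    i = 0
    while i < len(text):
        # Try the longest compound starting at position i first.
        for length in range(min(MAX_COMPOUND_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in COMPOUND_READINGS:
                syllables.append(COMPOUND_READINGS[chunk])
                i += length
                break
        else:
            # No compound matched: fall back to the default single reading.
            char = text[i]
            syllables.append(SINGLE_READINGS.get(char, char))
            i += 1
    return " ".join(syllables)


print(transcribe("是重要"))  # si6 zung6jiu3
```

Without the 重要 entry, the same input would come out as "si6 cung5 jiu3", which is the kind of multiple-reading mistake described above. Context-dependent particles such as 喎 are not helped by this approach, since their reading depends on meaning rather than on neighbouring characters.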

Some characters can be quite tricky though. For instance, the final particle 喎 is pronounced wo3 when it indicates a casual remark, wo4 when it indicates a sort of playful scolding, and wo5 when it is used to quote something undesirable. Apparently, it can also be pronounced kwaa1, waa1, and wo1, although these are extremely rare. To be fair, I only found out about these (I blush to confess) when I looked them up in a dictionary. :p

I think the review system you suggested would be the best way to solve the problem. Perhaps autogenerated transcriptions could still be displayed—they are correct most of the time after all—but it would be nice if reviewed transcriptions could be given a little green tick or something.