tommy_san, 24 September 2015 at 22:57:01 UTC, edited 25 September 2015 at 01:44:30 UTC

** Should we have transcriptions on tatoeba.org? **

There's a discussion going on on GitHub about transcriptions[1]. I'm opening this thread to continue it because I think it deserves wider attention.

As I said before[2], I see tatoeba.org as a place to contribute and I think it shouldn't be too nice to language learners. It's competent translators (and example sentence writers) that we want to attract, not language learners. Of course, any translator is a learner at the same time, but not all learners are good translators.

We sometimes get bad translations of Japanese sentences by users who don't even know hiragana. This is obviously because we provide Romanization of Japanese sentences (which we've decided to do away with[3]). So what would happen if there were transcriptions in many more languages?

[1] https://github.com/Tatoeba/tatoeba2/issues/280
[2] https://tatoeba.org/wall/show_m...#message_21488
[3] https://tatoeba.org/wall/show_m...#message_22899

oyd11, 25 September 2015 at 00:35:37 UTC

Surely it deserves attention. You're discussing several topics; the one deserving the most attention is the project definition of Tatoeba.

If it's a "workplace" site for professional translators, I, for one, am out, since I'm in no way a translator. I'd like to extend the analogy with Wikipedia I was making before. It seems to be like the "Nupedia" vs. "Wikipedia" models that were presented as alternatives in the years 2000-2003
[ see: https://en.wikipedia.org/wiki/Nupedia, https://en.wikipedia.org/wiki/History_of_Wikipedia ]

If I get your flow, you are suggesting a more peer-reviewed attitude where most contributors are professional translators, rather than just native speakers or native speakers/language learners interested in a specific niche-vocabulary or whatever.

So firstly, as a user who was not involved in any of the internal 'dev' discussions till now, I'm really surprised by this attitude / project definition. It takes me by surprise and is not consistent with the way Tatoeba was presented publicly (videos, main-page description, Wikipedia article, etc.) - how common is this attitude you're suggesting here ? And if it is the main attitude of the project, I seriously suggest that: a. it should be made more public; b. an alternative project aimed at the general language learner be opened.

Further - I don't see how this specifically links to transliteration availability. Transliterations are commonly used as an aid by professional translators when they need to cross-validate things in languages they do not speak (the only time I did professional translation, I had to check the original Bible phrase "eye for an eye" in Greek, checking for the exact grammatical case used; luckily it was easy to find grammar tables transliterated into Latin script. Not that reading the Greek alphabet is at all hard, it's just an example).

The second issue you mention is "bad translations of Japanese sentences". As I do not speak nor read any Japanese - I fail to see how this is related to the existence of transliterations, I tried to follow the thread you link under [3], but it's really long and convoluted, is it common knowledge linking these "bad translations" to the existence of Transcription ?

As I mention in thread [1] - the whole discussion should not confuse transcriptions with transliterations, here's a link clarifying the difference and explaining some needs of transliteration, which is a totally reversible process that deals with graphemes
[ http://www.translitteration.com...literation/en/ ]

I suggest you deal with your bad-translator individually, or is it a repeating issue with many individuals?

Japanese, I suspect, is not very representative of the common situation where transliteration is really needed or highly useful, as you're suggesting there is a high correlation between level of language fluency and literacy, which is hardly the case with most languages. Also, Japanese has no close sister languages which use different alphabets.
I guess it's still useful, but I think that's for the Japanese speaking or learning community to decide.

As my thread started: for Ethiopian languages, transliteration is of high importance (a high number of native speakers who are not literate in the national script, huge numbers of speakers of sister languages with different scripts, a relatively difficult script to master, and a high number of speakers who already use Latin transliteration online or in text messaging).

Also, the situation with most languages, which might be the opposite of that of Japanese, is that the first task is gathering a speech community into the project, rather than keeping up with the flood of translations to filter that you're describing for Japanese.

In summary, I suggest we: a. keep language-specific issues specific; b. keep social issues social and not technical; c. clarify publicly what should be contributed to Tatoeba and by whom. Right now I don't see any official page supporting the project goal or attitude you've presented.
E.g. the English Wikipedia article about Tatoeba starts

" Tatoeba.org is a free collaborative online database of example sentences geared towards *foreign language learners.* "

That doesn't correlate with quote " it shouldn't be too nice to language learners."

On Tatoeba's application to Mozilla Drumbeat, I see
" it’s easier for amateurs to translate sentences rather than words, and thus makes building a database accessible for more people. (Alan Simon)"
[https://henrikmoltke.wordpress....2/30/tatoeba/]
Implying that at the time, Tatoeba contributors were assumed to be amateurs rather than professional translators as you suggest. Nor is anything like that implied in any introduction to the site ( eg. [ http://en.wiki.tatoeba.org/arti...ow/quick-start ] )

So again, if Tatoeba - is not geared towards being nice and designed specifically for language learners who want to also (possibly) contribute content - this would be a very good time to state it, that was the whole idea that attracted me to the project initially, and I believe it's brilliant and it's gonna be big(ger, by scales).
If this is not the stated goal, it would make sense to start a sister project with the original stated goal of Tatoeba, same as Wikipedia followed Nupedia.

-K.

AlanF_US, 25 September 2015 at 02:51:03 UTC

> ... how common is this attitude [tommy_san is] suggesting here ?

I've never heard anyone else say anything similar to the assertion that Tatoeba "shouldn't be too nice to language learners". So I think the answer is "not common at all".

For the sake of accuracy, I do think it's worth pointing out that tommy_san never mentioned the word "professional". He used the adjective "good". But I think the setting up of a dichotomy between "good translators" and "language learners" is problematic, even if one goes on to say that the former can include the latter.

AlanF_US, 25 September 2015 at 02:42:52 UTC

> We sometimes get bad translations of Japanese sentences by users who don't even know hiragana. This is obviously because we provide Romanization of Japanese sentences (which we've decided to do away with[3]).

The connection is certainly not obvious to me.

CK, 25 September 2015 at 03:09:19 UTC, edited 30 October 2019 at 10:30:36 UTC

[not needed anymore- removed by CK]

oyd11, 25 September 2015 at 08:52:17 UTC

@ck, it seems to me you guys are facing a social issue in the Japanese corpus. It might or might not be related to the existence or nature of the transcription or transliteration provided by the system.

It certainly has little to do with the technical issues on GitHub, where I was discussing with Trang how to include transliteration for Ge'ez-based Ethiopian languages, which have a very different social and technological context from Japanese (Tigrinya, Amharic, Tigre and liturgical Ge'ez).

In a lot of languages "low level" language learners are familiar with the script, eg, all languages written in the Latin alphabet. That does not make them write random stuff nor contribute translations.

My suggestion, within the context of the non-Latin-based languages which I know, is:
1. not to include *transcriptions* [ eg "Au revoir" (fr) => "oh ruh-VWAHR" ("pseudo-english") ]
2. to include *transliterations* [ again, if you want the distinction clarified, see: http://www.translitteration.com...literation/en/ ]
(i.e., provide the user with graphic symbols they can make sense of; do not spell out, add, or remove any phonetic rules not specified in the original text)
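
To make the "reversible, grapheme-level" idea concrete, here is a minimal sketch. The table is a small hypothetical subset chosen for illustration, not any official standard and not Tatoeba's actual routine; the only point is that a strictly one-to-one table can be inverted, so the original text can be recovered.

```python
# Illustrative one-to-one table (hypothetical subset, lowercase only;
# not an official standard and not Tatoeba's implementation).
CYR_TO_LAT = {
    "а": "a", "в": "v", "г": "g", "е": "e", "ж": "ž", "н": "n",
    "о": "o", "п": "p", "с": "s", "т": "t", "х": "h", "ц": "c",
    "ч": "č", "ь": "\u02b9",  # U+02B9 MODIFIER LETTER PRIME for the soft sign
}


def invert(table):
    """Build the reverse table, refusing tables that are not one-to-one."""
    inverse = {latin: cyr for cyr, latin in table.items()}
    if len(inverse) != len(table):
        raise ValueError("table is not one-to-one, so it cannot be reversed")
    return inverse


def transliterate(text, table):
    """Replace each character via the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


LAT_TO_CYR = invert(CYR_TO_LAT)

original = "он очень похож на своего отца"
latin = transliterate(original, CYR_TO_LAT)
restored = transliterate(latin, LAT_TO_CYR)
print(latin)
print(restored == original)
```

Because every source letter maps to exactly one target letter, no phonetic rule is added or removed. Note that the round trip is only guaranteed for text made of letters in the table plus characters (spaces, punctuation) that do not also appear as table outputs.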

This suggestion is language specific, my scope is limited, I think different languages deserve specific attention here, and that's where you guys come in.
If there is something misleading, confusing or suboptimal within the processing and presentation of Japanese in Tatoeba, discuss it specifically.

For example, among the language communities I'm active in which are not "Latin-script based", I see different issues that require different solutions for Arabic/Hebrew, Russian and Tigrinya. These technically different writing systems (abjad vs. alphabet vs. syllabary) require different solutions on the "social" level as well (e.g., how should abjads be transliterated, if at all? Should they be transliterated between one another?)




27 September 2015 at 20:57:16 UTC, edited 30 September 2015 at 04:00:34 UTC

The content of this message goes against our rules and has therefore been hidden. It is shown only to administrators and to the author of the message.

sharptoothed, 25 September 2015 at 08:24:40 UTC

> I think it shouldn't be too nice to language learners

I think, we should distinguish two types of learners: those who use Tatoeba to learn languages and to improve or extend their knowledge but avoid contributing translations they are not certain of (or avoid contributing translations at all), and those who confuse Tatoeba with sites like Lang-8 and similar or consider Tatoeba a good place to raise their self-esteem, etc. Tatoeba can't be too friendly for the former, I believe. The more tools and additional information on sentences Tatoeba provides, the better. As for the latter, I don't really think that lack of those tools and info would stop much of them. They still have Google Translate and stuff, after all.

By the way, I'm pretty sure that all Russian language learners here, no matter how proficient they are, would be happy to have, for example, a phonetic transcription for every Russian sentence or at least to see stress marks placed on every word. :-)

oyd11, 25 September 2015 at 09:42:02 UTC

I agree. However phonetic transcription and stress marks - are extra information which is not trivial to machine-generate, whereas transliterations are. (That was the discussion around my adding a Tigrinya/Ge'ez transliteration routine, which is 1-1, and converting the Georgian transliteration to follow a standard, ISO 9984.)

As for confusing Tatoeba with a Lang-8-like site: this is a highly social issue and should be dealt with on a user's first contribution(s).

sharptoothed, 25 September 2015 at 12:51:50 UTC

> transcription and stress marks - are extra information which is not trivial to machine-generate

Indeed, and this is true for Japanese furigana as well. So, it would be ideal if we could edit this data just like we edit sentences.

oyd11, 25 September 2015 at 13:43:09 UTC

I don't know much about Japanese - but for Russian - it would be for sure a separate and independent feature from Transliteration.

Please do not mix the issues of Transliterations and Transcriptions (whether or not it phonetic ones)

I would guess transliteration of Russian would be most useful for speakers of other Slavonic languages using a Latin script (e.g., Polish, Slovene, Czech, etc.), where a lot of the time the Russian can serve as a cross-reference for another translation, etc.

I even see a case where in countries with two co-official scripts (Digraphia) - there should be an option to even write using one of them - and get both orthographies visible provided it's technically possible
[ a live example is the Serbian Wikipedia :: page is being rendered in either Latin or Cyrillic regardless of how it was written
https://sr.wikipedia.org/sr-el/
https://sr.wikipedia.org/sr-ec/ ]

Similarly - while not "total digraphia" - in Tatarstan Tatar - Latin was made co-official with Cyrillic in 1999, later "cancelled" in 2004. However we could offer totally automatic transliteration, as some native speakers clearly prefer it.
Mongolian is a similar case (with Cyrillic and "Ethnic Mongolian" script)

And there certainly is digraphia in contemporary Berber (Latin and Tifinagh), and native speakers have strong preferences here (the Algerian Berbers currently on Tatoeba use Latin, while Tifinagh is official in Morocco, and a lot of Berber activists in Morocco urge for Latin to be made official due to low literacy rates in Tifinagh).

sharptoothed, 25 September 2015 at 14:33:40 UTC

> Please do not mix the issues of Transliterations and Transcriptions (whether or not it phonetic ones)

Of course these are separate things and they are to be dealt with separately. I just want to say that different issues arise for different languages: while learners of one language want to see a transliteration, learners of another would prefer seeing a transcription or other info. Some time ago, if my memory doesn't fail me, there were discussions about the possibility of adding various metadata to sentences. I wonder if there are any plans to implement this feature one day.

tommy_san, 26 September 2015 at 02:41:15 UTC

> I think, we should distinguish two types of learners

You're right. I was thinking of this comment by Pfirsichbaeumchen that we play the role of teachers rather than students on Tatoeba (https://tatoeba.org/sentences/s...mment-428296). I hope the primary motivation of most contributors is to share their knowledge with the rest of the world. If the primary interest was to learn (to get taught and corrected), they usually wouldn't get what they expect.

> As for the latter, I don't really think that lack of those tools and info would stop much of them.

You're probably right here, too, considering for example Korean sentences on Tatoeba, which are said to be of horrible quality even though there are no transcriptions. Actually, I guess most of the bad Japanese sentences and bad translations of Japanese sentences are added by those who know some kanji and overestimate themselves.


(Let me write everything here because I don't want to mess up the Wall by posting too many replies.)

I think the classification I wrote here (https://tatoeba.org/wall/show_m...message_21480) would be useful for the current discussion.

(Original) Он очень похож на своего отца.
(1) On ochen' pohozh na svoego otca.
(2) ohn OH-cheen' pah-KHOZH nuh svuh-ee-VOH aht-TSAH.
(3) Он о́чень похо́ж на своего́ отца́.
(4) [on ˈot͡ɕɪnʲ pɐˈxoʐ nə svəjɪˈvo ɐtˈt͡sa]
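
As an illustration of how mechanical type (1) is, the sketch below reproduces line (1) from the original sentence with a plain lookup table. The table covers only the letters in this sentence and is inferred from the example itself, so treat it as an assumption rather than as the scheme any tool actually uses.

```python
# Letter table inferred from example (1) above; it covers only the letters
# that occur in this one sentence (an assumption, not a full scheme).
SCHEME_1 = {
    "а": "a", "в": "v", "г": "g", "е": "e", "ж": "zh", "н": "n",
    "о": "o", "п": "p", "с": "s", "т": "t", "х": "h", "ц": "c",
    "ч": "ch", "ь": "'",
}


def romanize(text):
    """Map each Cyrillic letter through SCHEME_1, keeping capitalization."""
    out = []
    for ch in text:
        latin = SCHEME_1.get(ch.lower(), ch)  # unknown characters pass through
        if ch.isupper():
            latin = latin.capitalize()
        out.append(latin)
    return "".join(out)


print(romanize("Он очень похож на своего отца."))
# Digraphs make this scheme ambiguous in reverse: "ч" and "цх" collide.
print(romanize("ч") == romanize("цх"))
```

The second line printed shows one caveat: because this scheme writes some letters as digraphs, two different Cyrillic strings can yield the same Latin string, so it is not strictly reversible without extra rules.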

I agree with oyd11 that when more than one script is used (more or less) officially, we should display transliterations of the type (1). Members should be allowed to contribute using the script they prefer.

I'm not sure if it's a good idea to display Romanization like this for any language that doesn't use the Latin alphabet. For example, I feel this transliteration for the Russian sentence isn't very useful to anyone, since no one would be able to read it properly unless they know Russian. If we decide to display transliterations for some languages and not for some others, what would be the criterion to divide them?

(2) is actually much more useful, but it's not suitable for an international project.

As sharptoothed says, I pretty much like pronunciation aids like (3). They cannot be generated by a machine alone, which means it's worth providing them manually. I believe they belong to the kind of data we want to collect on Tatoeba. When the system for editable furigana (https://tatoeba.org/wall/show_m...message_22870) is completed, we could consider applying it to other languages as well.


I admit it's rather hard for me to understand that a competent speaker is not necessarily literate. We do want to welcome those who can speak and translate well enough even if they're not literate in the language, but it's not that easy since our project is based on written sentences (of spoken and written language). I'm not really sure to what extent transliterations would help them. Would you suggest, for example, that we should let Amharic speakers who don't know the Ge'ez script add sentences using the Latin alphabet?

Ooneykcall, 27 September 2015 at 10:01:44 UTC

Stress input would be very nice, indeed, but no mandatory automatically generated stress input. That would kill the stress variety... like adding automatic IPA would have, since people do not, should not and do not have to speak always the same way. Plus it wouldn't reflect stress fluctuating.
Manual input would definitely be awesome if implemented in a convenient fashion, such as being able to mark a vowel as 'stressed' easily.
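
A rough sketch of what "marking a vowel as stressed" could look like under the hood, assuming stress is stored as a combining acute accent (U+0301) placed after the vowel. The function names and the 1-based vowel numbering are hypothetical choices for illustration, not an existing Tatoeba feature.

```python
COMBINING_ACUTE = "\u0301"  # U+0301 COMBINING ACUTE ACCENT
RUSSIAN_VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")


def mark_stress(word, vowel_number):
    """Insert a combining acute after the vowel_number-th vowel (1-based)."""
    seen = 0
    for i, ch in enumerate(word):
        if ch in RUSSIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                return word[: i + 1] + COMBINING_ACUTE + word[i + 1 :]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")


def strip_stress(text):
    """Remove manually added stress marks, recovering the plain sentence."""
    return text.replace(COMBINING_ACUTE, "")


stressed = mark_stress("своего", 3)
print(stressed)
print(strip_stress(stressed) == "своего")
```

Keeping the mark as a separate combining character means the plain sentence stays recoverable, and a contributor can leave a word unmarked where the stress varies.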

TRANG, 26 September 2015 at 23:52:20 UTC

I've created this issue: https://github.com/Tatoeba/tatoeba2/issues/784

This was requested before, but there was no issue for it yet.