Tatoeba
TRANG, 7 December 2018 at 22:56:51 UTC

Would anyone be interested in running an analysis of the quality of Tatoeba's corpus? The question to answer would be: what percentage of our sentences can safely be considered good/correct?

You could go for an empirical approach, for instance by looking through a random sample of 1000 sentences in your native language, reviewing them one by one and counting how many of them have a mistake (and of course posting a comment when you find mistakes). Say you find 50 sentences that have mistakes: that would suggest that roughly 95% of the sentences in that language are good and 5% are bad.
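To give a feel for how far a 1000-sentence sample can be trusted, here is a minimal sketch that puts a 95% Wilson score interval around the 50-in-1000 example. The function name and the choice of the Wilson interval are assumptions made for illustration; nothing in this thread prescribes a particular method.

```python
import math


def wilson_interval(bad, n, z=1.96):
    """Return (low, high) bounds on the true error rate.

    bad: number of sentences with a mistake in the sample
    n:   sample size
    z:   normal quantile (1.96 corresponds to about 95% confidence)
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = bad / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# The example from the post: 50 sentences with mistakes out of 1000 reviewed.
low, high = wilson_interval(50, 1000)
print(f"observed error rate: {50 / 1000:.1%}")
print(f"95% interval: {low:.1%} to {high:.1%}")
```

For 50 mistakes in 1000 sentences the interval comes out at roughly 4% to 6.5%, which is precise enough to tell "about 95% good" apart from 80% or 99%, provided the sample really is random.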

Or you could define your own criteria for detecting that a sentence is good (if it's tagged "OK", if it's rated as "OK", if it has been commented on and corrected, if it has audio, etc.). Write a script to count the sentences that match your criteria, run it against the data from the Downloads page, and get some numbers out of it.
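As a rough sketch of such a script, the snippet below counts, per language, how many sentences carry an "OK" tag. It assumes tab-separated exports with columns (sentence id, language, text) for sentences and (sentence id, tag name) for tags; the file names in the comment, the helper names and the "needs check" tag in the demo data are hypothetical, so check the actual files on the Downloads page before relying on it.

```python
import csv
from collections import Counter


def read_tsv(path):
    """Yield rows from a tab-separated file without interpreting quote characters."""
    with open(path, encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def count_ok_share(sentence_rows, tag_rows, ok_tag="OK"):
    """Return {language: (sentences tagged ok_tag, total sentences)}.

    sentence_rows: iterable of [sentence_id, language, text] (assumed layout)
    tag_rows:      iterable of [sentence_id, tag_name] (assumed layout)
    """
    ok_ids = {row[0] for row in tag_rows if len(row) >= 2 and row[1] == ok_tag}
    total = Counter()
    ok = Counter()
    for row in sentence_rows:
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, lang = row[0], row[1]
        total[lang] += 1
        if sentence_id in ok_ids:
            ok[lang] += 1
    return {lang: (ok[lang], total[lang]) for lang in total}


# Tiny in-memory demo with made-up rows; with real data one would pass
# read_tsv("sentences.csv") and read_tsv("tags.csv") instead (hypothetical paths).
sentences = [
    ["1", "eng", "Tom has an elephant."],
    ["2", "eng", "I know more then you."],
    ["3", "tur", "Merhaba."],
]
tags = [["1", "OK"], ["3", "OK"], ["2", "needs check"]]

for lang, (ok_count, total_count) in count_ok_share(sentences, tags).items():
    print(f"{lang}: {ok_count}/{total_count} tagged OK ({ok_count / total_count:.0%})")
```

A missing "OK" tag only means nobody has reviewed the sentence yet, so a count like this gives a lower bound on the share of good sentences rather than an estimate of it.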

These are just some ideas. You can come up with any other approach that makes sense to you.

The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

Either way, I think it's an interesting question that a lot of us would like to know the answer to, so I'm throwing out the idea and I hope it'll catch the interest of some of you :)

Ricardo14, 8 December 2018 at 00:35:41 UTC

I'm interested.
My approach would be: tag as "OK" the sentences in the languages I speak, and ask some other members to do the same.
Then we could compare how many of them were tagged (are good sentences) and how many weren't (*maybe* are not good sentences).

belkacem77, 8 December 2018 at 20:59:59 UTC

For Kabyle, I suggest @Amazigh_Bedar and he can suggest more names to help.
Bedar is a linguist (Kabyle and other Berber languages).

deniko, 8 December 2018 at 21:59:11 UTC (edited 8 December 2018 at 22:00:49 UTC)

I'd say that whoever checks the sentences for their languages shouldn't be an active contributor, or at least not one of the top contributors for their language.

For example, for Ukrainian I would be mostly checking my own sentences. Which is not that bad, it's proofreading, but it always makes more sense to have a fresh look, and it's more efficient.

Also, I think there could be a lot of categories of "good" and "bad" sentences.

For example, depending on the context, I can classify as good either some subset of the list below, or all of them:

1. Sounds like something a native speaker would say, no spelling mistakes, no punctuation mistakes.

2. Sounds like something a native speaker would say, no spelling mistakes, some problems with punctuation (missing comma, missing full stop at the end, unnecessary comma, etc.)

3. Sounds like something a native speaker would say, contains a minor typo. (Tom has an elephannt.)

4. Sounds like something a native speaker would say, contains a spelling mistake typical of native speakers ("I know more then you.", "I would of done it.", "Your my friend.")

5. Sounds awkward, but still acceptable.

6. Sounds like something a native speaker of a different dialect of my language would say.

Aiji, 9 December 2018 at 06:19:17 UTC

I'm one of those who do not believe in "good" sentences. That is a very bad word, impractical to use.

That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?
In 2, if the punctuation is really erroneous, that is, not simply a matter of taste but a real mistake in the sentence, it needs to be corrected.
Similarly, 3 and 4 need to be corrected.

I think I see your point, but from the point of view of the experiment TRANG was talking about, they would weigh as "bad" sentences in the balance when extrapolating to the whole corpus (of course, that is not possible, but I think you see what I mean).

As for proofreading, I agree with you. Proofreading is best done with a fresh pair of eyes.

CK, 10 December 2018 at 01:01:48 UTC (edited 31 October 2019 at 03:23:11 UTC)

[not needed anymore- removed by CK]

deniko, 10 December 2018 at 09:13:06 UTC

> If by "still acceptable", you mean to include obviously incorrect language use

I meant "awkward, but something a native speaker might write, and not correct themselves after re-reading it".

It's really difficult for me to come up with English sentences like this because I'm nowhere near native level, but in Ukrainian I've been coming across a lot of examples of such phrases. Basically, "bad style" or something like that.

deniko, 10 December 2018 at 09:11:00 UTC (edited 10 December 2018 at 09:13:38 UTC)

> That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?

Yes, they need to be corrected, but they are still good according to my low standards. My standard is: if it can pass for a sentence written by a sober native speaker who had re-read it twice - it's good enough.

So I just wanted to know what the standards for "good" are. Something that needs correction? Then any single fault makes it "bad". Which is fine by me, "good" and "bad" are too broad of terms.

Basically, I would like to distinguish between something like "How much time?" (intended meaning: "What time is it?") and a genuine typo or even an error that a lot of native speakers make.

But if the idea is to put both "How much time?" and "I'm taller then him." into the same basket because they both need correction, I'd be fine with that as well.

soliloquist, 9 December 2018 at 19:29:56 UTC (edited 9 December 2018 at 20:09:00 UTC)

I have finished checking 1000 random Turkish sentences. The results are as follows.

- Sentences in good condition (no errors & natural-sounding): 764 (76.4%)

- Sentences owned by non-native speakers: 3 (0.3%)

- Sentences with spelling or punctuation errors (accented letters are not taken into account): 43 (4.3%)

- Sentences with other grammatical errors: 18 (1.8%)

- Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc.): 37 (3.7%)

- Partly unnatural sentences (excessive pronoun usage, improvable word choices etc.; sentences that don't sound smooth): 121 (12.1%)

- Completely unnatural sentences (literal translations, very strange word choices/orders etc.; sentences that a native speaker would never say): 61 (6.1%)

The counts add up to more than 1000 (100%) because some sentences fall into more than one category.
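The arithmetic behind that remark can be checked with a few lines. This is only a sketch: the short category labels are paraphrased, and it assumes every sentence in the sample was placed in at least one category, which the post does not state explicitly.

```python
# Counts reported above for the 1000-sentence Turkish sample.
sample_size = 1000
category_counts = {
    "good": 764,
    "owned by non-native speakers": 3,
    "spelling or punctuation errors": 43,
    "other grammatical errors": 18,
    "translation errors": 37,
    "partly unnatural": 121,
    "completely unnatural": 61,
}

# Each sentence contributes one assignment per category it falls into.
total_assignments = sum(category_counts.values())
extra_assignments = total_assignments - sample_size

print(f"category assignments: {total_assignments}")
print(f"sum of percentages: {total_assignments / sample_size:.1%}")
print(f"assignments beyond one per sentence: {extra_assignments}")
```

Under that assumption, the 47 extra assignments mean that at most 47 of the 1000 sentences were counted in more than one category.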

deniko, 10 December 2018 at 09:19:41 UTC (edited 10 December 2018 at 09:20:01 UTC)

You've done a great job, and I liked how you created a lot of categories instead of just saying "good" or "not good".

> Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

If we started checking this within the task defined by Trang, it would be impossible to complete.

Imagine a native speaker of English trying to determine which sentences are "good", and also checking whether all 100 translations of each sentence into 50 languages are correct.

I think most Turkish sentences are linked to English ones, but what if you come across something linked to, say, Ukrainian or Marathi? Would you check the translation as well?

From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:

> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

soliloquist, 10 December 2018 at 19:35:22 UTC

Thank you for your remarks, deniko.

> I think most of Turkish sentences are linked to English ones, but what if you come across something linked to say Ukrainian or Marathi, would you check the translation as well?

Yes, most of them are directly linked to English and most of the rest are indirectly linked to English, so I didn't have much difficulty checking them. Sometimes I consulted Google Translate and Glosbe for sentences in languages other than English to firm up my decisions. If I encountered an isolated unusual pair like Marathi-Turkish, I would probably skip it.


> Imagine a native speaker of English trying to determine which sentences are "good", and also check whether all 100 translations into 50 languages of each sentence is correct.

I think English is a special case. Many of the English sentences are original. I don't think evaluating the quality of a corpus with translations linked to it later would be very fair, but I accept that's a logical dilemma. A cooperative and multilingual work is needed here. As Goethe said, let everyone sweep in front of his own door and the whole world will be clean.


> From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:
> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

I beg to differ. Some of those 'partly unnatural' sentences might be acceptable for language books to show/stress some aspects of the language, but from a purely native standpoint, they're not much different from the 'completely unnatural' sentences. In novel translation, for instance, such sentences would irritate readers. From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

deniko, 11 December 2018 at 09:29:32 UTC

> From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

Of course it is. I don't think we have ever argued about that.

However, even without translation, native speakers quite often produce awkward-sounding sentences that they can either correct themselves or leave as is. Not everybody is able to speak and write clearly and smoothly. So I thought you meant your "Partly unnatural sentences" were in that category.

If they clearly sounded like something a native speaker would never say, then it's just an unnatural sentence, in my opinion.

soliloquist, 11 December 2018 at 19:36:25 UTC

> However, even without translation native speakers quite often produce awkwardly sounding sentences that they can either correct themselves, or leave as is.
> So I thought you meant your "Partly unnatural sentences" were in that category.

The difference may seem unclear, but I'd rather call them 'non-standard'. Other than that, we're on the same page, I think.

Aiji, 12 December 2018 at 00:13:32 UTC

That's an interesting discussion, but we can go down forever and ever ^^
We can split "awkward" sentences into correct awkward, academic awkward, regional awkward, etc. With Tatoeba's current system, they should probably be tagged, but not everybody thinks of it, or can, and in principle tags are there to give meta-information and help someone who would not otherwise understand why such a sentence is correct.

A simple example: the French « C'est quelle heure ? » is (clearly) grammatically incorrect. However, this is the natural way to ask the time in some regions... Hence, even "natural-sounding" sentences are up for debate. Etc. :)

AlanF_US, 12 December 2018 at 04:54:27 UTC (edited 12 December 2018 at 12:49:07 UTC)

> The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

I would say the following:
(1) the goal of extracting patterns on what helps and what doesn't is more important than trying to come up with a single number
(2) coming up with a number doesn't get us much closer to figuring out what helps and what doesn't
(3) if you have X people doing the counting, you'll have X systems for doing it

Having said that, I would like to toss into the mix my opinions about what makes a sentence valuable to me.

First of all, I want to point people to the wiki page "How To Write Good Sentences" ( https://en.wiki.tatoeba.org/art...ood-sentences# ), which captures my feeling that a good sentence meets the following criteria:
- clear
- self-contained
- likely
- standard dialect
- natural
- unlikely to offend

For my purposes, a really useful sentence meets these criteria and goes even further:
- contains words like "surprised" that are in the second or third tier of frequency, meaning that they're important but neither among the very first words learned (like "water") nor among the rare and quirky words that even native speakers are unlikely to use (like "whortleberry")
- demonstrates the meaning of the rarest/most advanced words in the sentence
- has some interesting characteristic that makes it easy to remember
- is not irritating

My model of an ideal sentence is (1) one that would be found in a good dictionary whose entries contain sample sentences OR (2) one that would be found in a good textbook, like the kind from which I learned languages in the 1980s and 1990s, before the online era but after the age during which the focus was on drilling or lecturing the student. (I've seen textbooks from the 1930s and 1940s with series of sentences like "The first month is January. The second month is February." or "Jane is a diligent student. She always does her homework." I think language instruction improved greatly after that time.) I see "textbook sentences" criticized here from time to time as being unnatural, but I believe that the critics are thinking of textbooks like the kind published in Asia by non-native speakers to teach English. In my experience, many language textbooks are written by people who are good at coming up with fresh, novel, natural sentences that demonstrate language as it is actually used.

Sentences of the form "I am [adjective]. You are [same adjective]. He is [same adjective]." are not particularly helpful to me. They're not interesting, and they don't demonstrate what the adjective means. Tatoeba is different from other crowd-sourced multilingual dictionaries in that it can provide sentences that give people a sense of the flavor of the words they contain. Sentences that match a simple template do not fulfill the site's potential. However, I realize that it's hard work to come up with sentences of the particularly useful kind I'm describing, and I myself spend far more time translating sentences that other people have written than coming up with new ones.

soliloquist, 12 December 2018 at 20:45:55 UTC

The following are numbers of comments per 1000 sentences for the top 10 languages.

- English: 67
- Russian: 82
- Turkish: 24
- Italian: 14
- Esperanto: 180
- German: 177
- French: 143
- Portuguese: 60
- Spanish: 137
- Hungarian: 88

I know this is far from a precise or reliable basis for judgement: comments that report wrong-flag errors or add annotations, and other factors such as a low number of active speakers, reduce its usefulness. Still, it might give a rough idea of how actively and effectively sentences in these languages are checked and maintained.