TRANG, 7 December 2018, 22:56:51 UTC

Would anyone be interested in running an analysis of the quality of Tatoeba's corpus? The question to answer would be: what percentage of our sentences can safely be considered good/correct?

You could go for an empirical approach, for instance by looking through a random sample of 1000 sentences in your native language, reviewing them one by one, and counting how many of them have a mistake (and of course posting a comment when you find mistakes). Say you find 50 sentences that have mistakes: that would suggest that roughly 95% of the sentences in that language are good and 5% are bad.
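A sample of 1000 only gives an estimate, so it can help to attach a margin of error to the 5% figure. The sketch below is one possible way to do that, using the Wilson score interval; the function name `wilson_interval` and the 95% level (`z = 1.96`) are choices made for this illustration, not something prescribed in this thread. It assumes the sample really is random.

```python
import math

def wilson_interval(bad, n, z=1.96):
    """Approximate 95% Wilson score interval for the share of bad sentences."""
    p = bad / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# The example from the post: 50 sentences with mistakes out of 1000 reviewed.
low, high = wilson_interval(bad=50, n=1000)
print(f"share of bad sentences: {low:.1%} to {high:.1%}")
```

With 50 mistakes in 1000 sentences, the interval comes out at roughly 4% to 6.5% bad, which is precise enough to tell "about 95% good" apart from 80% or 99%.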

Or you could work out your own criteria for detecting that a sentence is good (if it's tagged "OK", if it's rated as "OK", if it has been commented on and corrected, if it has audio, etc.). Write a script to count the sentences that match your criteria, run it against the data from the Downloads page, and get some numbers out of it.
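As a rough sketch of what such a script might look like, the example below counts, per language, the share of sentences that carry an "OK" tag. It assumes the exported files are tab-separated, with sentences as (id, language, text) and tags as (sentence id, tag name); check the actual layout on the Downloads page before relying on it. The sample data and the function name `ok_share_by_language` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up stand-ins for the exported files (assumed tab-separated layout).
SENTENCES_TSV = "1\teng\tTom has an elephant.\n2\teng\tHow much time?\n3\ttur\tMerhaba.\n"
TAGS_TSV = "1\tOK\n3\tOK\n2\tsome-other-tag\n"

def ok_share_by_language(sentences_file, tags_file, ok_tag="OK"):
    """Return {language: share of its sentences that carry ok_tag}."""
    # Collect the ids of all sentences tagged with ok_tag.
    ok_ids = {
        row[0]
        for row in csv.reader(tags_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        if len(row) >= 2 and row[1] == ok_tag
    }
    total, ok = Counter(), Counter()
    for row in csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed lines
        sentence_id, lang = row[0], row[1]
        total[lang] += 1
        if sentence_id in ok_ids:
            ok[lang] += 1
    return {lang: ok[lang] / total[lang] for lang in total}

shares = ok_share_by_language(io.StringIO(SENTENCES_TSV), io.StringIO(TAGS_TSV))
print(shares)
```

For real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8", newline="")`. Other criteria (ratings, audio, corrected comments) could be added as further id sets in the same way.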

These are just some ideas. You can come up with any other approach that makes sense to you.

The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

Either way, I think it's a question a lot of us would like to know the answer to, so I'm throwing the idea out there and I hope it'll catch the interest of some of you :)

Ricardo14, 8 December 2018, 00:35:41 UTC

I'm interested.
My approach would be to tag as "OK" the sentences in the languages I speak and ask some other members to do the same.
Then we could compare how many of them were tagged (good sentences) and how many weren't (*maybe* not good sentences).

belkacem77, 8 December 2018, 20:59:59 UTC

For Kabyle, I suggest @Amazigh_Bedar, and he can suggest more names to help.
Bedar is a linguist (Kabyle and other Berber languages).

deniko, 8 December 2018, 21:59:11 UTC (edited 8 December 2018, 22:00:49 UTC)

I'd say that whoever checks the sentences for their languages shouldn't be an active contributor, or at least not one of the top contributors for their language.

For example, for Ukrainian I would mostly be checking my own sentences. That's not so bad, since it's proofreading, but it always makes more sense to have a fresh pair of eyes, and it's more efficient.

Also, I think there could be a lot of categories of "good" and "bad" sentences.

For example, depending on the context, I can classify as good either some subset of the list below, or all of them:

1. Sounds like something a native speaker would say, no spelling mistakes, no punctuation mistakes.

2. Sounds like something a native speaker would say, no spelling mistakes, some problems with punctuation (missing comma, missing full stop at the end, unnecessary comma, etc.)

3. Sounds like something a native speaker would say, contains a minor typo. (Tom has an elephannt.)

4. Sounds like something a native speaker would say, contains a spelling mistake typical of native speakers ("I know more then you.", "I would of done it.", "Your my friend.")

5. Sounds awkward, but still acceptable.

6. Sounds like something a native speaker of a different dialect of my language would say.

Aiji, 9 December 2018, 06:19:17 UTC

I'm one of those who do not believe in "good" sentences. It's a very bad word, impractical to use.

That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?
In 2, if the punctuation is really erroneous, that is not simply a matter of taste but a real mistake in the sentence, and it needs to be corrected.
Similarly, 3 and 4 need to be corrected.

I think I see your point, but from the point of view of the experiment TRANG was talking about, they would count as "bad" sentences when extrapolating to the whole corpus (of course, that is not really possible, but I think you see what I mean).

As for proofreading, I agree with you. Proofreading is best done with a fresh mind.

CK, 10 December 2018, 01:01:48 UTC (edited 31 October 2019, 03:23:11 UTC)

[not needed anymore- removed by CK]

deniko, 10 December 2018, 09:13:06 UTC

> If by "still acceptable", you mean to include obviously incorrect language use

I meant "awkward, but something a native speaker might write, and not correct themselves after re-reading it".

It's really difficult for me to come up with English sentences like this because I'm nowhere near native level, but in Ukrainian I've come across a lot of examples of such phrases. Basically, "bad style" or something like that.

deniko, 10 December 2018, 09:11:00 UTC (edited 10 December 2018, 09:13:38 UTC)

> That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?

Yes, they need to be corrected, but they are still good according to my low standards. My standard is: if it can pass for a sentence written by a sober native speaker who has re-read it twice, it's good enough.

So I just wanted to know what the standards for "good" are. Is anything that needs correction bad? Then any single fault makes a sentence "bad". Which is fine by me; "good" and "bad" are too broad as terms.

Basically, I would like to distinguish between something like "How much time?" (intended meaning: "What time is it?") and a genuine typo or even an error that a lot of native speakers make.

But if the idea is to put both "How much time?" and "I'm taller then him." into the same basket because they both need correction, I'd be fine with that as well.

soliloquist, 9 December 2018, 19:29:56 UTC (edited 9 December 2018, 20:09:00 UTC)

I have finished checking 1000 random Turkish sentences. The results are as follows.

- Sentences in good condition (no errors & natural-sounding): 764 (76.4%)

- Sentences owned by non-native speakers: 3 (0.3%)

- Sentences with spelling or punctuation errors (accent letters are not taken into account): 43 (4.3%)

- Sentences with other grammatical errors: 18 (1.8%)

- Sentences with translation errors (mistranslated words, tense errors, pronoun errors, etc.): 37 (3.7%)

- Partly unnatural sentences (excessive pronoun usage, improvable word choices, etc.; sentences that don't sound smooth): 121 (12.1%)

- Completely unnatural sentences (literal translations, very strange word choices/orders, etc.; sentences that a native speaker would never say): 61 (6.1%)

The counts add up to more than 1000 (100%) in total because some sentences fall into more than one category.
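The size of that overlap can be read off the figures above. The short sketch below does the arithmetic, assuming (this is not stated in the post) that every sentence outside the "good condition" group received at least one of the six other labels, and that "owned by non-native speakers" counts as one of those labels.

```python
SAMPLE_SIZE = 1000
GOOD = 764
PROBLEM_COUNTS = {
    "non-native owner": 3,
    "spelling or punctuation": 43,
    "other grammar": 18,
    "translation error": 37,
    "partly unnatural": 121,
    "completely unnatural": 61,
}

flagged_sentences = SAMPLE_SIZE - GOOD         # sentences with at least one problem
problem_labels = sum(PROBLEM_COUNTS.values())  # labels handed out, overlaps included
extra_labels = problem_labels - flagged_sentences

print(flagged_sentences, problem_labels, extra_labels)  # 236 283 47
```

Under those assumptions, 236 sentences share 283 problem labels, so 47 labels are "extra", meaning they went to sentences that were already counted in another category.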

deniko, 10 December 2018, 09:19:41 UTC (edited 10 December 2018, 09:20:01 UTC)

You've done a great job, and I liked how you created a lot of categories instead of just saying "good" or "not good".

> Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

If we started checking this within the task defined by Trang, it would be impossible to complete.

Imagine a native speaker of English trying to determine which sentences are "good", and also checking whether all 100 translations of each sentence into 50 languages are correct.

I think most Turkish sentences are linked to English ones, but what if you come across something linked to, say, Ukrainian or Marathi? Would you check the translation as well?

From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:

> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)
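The gap between the strict and the lenient reading of the same sample is easy to make explicit. This is only an illustrative calculation on the Turkish figures quoted in this thread; the variable names are invented here.

```python
SAMPLE_SIZE = 1000
NO_ISSUES = 764            # "good condition" count from the Turkish sample
COMPLETELY_UNNATURAL = 61  # the only category treated as bad under the lenient reading

strict_good = NO_ISSUES / SAMPLE_SIZE
lenient_good = (SAMPLE_SIZE - COMPLETELY_UNNATURAL) / SAMPLE_SIZE

print(f"strict: {strict_good:.1%}, lenient: {lenient_good:.1%}")
```

The same 1000 sentences give about 76% good under the strict standard and about 94% under the lenient one, which is why the definition of "good" matters more than the counting itself.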

soliloquist, 10 December 2018, 19:35:22 UTC

Thank you for your remarks, deniko.

> I think most Turkish sentences are linked to English ones, but what if you come across something linked to, say, Ukrainian or Marathi? Would you check the translation as well?

Yes, most of them are directly linked to English and most of the rest are indirectly linked to English, so I didn't have much difficulty checking them. Sometimes I consulted Google Translate and Glosbe for sentences in languages other than English to firm up my decisions. If I encountered an isolated unusual pair like Marathi-Turkish, I would probably skip it.


> Imagine a native speaker of English trying to determine which sentences are "good", and also checking whether all 100 translations of each sentence into 50 languages are correct.

I think English is a special case. Many of the English sentences are original. I don't think it would be very fair to judge the quality of a corpus by translations that were linked to it later, but I accept that it's a logical dilemma. Cooperative, multilingual work is needed here. As Goethe said, let everyone sweep in front of his own door and the whole world will be clean.


> From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:
> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

I beg to differ. Some of those 'partly unnatural' sentences might be acceptable for language books to show/stress some aspects of the language, but from a purely native standpoint, they're not much different from the 'completely unnatural' sentences. In novel translation, for instance, such sentences would irritate readers. From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

deniko, 11 December 2018, 09:29:32 UTC

> From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

Of course it is. I don't think we have ever argued about that.

However, even without translation, native speakers quite often produce awkward-sounding sentences that they can either correct themselves or leave as is. Not everybody is able to speak and write clearly and smoothly. So I thought you meant your "Partly unnatural sentences" were in that category.

If they clearly sounded like something a native speaker would never say, then they're just unnatural sentences, in my opinion.

soliloquist, 11 December 2018, 19:36:25 UTC

> However, even without translation, native speakers quite often produce awkward-sounding sentences that they can either correct themselves or leave as is.
> So I thought you meant your "Partly unnatural sentences" were in that category.

The difference may seem unclear, but I'd rather call them 'non-standard'. Other than that, we're on the same page, I think.

Aiji, 12 December 2018, 00:13:32 UTC

That's an interesting discussion, but we could go on forever and ever ^^
We can split "awkward" sentences into correct awkward, academic awkward, regional awkward, etc. With Tatoeba's current system, they probably should be tagged, but not everybody thinks of it, or is able to, and ultimately tags are there to give meta-information and help the person who would not understand why such a sentence is correct.

A simple example: the French « C'est quelle heure ? » is (clearly) grammatically incorrect. However, this is the natural way to ask the time in some regions... Hence, even "natural-sounding" sentences are up for debate. Etc. :)

AlanF_US, 12 December 2018, 04:54:27 UTC (edited 12 December 2018, 12:49:07 UTC)

> The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

I would say the following:
(1) the goal of extracting patterns on what helps and what doesn't is more important than trying to come up with a single number
(2) coming up with a number doesn't get us much closer to figuring out what helps and what doesn't
(3) if you have X people doing the counting, you'll have X systems for doing it

Having said that, I would like to toss into the mix my opinions about what makes a sentence valuable to me.

First of all, I want to point people to the wiki page "How To Write Good Sentences" ( https://en.wiki.tatoeba.org/art...ood-sentences# ), which captures my feeling that a good sentence meets the following criteria:
- clear
- self-contained
- likely
- standard dialect
- natural
- unlikely to offend

For my purposes, a really useful sentence meets these criteria and goes even further:
- contains words like "surprised" that are in the second or third tier of frequency, meaning that they're important but neither among the very first words learned (like "water") nor among the rare and quirky words that even native speakers are unlikely to use (like "whortleberry")
- demonstrates the meaning of the rarest/most advanced words in the sentence
- has some interesting characteristic that makes it easy to remember
- is not irritating

My model of an ideal sentence is (1) one that would be found in a good dictionary whose entries contain sample sentences OR (2) one that would be found in a good textbook, like the kind from which I learned languages in the 1980s and 1990s, before the online era but after the age during which the focus was on drilling or lecturing the student. (I've seen textbooks from the 1930s and 1940s with series of sentences like "The first month is January. The second month is February." or "Jane is a diligent student. She always does her homework." I think language instruction improved greatly after that time.) I see "textbook sentences" criticized here from time to time as being unnatural, but I believe that the critics are thinking of textbooks like the kind published in Asia by non-native speakers to teach English. In my experience, many language textbooks are written by people who are good at coming up with fresh, novel, natural sentences that demonstrate language as it is actually used.

Sentences of the form "I am [adjective]. You are [same adjective]. He is [same adjective]." are not particularly helpful to me. They're not interesting, and they don't demonstrate what the adjective means. Tatoeba is different from other crowd-sourced multilingual dictionaries in that it can provide sentences that give people a sense of the flavor of the words they contain. Sentences that match a simple template do not fulfill the site's potential. However, I realize that it's hard work to come up with sentences of the particularly useful kind I'm describing, and I myself spend far more time translating sentences that other people have written than coming up with new ones.

soliloquist, 12 December 2018, 20:45:55 UTC

The following are the numbers of comments per 1000 sentences for the top 10 languages.

- English: 67
- Russian: 82
- Turkish: 24
- Italian: 14
- Esperanto: 180
- German: 177
- French: 143
- Portuguese: 60
- Spanish: 137
- Hungarian: 88

I know this is far from a precise or reliable basis for judgement: comments that report wrong-flag errors or give annotations, and other factors such as a low number of active speakers, make it less meaningful. Still, it might give a rough idea of how actively and effectively sentences in these languages are checked and maintained.
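For anyone who wants to reproduce or extend these figures for other languages, here is a minimal sketch of the normalisation and a ranking of the numbers listed above. The helper name `comments_per_thousand` is invented for this example, and the raw comment and sentence counts would have to come from the exported data.

```python
def comments_per_thousand(comment_count, sentence_count):
    """Normalise a raw comment count to comments per 1000 sentences."""
    return 1000 * comment_count / sentence_count

# Rates as listed in this post (comments per 1000 sentences).
RATES = {
    "English": 67, "Russian": 82, "Turkish": 24, "Italian": 14,
    "Esperanto": 180, "German": 177, "French": 143,
    "Portuguese": 60, "Spanish": 137, "Hungarian": 88,
}

# Rank languages from most to least commented.
ranked = sorted(RATES.items(), key=lambda item: item[1], reverse=True)
for language, rate in ranked:
    print(f"{language:<11}{rate:>4}")
```

Sorted this way, Esperanto, German, French and Spanish sit well above 100 comments per 1000 sentences, while Turkish and Italian are at the bottom of the list.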