TRANG, 7 December 2018, 10:56:51 PM UTC

Would anyone be interested in running an analysis of the quality of Tatoeba's corpus? The question to answer would be: what percentage of our sentences can safely be considered good/correct?

You could take an empirical approach, for instance by looking through a random sample of 1000 sentences in your native language, reviewing them one by one and counting how many of them have a mistake (and of course posting a comment when you find one). Say you find 50 sentences with mistakes: that would suggest that roughly 95% of the sentences in that language are good and 5% are bad.
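As a rough illustration of the sampling idea, here is a minimal Python sketch. The function names `draw_sample` and `estimate_error_rate` are made up for this example, and the interval uses a plain normal approximation, which is only one of several reasonable choices.

```python
import math
import random


def draw_sample(sentence_ids, k, seed=None):
    """Pick k distinct sentence ids uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(list(sentence_ids), k)


def estimate_error_rate(errors_found, sample_size, z=1.96):
    """Return (rate, low, high): the observed error rate and an approximate
    95% interval from the normal approximation to the binomial."""
    rate = errors_found / sample_size
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, max(0.0, rate - margin), min(1.0, rate + margin)


# The example from the post: 50 faulty sentences in a sample of 1000.
rate, low, high = estimate_error_rate(50, 1000)
print(f"error rate {rate:.1%}, roughly between {low:.1%} and {high:.1%}")
```

With a sample of 1000, the interval is only a few percentage points wide, which fits the stated goal of telling 80% from 90% from 99% rather than measuring precisely.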

Or you could work out your own criteria for detecting that a sentence is good (if it's tagged "OK", if it's rated as "OK", if it has been commented on and corrected, if it has audio, etc.). Write a script to count the sentences that match your criteria, run it against the data from the Downloads page, and get some numbers out of it.
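A script along those lines might look like the sketch below. It rests on assumptions that should be checked against the actual exports: that the files are tab-separated, that a sentences file has the id, language code and text in its first three columns, that a tags file has the sentence id and tag name in its first two columns, and that an audio file has the sentence id in its first column. The criteria (tagged "OK" or has audio) are just two of the ones listed above, and the function names are hypothetical.

```python
import csv


def read_tsv(lines):
    """Parse tab-separated lines (an open file or a list of strings) into rows."""
    return list(csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE))


def share_meeting_criteria(sentence_rows, tag_rows, audio_rows, lang):
    """Count sentences in `lang` that are tagged "OK" or have audio.

    Assumed column layouts (not verified against the real exports):
      sentence_rows: [id, lang, text]
      tag_rows:      [sentence_id, tag_name]
      audio_rows:    [sentence_id, ...]
    Returns (matched, total, share).
    """
    ok_ids = {row[0] for row in tag_rows if row[1] == "OK"}
    audio_ids = {row[0] for row in audio_rows}
    ids = [row[0] for row in sentence_rows if row[1] == lang]
    if not ids:
        return 0, 0, 0.0
    matched = sum(1 for sid in ids if sid in ok_ids or sid in audio_ids)
    return matched, len(ids), matched / len(ids)
```

Note that a count like this only says how many sentences show a positive signal. A sentence without a tag or audio is unreviewed, not necessarily bad.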

These are just some ideas. You can come up with any other approach that makes sense to you.

The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

Either way, I think it's a question a lot of us would like to know the answer to, so I'm throwing out the idea and I hope it'll catch the interest of some of you :)

Ricardo14, 8 December 2018, 12:35:41 AM UTC

I'm interested.
My approach would be to tag as "OK" the sentences in the languages I speak and ask some other members to do the same.
Then we could compare how many sentences were tagged (good sentences) and how many weren't (*maybe* not good sentences).

belkacem77, 8 December 2018, 8:59:59 PM UTC

For Kabyle, I suggest @Amazigh_Bedar, and he can suggest more names to help.
Bedar is a linguist (Kabyle and other Berber languages).

deniko, 8 December 2018, 9:59:11 PM UTC (edited 8 December 2018, 10:00:49 PM UTC)

I'd say that whoever checks the sentences for their languages shouldn't be an active contributor, or at least not one of the top contributors for their language.

For example, for Ukrainian I would mostly be checking my own sentences. That's not so bad, since it's proofreading, but it always makes more sense to have a fresh look, and it's more efficient.

Also, I think there could be a lot of categories of "good" and "bad" sentences.

For example, depending on the context, I can classify as good either some subset of the list below, or all of them:

1. Sounds like something a native speaker would say, no spelling mistakes, no punctuation mistakes.

2. Sounds like something a native speaker would say, no spelling mistakes, some problems with punctuation (missing comma, missing full stop at the end, unnecessary comma, etc.)

3. Sounds like something a native speaker would say, contains a minor typo. (Tom has an elephannt.)

4. Sounds like something a native speaker would say, contains a spelling mistake typical of native speakers ("I know more then you.", "I would of done it.", "Your my friend.")

5. Sounds awkward, but still acceptable.

6. Sounds like something a native speaker of a different dialect of my language would say.

Aiji, 9 December 2018, 6:19:17 AM UTC

I'm part of those who do not believe in "good" sentences. That is a very bad word, impractical to use.

That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?
In 2, if the punctuation is really erroneous, that is not simply a matter of appreciation but a real mistake in the sentence, it needs to be corrected.
Similarly, 3 and 4 need to be corrected.

I think I see your point, but from the experiment point of view related to what TRANG was talking about, they would weigh as "bad" sentences in the balance to extrapolate to the whole corpus (of course, that is not possible, but I think you see what I mean).

As for proofreading, I agree with you. Proofreading is best done with a fresh mind.

CK, 10 December 2018, 1:01:48 AM UTC (edited 31 October 2019, 3:23:11 AM UTC)

[not needed anymore- removed by CK]

deniko, 10 December 2018, 9:13:06 AM UTC

> If by "still acceptable", you mean to include obviously incorrect language use

I meant "awkward, but something a native speaker might write, and not correct themselves after re-reading it".

It's really difficult for me to come up with English sentences like this because I'm nowhere near native level, but in Ukrainian I've been coming across a lot of examples of such phrases. Basically, "bad style" or something like that.

deniko, 10 December 2018, 9:11:00 AM UTC (edited 10 December 2018, 9:13:38 AM UTC)

> That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?

Yes, they need to be corrected, but they are still good according to my low standards. My standard is: if it can pass for a sentence written by a sober native speaker who had re-read it twice - it's good enough.

So I just wanted to know what the standards for "good" are. Something that needs correction? Then any single fault makes it "bad". Which is fine by me, "good" and "bad" are too broad of terms.

Basically, I would like to distinguish between something like "How much time?" (intended meaning - "What time is it?") and a genuine typo or even an error that a lot of native speakers make.

But if the idea is to put both "How much time?" and "I'm taller then him." into the same basket because they both need correction, I'd be fine with that as well.

soliloquist, 9 December 2018, 7:29:56 PM UTC (edited 9 December 2018, 8:09:00 PM UTC)

I have finished checking 1000 random Turkish sentences. The results are as follows.

- Sentences in good condition (no errors & natural-sounding) : 764 (76.4%)

- Sentences owned by non-native speakers: 3 (0.3%)

- Sentences with spelling or punctuation errors (accent letters are not taken into account) : 43 (4.3%)

- Sentences with other grammatical errors: 18 (1.8%)

- Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

- Partly unnatural sentences (excessive pronoun usage, improvable word choices etc. - sentences that don't sound smooth) : 121 (12.1%)

- Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

The counts add up to more than 1000 (100%) because some sentences fall into more than one category.
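To make the overlap concrete, the tallies above can be added up in a few lines of Python. This is only a sketch of the arithmetic: the short category labels are paraphrases, and the reading of the surplus as "extra category assignments" assumes that every one of the 1000 sentences was placed in at least one category.

```python
SAMPLE_SIZE = 1000

# Counts copied from the list above; labels shortened for readability.
tallies = {
    "good condition": 764,
    "owned by non-native speakers": 3,
    "spelling or punctuation errors": 43,
    "other grammatical errors": 18,
    "translation errors": 37,
    "partly unnatural": 121,
    "completely unnatural": 61,
}


def summarize(tallies, sample_size):
    """Return (total assignments, surplus over the sample size, percentages)."""
    total = sum(tallies.values())
    surplus = total - sample_size
    percentages = {
        name: round(100 * count / sample_size, 1)
        for name, count in tallies.items()
    }
    return total, surplus, percentages


total, surplus, percentages = summarize(tallies, SAMPLE_SIZE)
print(total, surplus)  # 1047 category assignments, 47 more than the sample size
```

So the categories sum to 1047, or 104.7% of the sample, and the 47 surplus assignments come from sentences counted under two or more headings.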

deniko, 10 December 2018, 9:19:41 AM UTC (edited 10 December 2018, 9:20:01 AM UTC)

You've done a great job, and I liked how you created a lot of categories instead of just saying "good" or "not good".

> Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

If we started checking this within the task defined by Trang, it would be impossible to complete.

Imagine a native speaker of English trying to determine which sentences are "good", and also check whether all 100 translations into 50 languages of each sentence is correct.

I think most of Turkish sentences are linked to English ones, but what if you come across something linked to say Ukrainian or Marathi, would you check the translation as well?

From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:

> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

soliloquist, 10 December 2018, 7:35:22 PM UTC

Thank you for your remarks, deniko.

> I think most of Turkish sentences are linked to English ones, but what if you come across something linked to say Ukrainian or Marathi, would you check the translation as well?

Yes, most of them are directly linked to English and most of the rest are indirectly linked to English, so I didn't have much difficulty checking them. For sentences linked to languages other than English, I sometimes consulted Google Translate and Glosbe to firm up my decisions. If I encountered an isolated unusual pair like Marathi-Turkish, I would probably skip it.


> Imagine a native speaker of English trying to determine which sentences are "good", and also check whether all 100 translations into 50 languages of each sentence is correct.

I think English is a special case. Many of the English sentences are original. I don't think evaluating the quality of a corpus with translations linked to it later would be very fair, but I accept that's a logical dilemma. A cooperative and multilingual work is needed here. As Goethe said, let everyone sweep in front of his own door and the whole world will be clean.


> From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:
> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

I beg to differ. Some of those 'partly unnatural' sentences might be acceptable for language books to show/stress some aspects of the language, but from a purely native standpoint, they're not much different from the 'completely unnatural' sentences. In novel translation, for instance, such sentences would irritate readers. From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

deniko, 11 December 2018, 9:29:32 AM UTC

> From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

Of course it is. I don't think we have ever argued about that.

However, even without translation native speakers quite often produce awkwardly sounding sentences that they can either correct themselves, or leave as is. Not everybody is able to speak and write clearly and smoothly. So I thought you meant your "Partly unnatural sentences" were in that category.

If they clearly sounded like something a native speaker would never say, then it's just an unnatural sentence, in my opinion.

soliloquist, 11 December 2018, 7:36:25 PM UTC

> However, even without translation native speakers quite often produce awkwardly sounding sentences that they can either correct themselves, or leave as is.
> So I thought you meant your "Partly unnatural sentences" were in that category.

The difference may seem unclear, but I'd rather call them 'non-standard'. Other than that, we're on the same page, I think.

Aiji, 12 December 2018, 12:13:32 AM UTC

That's an interesting discussion, but we can go down forever and ever ^^
We can split "awkward" sentences into correct awkward, academic awkward, regional awkward, etc. With Tatoeba's current system, they should probably be tagged, but not everybody thinks of it, or is able to. Ideally, tags are there to give meta-information and help the person who would not otherwise understand why such a sentence is correct.

A simple example: The French « C'est quelle heure ? » is (clearly) grammatically incorrect. However, this is the natural way to ask the time in some regions... Hence, even "natural-sounding" sentences are up for debate. Etc. :)

AlanF_US, 12 December 2018, 4:54:27 AM UTC (edited 12 December 2018, 12:49:07 PM UTC)

> The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

I would say the following:
(1) the goal of extracting patterns on what helps and what doesn't is more important than trying to come up with a single number
(2) coming up with a number doesn't get us much closer to figuring out what helps and what doesn't
(3) if you have X people doing the counting, you'll have X systems for doing it

Having said that, I would like to toss into the mix my opinions about what makes a sentence valuable to me.

First of all, I want to point people to the wiki page "How To Write Good Sentences" ( https://en.wiki.tatoeba.org/art...ood-sentences# ), which captures my feeling that a good sentence meets the following criteria:
- clear
- self-contained
- likely
- standard dialect
- natural
- unlikely to offend

For my purposes, a really useful sentence meets these criteria and goes even further:
- contains words like "surprised" that are in the second or third tier of frequency, meaning that they're important but neither among the very first words learned (like "water") nor among the rare and quirky words that even native speakers are unlikely to use (like "whortleberry")
- demonstrates the meaning of the rarest/most advanced words in the sentence
- has some interesting characteristic that makes it easy to remember
- is not irritating

My model of an ideal sentence is (1) one that would be found in a good dictionary whose entries contain sample sentences OR (2) one that would be found in a good textbook, like the kind from which I learned languages in the 1980s and 1990s, before the online era but after the age during which the focus was on drilling or lecturing the student. (I've seen textbooks from the 1930s and 1940s with series of sentences like "The first month is January. The second month is February." or "Jane is a diligent student. She always does her homework." I think language instruction improved greatly after that time.) I see "textbook sentences" criticized here from time to time as being unnatural, but I believe that the critics are thinking of textbooks like the kind published in Asia by non-native speakers to teach English. In my experience, many language textbooks are written by people who are good at coming up with fresh, novel, natural sentences that demonstrate language as it is actually used.

Sentences of the form "I am [adjective]. You are [same adjective]. He is [same adjective]." are not particularly helpful to me. They're not interesting, and they don't demonstrate what the adjective means. Tatoeba is different from other crowd-sourced multilingual dictionaries in that it can provide sentences that give people a sense of the flavor of the words they contain. Sentences that match a simple template do not fulfill the site's potential. However, I realize that it's hard work to come up with sentences of the particularly useful kind I'm describing, and I myself spend far more time translating sentences that other people have written than coming up with new ones.

soliloquist, 12 December 2018, 8:45:55 PM UTC

The following are the numbers of comments per 1000 sentences for the top 10 languages.

- English: 67
- Russian: 82
- Turkish: 24
- Italian: 14
- Esperanto: 180
- German: 177
- French: 143
- Portuguese: 60
- Spanish: 137
- Hungarian: 88

I know it's far from a precise or reliable basis for judgement: comments that report wrong-flag errors or give annotations, and other factors such as a low number of active speakers, make it less meaningful. Still, it might give a rough idea of how actively and effectively sentences in these languages are checked and maintained.
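For anyone who wants to reproduce or extend this kind of figure, here is a small Python sketch. The rates are the ones listed above. The helper `per_thousand` shows one plausible way such a rate could be derived from raw counts, which is an assumption, since the post does not say how the numbers were computed.

```python
# Comments per 1000 sentences, as listed above.
comments_per_1000 = {
    "English": 67,
    "Russian": 82,
    "Turkish": 24,
    "Italian": 14,
    "Esperanto": 180,
    "German": 177,
    "French": 143,
    "Portuguese": 60,
    "Spanish": 137,
    "Hungarian": 88,
}


def per_thousand(comment_count, sentence_count):
    """Comments per 1000 sentences from raw counts (assumed definition)."""
    return 1000 * comment_count / sentence_count


def rank_by_rate(rates):
    """Return (language, rate) pairs, most-commented language first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


for language, rate in rank_by_rate(comments_per_1000):
    print(f"{language}: {rate}")
```

Sorted this way, Esperanto, German, French and Spanish sit well above the rest, while Turkish and Italian have the fewest comments per sentence.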