
Wall (1 thread)

onurussia
less than an hour ago
Is there a problem? I cannot search :(
Ricardo14
one hour ago
Is anyone else experiencing problems on Tatoeba? I'm not able to post sentences, the home page takes forever to load, and the "Random sentence" is not there. Is this an ongoing upgrade?
Hybrid
one hour ago - one hour ago
+1 I'm getting automatically logged out of Tatoeba very quickly. I've been experiencing this problem for the past few months, but today Tatoeba is almost unusable.
belkacem77
one hour ago
There is an update announced on GitHub.
TRANG
yesterday
Would anyone be interested in running an analysis on the quality of Tatoeba's corpus? The question to answer would be: what percentage of our sentences can be safely considered good/correct?

You could go for an empirical approach, for instance by looking through a random sample of 1000 sentences in your native language, reviewing them one by one and counting how many of them have a mistake (and of course posting a comment when you find mistakes). If, say, you find 50 sentences that have mistakes, that would suggest that potentially 95% of the sentences in that language are good and 5% are bad.
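A rough sketch of the arithmetic behind this, in Python. The sentence ids, the seed and the helper names are made up for illustration, and the Wilson score interval is just one common way to show how much a 1000-sentence sample can be trusted; it is not something the proposal asks for.

```python
import math
import random


def draw_sample(sentence_ids, k, seed=0):
    """Pick k sentence ids at random (fixed seed so the sample can be reproduced)."""
    return random.Random(seed).sample(list(sentence_ids), k)


def wilson_interval(bad, n, z=1.96):
    """Approximate 95% Wilson score interval for the share of bad sentences."""
    p = bad / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Hypothetical id range; real ids would come from the exported data.
sample_ids = draw_sample(range(1, 50001), 1000)

bad, n = 50, len(sample_ids)  # the 50-out-of-1000 example from the post
good_share = 1 - bad / n
low, high = wilson_interval(bad, n)

print(f"good: {good_share:.0%}, bad: {bad / n:.0%}")
print(f"bad share plausibly between {low:.1%} and {high:.1%}")
```

With a sample of this size the estimate is only good to within a point or two, which fits the goal of telling 80% from 90% from 99% rather than getting a precise number.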

Or you could elaborate your own criteria on how to detect that a sentence is good (if it's tagged "OK", if it's rated as "OK", if it has been commented and corrected, if it has audio, etc). Make some script to count the sentences that match your criteria, run it against the data from the Downloads page, and get some numbers out of it.
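Such a script could look roughly like the Python sketch below. The column layouts are assumptions (tab-separated rows of id, language, text for sentences; sentence id, tag name for tags; sentence id first for audio) and should be checked against the actual files on the Downloads page. Only two of the criteria are used here, the "OK" tag and audio; the tiny inline data is made up so the example runs on its own.

```python
import csv
import io


def read_rows(handle):
    """Read tab-separated rows without treating quote characters specially."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def count_good(sentences, tags, audio, lang):
    """Return (good, total) for one language.

    A sentence counts as good if it is tagged "OK" or has audio.
    Assumed columns: sentences = id, lang, text; tags = id, tag; audio = id, ...
    """
    ok_tagged = {row[0] for row in read_rows(tags) if len(row) >= 2 and row[1] == "OK"}
    with_audio = {row[0] for row in read_rows(audio) if row}
    good = total = 0
    for row in read_rows(sentences):
        if len(row) < 3 or row[1] != lang:
            continue
        total += 1
        if row[0] in ok_tagged or row[0] in with_audio:
            good += 1
    return good, total


# Made-up sample data standing in for the downloaded files.
sentences = io.StringIO(
    "1\teng\tTom has an elephant.\n"
    "2\teng\tI know more then you.\n"
    "3\ttur\tMerhaba.\n"
    "4\teng\tWhat time is it?\n"
)
tags = io.StringIO("1\tOK\n3\tOK\n")
audio = io.StringIO("4\tCK\n")

good, total = count_good(sentences, tags, audio, "eng")
print(f"{good}/{total} English sentences match the criteria")
```

With real data you would pass open file handles instead of the `io.StringIO` objects, and add or drop criteria as you see fit.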

These are just some ideas. You can come up with any other approach that makes sense to you.

The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

Either way, I think it's an interesting question a lot of us would be interested to know the answer to, so I'm throwing out the idea and I hope it'll catch the interest of some of you :)
Ricardo14
yesterday
I'm interested.
My approach would be to tag as "OK" the sentences in the languages I speak and ask some other members to do the same.
Then we could compare how many of them were tagged (are good sentences) and how many weren't (*maybe* are not good sentences).
CK
yesterday - yesterday
1. If you assume that all native-speaker sentences are correct and natural-sounding, then based on 2018-11-24 data, it would be at least 78%.

78% (5611756/7139827)
Based on this http://tatoeba.byethost3.com/st...018-11-24.html

However, not all native-speaker sentences are good, since some have typos, or are non-natural-sounding, word-for-word translations.

The number would be higher, though, since some non-native sentences are good, and we have a number of sentences in Esperanto, Latin and other constructed and dead languages.

If you want to get an idea of the number of sentences in each language by native speakers, see http://bit.ly/nativespeakers
(Last updated October 13, 2018)


2. For English, I would say that the quality is at least 61%.

61% (702027/1149697)
Based on https://tatoeba.org/eng/sentenc...s/show/907/und

However, the actual score would be higher.
I don't add near-duplicate sentences to this list.
For example, I added these.
* Tom was desperate for attention.
* That's not going to fly with Tom.
But, I didn't add these.
* Mary was desperate for attention.
* That's not going to fly with Mary.

I don't add old-fashioned and archaic sentences to List 907.
I try not to add sentences that are potentially offensive and not appropriate for all ages and cultures.
And, I don't read very, very long, multi-sentence contributions.


3. 13% (917620/7183699) of our sentences had good ratings in the 2018-12-08 exported data.

Note that some members rate sentences in non-native languages, so perhaps this data can't be trusted too much. Part of this problem, perhaps, is that the rating system is called "collections."
See the number of ratings and which members have rated sentences "good."
http://tatoeba.ueuo.com/2018-12...od-ratings.txt
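For anyone who wants to re-check the three percentages quoted above, here is a small Python sketch that recomputes them from the counts given in the post. Only the counts come from the post; the labels are shorthand.

```python
# (matching sentences, total sentences) as quoted in the post
figures = {
    "native-speaker sentences (2018-11-24 data)": (5611756, 7139827),
    "English sentences on List 907": (702027, 1149697),
    "sentences with good ratings (2018-12-08 data)": (917620, 7183699),
}

# Share of matching sentences for each figure
shares = {label: part / whole for label, (part, whole) in figures.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

The first two figures in the post appear to be rounded down and the third rounded to the nearest whole percent.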
belkacem77
yesterday
For Kabyle, I suggest @Amazigh_Bedar and he can suggest more names to help.
Bedar is a linguist (Kabyle and other Berber languages)
deniko
yesterday - yesterday
I'd say that whoever checks the sentences for their languages shouldn't be an active contributor, or at least not one of the top contributors for their language.

For example, for Ukrainian I would be mostly checking my own sentences. Which is not that bad, it's proofreading, but it always makes more sense to have a fresh look, and it's more efficient.

Also, I think there could be a lot of categories of "good" and "bad" sentences.

For example, depending on the context, I can classify as good either some subset of the list below, or all of them:

1. Sounds like something a native speaker would say, no spelling mistakes, no punctuation mistakes.

2. Sounds like something a native speaker would say, no spelling mistakes, some problems with punctuation (missing comma, missing full stop at the end, unnecessary comma, etc.)

3. Sounds like something a native speaker would say, contains a minor typo. (Tom has an elephannt.)

4. Sounds like something a native speaker would say, contains a spelling mistake typical of native speakers ("I know more then you.", "I would of done it.", "Your my friend.")

5. Sounds awkward, but still acceptable.

6. Sounds like something a native speaker of a different dialect of my language would say.
Aiji
yesterday
I'm part of those who do not believe in "good" sentences. That is a very bad word, impractical to use.

That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?
In 2, if the punctuation is really erroneous, that is not simply a matter of appreciation but a real mistake in the sentence, it needs to be corrected.
Similarly, 3 and 4 need to be corrected.

I think I see your point, but from the experiment point of view related to what TRANG was talking about, they would weigh as "bad" sentences in the balance to extrapolate to the whole corpus (of course, that is not possible, but I think you see what I mean).

As for proofreading, I agree with you. Proofreading is best done with a fresh mind.
CK
yesterday
Also, 5. Sounds awkward, but still acceptable. ....

If by "still acceptable", you mean to include obviously incorrect language use, but still able to communicate the speaker's intention, I would consider these not "good" for the Tatoeba Corpus.

In real life, these are totally acceptable when communicating with friends and maybe in many other situations, but I don't think many of us would think they are appropriate for anyone who wants to use them to study a language.
deniko
yesterday
> If by "still acceptable", you mean to include obviously incorrect language use

I meant "awkward, but something a native speaker might write, and not correct themselves after re-reading it".

It's really difficult for me to come up with English sentences like this because I'm nowhere near native level, but in Ukrainian I've been coming across a lot of examples of such phrases. Basically, "bad style" or something like that.
deniko
yesterday - yesterday
> That being said, following Tatoeba's rules, why would you classify as "good" any of 2, 3, or 4?

Yes, they need to be corrected, but they are still good according to my low standards. My standard is: if it can pass for a sentence written by a sober native speaker who had re-read it twice - it's good enough.

So I just wanted to know what the standards for "good" are. Something that needs correction? Then any single fault makes it "bad". Which is fine by me, "good" and "bad" are too broad of terms.

Basically, I would like to distinguish between something like "How much time?" (intended meaning - "What time is it?") and a genuine typo or even an error that a lot of native speakers make.

But if the idea is to put both "How much time?" and "I'm taller then him." into the same basket because they both need correction, I'd be fine with that as well.
soliloquist
yesterday - yesterday
I have finished checking 1000 random Turkish sentences. The results are as follows.

- Sentences in good condition (no errors & natural-sounding) : 764 (76.4%)

- Sentences owned by non-native speakers: 3 (0.3%)

- Sentences with spelling or punctuation errors (accent letters are not taken into account) : 43 (4.3%)

- Sentences with other grammatical errors: 18 (1.8%)

- Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

- Partly unnatural sentences (excessive pronoun usage, improvable word choices etc. - sentences that don't sound smooth) : 121 (12.1%)

- Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

They are more than 1000 (100%) in total because some of them fall into more than one category.
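The overlap can be checked with a few lines of Python. The counts are the ones listed above; the "lenient" share, which treats only the completely unnatural sentences as bad, is one possible reading and not part of the original tally.

```python
sample_size = 1000

# Category counts as reported for the 1000-sentence Turkish sample
counts = {
    "good (no errors, natural-sounding)": 764,
    "owned by non-native speakers": 3,
    "spelling or punctuation errors": 43,
    "other grammatical errors": 18,
    "translation errors": 37,
    "partly unnatural": 121,
    "completely unnatural": 61,
}

total_labels = sum(counts.values())
extra_labels = total_labels - sample_size  # labels beyond one per sentence

strict_good = counts["good (no errors, natural-sounding)"] / sample_size
lenient_good = (sample_size - counts["completely unnatural"]) / sample_size

print(f"labels assigned: {total_labels} ({extra_labels} more than the sample size)")
print(f"strict good share: {strict_good:.1%}")
print(f"lenient good share: {lenient_good:.1%}")
```

The gap between the strict and lenient shares shows how much the headline number depends on where the line for "good" is drawn.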
deniko
yesterday - yesterday
You've done a great job, and I liked how you created a lot of categories instead of just saying "good" or "not good".

> Sentences with translation errors (mistranslated words, tense errors, pronoun errors etc. ) : 37 (3.7%)

If we started checking this within the task defined by Trang, it would be impossible to complete.

Imagine a native speaker of English trying to determine which sentences are "good", and also check whether all 100 translations into 50 languages of each sentence are correct.

I think most Turkish sentences are linked to English ones, but what if you come across something linked to, say, Ukrainian or Marathi, would you check the translation as well?

From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:

> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)
soliloquist
yesterday
Thank you for your remarks, deniko.

> I think most Turkish sentences are linked to English ones, but what if you come across something linked to, say, Ukrainian or Marathi, would you check the translation as well?

Yes, most of them are directly-linked to English and most of the rest are indirectly-linked to English, so I didn't have much difficulty checking them. Sometimes I looked to Google Translate and Glosbe for sentences other than English to solidify my decisions. If I encountered an isolated unusual pair like Marathi-Turkish I would probably skip it.


> Imagine a native speaker of English trying to determine which sentences are "good", and also check whether all 100 translations into 50 languages of each sentence are correct.

I think English is a special case. Many of the English sentences are original. I don't think evaluating the quality of a corpus with translations linked to it later would be very fair, but I accept that's a logical dilemma. A cooperative and multilingual work is needed here. As Goethe said, let everyone sweep in front of his own door and the whole world will be clean.


> From my point of view, and according to my low standards, I'd say 94% of Turkish sentences are good sentences. This is the only category of "bad" sentences, IMHO:
> Completely unnatural sentences (literal translations, very strange word choices/orders etc. - sentences that a native speaker would never say) : 61 (6.1%)

I beg to differ. Some of those 'partly unnatural' sentences might be acceptable for language books to show/stress some aspects of the language, but from a purely native standpoint, they're not much different from the 'completely unnatural' sentences. In novel translation, for instance, such sentences would irritate readers. From my point of view, a good sentence should be indistinguishable whether it is original or a translation.
CK
yesterday - yesterday
> From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

I agree with this.

If the purpose of the Tatoeba Corpus is, as I believe, to provide people with sentences worth studying to learn a language, then this is something we should all strive for.

My suggested guidelines are as follows.

* Only translate into your native language.

* Only translate things you are 100% sure of.
** If you aren't sure, just skip the sentence.

* Only create natural-sounding sentences.
** Remember that people studying your language will study your sentences.
** Even if you know what it means, but can't make a natural-sounding sentence for the translation, just skip it.

* If you think the sentence is strange, don't translate it.
deniko
yesterday
> From my point of view, a good sentence should be indistinguishable whether it is original or a translation.

Of course it is. I don't think we have ever argued about that.

However, even without translation native speakers quite often produce awkwardly sounding sentences that they can either correct themselves, or leave as is. Not everybody is able to speak and write clearly and smoothly. So I thought you meant your "Partly unnatural sentences" were in that category.

If they clearly sounded like something a native speaker would never say, then it's just an unnatural sentence, in my opinion.
soliloquist
yesterday
> However, even without translation native speakers quite often produce awkwardly sounding sentences that they can either correct themselves, or leave as is.
> So I thought you meant your "Partly unnatural sentences" were in that category.

The difference may seem unclear, but I'd rather call them 'non-standard'. Other than that, we're on the same page, I think.
Aiji
yesterday
That's an interesting discussion, but we can go down forever and ever ^^
We can split "awkward" sentences into correct awkward, academic awkward, regional awkward, etc. With Tatoeba's current system, they probably should be tagged, but not everybody thinks about it, or can. In principle, tags are there to give meta-information and help the person who would not understand why such a sentence is correct.

A simple example: The French « C'est quelle heure ? » is (clearly) grammatically incorrect. However, this is the natural way to ask the time in some regions... Hence, even "natural-sounding" sentences are up for debate. Etc. :)
AlanF_US
yesterday - yesterday
> The goal is really not to get precise measurements but to get a sense of how much of our corpus is actually good: is it 80%, is it 90%, is it 99%? It would be quite interesting to compare between languages as well. Then we could perhaps try to extract some patterns on what helps and what doesn't help in making a good quality corpus.

I would say the following:
(1) the goal of extracting patterns on what helps and what doesn't is more important than trying to come up with a single number
(2) coming up with a number doesn't get us much closer to figuring out what helps and what doesn't
(3) if you have X people doing the counting, you'll have X systems for doing it

Having said that, I would like to toss into the mix my opinions about what makes a sentence valuable to me.

First of all, I want to point people to the wiki page "How To Write Good Sentences" ( https://en.wiki.tatoeba.org/art...ood-sentences# ), which captures my feeling that a good sentence meets the following criteria:
- clear
- self-contained
- likely
- standard dialect
- natural
- unlikely to offend

For my purposes, a really useful sentence meets these criteria and goes even further:
- contains words like "surprised" that are in the second or third tier of frequency, meaning that they're important but neither among the very first words learned (like "water") nor among the rare and quirky words that even native speakers are unlikely to use (like "whortleberry")
- demonstrates the meaning of the rarest/most advanced words in the sentence
- has some interesting characteristic that makes it easy to remember
- is not irritating

My model of an ideal sentence is (1) one that would be found in a good dictionary whose entries contain sample sentences OR (2) one that would be found in a good textbook, like the kind from which I learned languages in the 1980s and 1990s, before the online era but after the age during which the focus was on drilling or lecturing the student. (I've seen textbooks from the 1930s and 1940s with series of sentences like "The first month is January. The second month is February." or "Jane is a diligent student. She always does her homework." I think language instruction improved greatly after that time.) I see "textbook sentences" criticized here from time to time as being unnatural, but I believe that the critics are thinking of textbooks like the kind published in Asia by non-native speakers to teach English. In my experience, many language textbooks are written by people who are good at coming up with fresh, novel, natural sentences that demonstrate language as it is actually used.

Sentences of the form "I am [adjective]. You are [same adjective]. He is [same adjective]." are not particularly helpful to me. They're not interesting, and they don't demonstrate what the adjective means. Tatoeba is different from other crowd-sourced multilingual dictionaries in that it can provide sentences that give people a sense of the flavor of the words they contain. Sentences that match a simple template do not fulfill the site's potential. However, I realize that it's hard work to come up with sentences of the particularly useful kind I'm describing, and I myself spend far more time translating sentences that other people have written than coming up with new ones.
soliloquist
yesterday
The following are numbers of comments per 1000 sentences for the top 10 languages.

- English: 67
- Russian: 82
- Turkish: 24
- Italian: 14
- Esperanto: 180
- German: 177
- French: 143
- Portuguese: 60
- Spanish: 137
- Hungarian: 88

I know this is far from a precise or reliable basis for judgement: comments that report wrong-flag errors or give annotations, along with other factors such as a low number of active speakers, reduce its usefulness. Still, it might give a rough idea of how actively and effectively sentences in these languages are checked and maintained.
CK
one hour ago
If you think that all native speaker contributions can be trusted, then here is a table that will give a minimum score for each language.

http://tatoeba.ueuo.com/stats18...agenative.html
CK
less than a day ago
We now have over 518,000 audio files, up a little over 18,000 since November 12, 2018.

https://tatoeba.org/eng/audio/index

You can find each member's audio list with this search.
The lists with the most-recent changes are at the top.
https://tatoeba.org/eng/sentenc...direction:desc

See the last wall post about audio additions.
https://tatoeba.org/eng/wall/show_message/30710
CK
yesterday
** Note to Spanish-English Advanced Contributors **

I suspect that many of these Spanish sentences with audio can be linked to indirectly-linked English sentences. If you have time, please help by linking those that match in meaning.

https://tatoeba.org/eng/sentenc.../show/6685/und
Ramin88
yesterday
Hello friends!
I am Ramin, and I am looking for a native English speaker.
soliloquist
yesterday
Hi, Ramin.

Tatoeba doesn't work like HiNative or Lang-8. However, you can contribute by translating sentences into Persian or by creating original Persian sentences.

Here are the English sentences that are not translated into Persian.

https://tatoeba.org/eng/sentenc...o=&sort=random

I see you set Turkey as your country on your profile. If you're living in Turkey, you can also try translating Turkish sentences into Persian. For neighboring countries, we have very few linked Turkish-Persian sentences here.

https://tatoeba.org/eng/sentenc...io=&sort=words
Amastan
yesterday
Tamazight/Berber language resources

Dictionnaire français-tamazight de génie électrique
Mohand Mahrazi (University of Bejaia, Algeria)

Google books:
http://bit.ly/2Qa9dbe

French-Tamazight Dictionary of Electrical Engineering

This book will be very helpful in the translation of many scientific sentences into Tamazight. This is one of the best reference dictionaries for technical terms.
Ricardo14
yesterday
Happy Friday!
AlanF_US
yesterday
Same to you, Ricardo!
Ricardo14
yesterday
Thanks, my friend!
sharptoothed
yesterday
** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/
Aiji
yesterday
Thanks, thanks
Guybrush88
yesterday
thanks :-)
Ricardo14
yesterday
Thank you very much!
miketheknight
yesterday
I think there's something off with your Top 10 Most Active Languages Graph.

https://tatoeba.j-langtools.com/graphs.html

The first one.

Javanese: 294 sentences in total, +2 sentences last week.

Indonesian doesn't seem to be very active either.

Berber, on the other hand, has +4,800 sentences last week, and it's not on that graph.
sharptoothed
yesterday
Sorry for the late response. All stats and graphs are based upon weekly Tatoeba database dumps. The last dump was performed at 9:00 (GMT) on December 1st, so the "December" columns on the Top 10 Most Active Languages Graph show only the few sentences added up to that time.
miketheknight
yesterday
I see, thanks for the explanation.

This also probably explains those plateaus here:

https://i.imgur.com/4NQSD8S.png

If 9 hours of the first day of December represent the whole month of December on the graphs, it does make sense. It looks as if everyone almost stopped contributing in December compared to the previous months.
sharptoothed
yesterday
That's correct. Consider the data for the current month as an approximation. :-)
CK
yesterday
Daily Contribution Stats All On One Page

http://tatoeba.ueuo.com/timeline/

See how many sentences were contributed each day since the beginning of the project.