The number of monthly sentence owners fell from 350-400 between 2012 and 2016 to 250-300 between 2017 and 2023. I don't have the details by language.
Trang, thank you for reopening the debate on this important issue.
My view on this has evolved slightly. I now think it would be simpler and more understandable to also include derived sentences in this rate limit of 3,000 sentences per language per month. Sentence counts would be reset at the beginning of each month. Once the limit has been reached, the user would not be allowed to add any more sentences until the following month. I prefer a monthly rate limit because it doesn't penalise users who don't contribute every day or every week.
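To make the proposal concrete, here is a minimal sketch of a monthly per-language counter. The class name, method name, and in-memory storage are hypothetical and are not part of Tatoeba's code; only the 3,000 figure and the monthly reset come from the proposal above.

```python
from collections import defaultdict
from datetime import date

MONTHLY_LIMIT = 3000  # sentences per language per month, as proposed above


class MonthlyRateLimiter:
    """Hypothetical in-memory counter keyed by (user, language, year, month)."""

    def __init__(self, limit=MONTHLY_LIMIT):
        self.limit = limit
        self.counts = defaultdict(int)

    def try_add(self, user, lang, day):
        """Return True and count the sentence if the user is still under the limit."""
        # The calendar month is part of the key, so a new month starts from zero
        # without any explicit reset job.
        key = (user, lang, day.year, day.month)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

Because original and derived sentences would both go through the same `try_add` call, a contributor who adds nothing for three weeks can still use the whole monthly allowance in the last week.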
Note that I'm not against the occasional import of other corpora into Tatoeba as long as they are lexically balanced and composed of sentences that are useful for language learners.
According to linguists, Berber is not a language but a group of languages [1]. Consequently "ber" is an ISO 639-5 language code but not an ISO 639-3 language code. That is probably why Berber has been declined by Wikipedia.
Tatoeba also does not accept languages that do not have an ISO 639-3 code, but an exception was made for Berber. In hindsight, this was probably not a good idea. It creates overlap and harmful competition with other Berber languages' corpora such as Kabyle.
[1] https://en.wikipedia.org/wiki/Berber_languages
In 2017 and 2018, Tatoeba's main English-speaking contributor added hundreds of thousands of sentences in bulk. These sentences were mostly built according to syntactic patterns and used wildcards to avoid creating paraphrases that differ only in their named entities. These massive additions have greatly reduced the lexical diversity of the English corpus and increased the proportion of sentences containing pervasive words from 20% to 40%. This sudden change coincides with a sharp drop in the number of active contributors to Tatoeba.
The introduction of rate limits for sentence additions would prevent such a flood from happening again.
** Pruned/Rebalanced Lists **
Rebalanced lists are lexical filters that provide a more varied and balanced view of the Tatoeba Corpus. They prohibit a word from occurring more than 10 times as often as in a reference corpus. Long sentences of more than 15 words have little success with translators and are therefore systematically pruned. The most recent sentences are pruned before older ones. The words targeted are usually pervasive named entities that are used extensively by a few Tatoebans, and not relevant across languages.
10 major languages on Tatoeba are currently supported:
- English: https://tatoeba.org/en/sentence...=1&orphans=any
- French: https://tatoeba.org/en/sentence...=1&orphans=any
- German: https://tatoeba.org/en/sentence...=1&orphans=any
- Italian: https://tatoeba.org/en/sentence...=1&orphans=any
- Japanese: https://tatoeba.org/en/sentence...=1&orphans=any
- Mandarin Chinese: https://tatoeba.org/en/sentence...=1&orphans=any
- Portuguese: https://tatoeba.org/en/sentence...=1&orphans=any
- Russian: https://tatoeba.org/en/sentence...=1&orphans=any
- Spanish: https://tatoeba.org/en/sentence...=1&orphans=any
- Turkish: https://tatoeba.org/en/sentence...=1&orphans=any
All rebalanced lists are updated automatically every Saturday.
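The pruning rules described above can be sketched roughly as follows. This is a simplified illustration, not the actual script behind the lists: the function name, the naive whitespace tokenisation, the assumption that a higher sentence id means a more recent sentence, and the choice to leave words absent from the reference corpus uncapped are all mine.

```python
from collections import Counter

MAX_WORDS = 15  # longer sentences are pruned systematically
MAX_RATIO = 10  # a word may be at most 10x as frequent as in the reference corpus


def rebalance(sentences, ref_freq):
    """Return the ids of the sentences that survive pruning.

    sentences: list of (sentence_id, text); a higher id is assumed to be more recent.
    ref_freq:  word -> relative frequency in a reference corpus (assumed input).
    """
    # Step 1: drop long sentences.
    short = [(sid, text.lower().split()) for sid, text in sentences
             if len(text.split()) <= MAX_WORDS]
    total = sum(len(words) for _, words in short)

    kept, kept_counts = [], Counter()
    # Step 2: walk from oldest to newest, so recent sentences are pruned first.
    for sid, words in sorted(short, key=lambda item: item[0]):
        local = Counter(words)
        # Simplification: the cap is computed once from the unpruned token total.
        over = any(
            kept_counts[w] + n > MAX_RATIO * ref_freq[w] * total
            for w, n in local.items()
            if w in ref_freq  # words unknown to the reference corpus are not capped
        )
        if not over:
            kept_counts.update(local)
            kept.append(sid)
    return kept
```

With a reference frequency of 1% for "tom" and a 13-token corpus, only the oldest sentence containing "tom" survives; the newer ones are pruned.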
> a lot of contributors just translate without seeing the errors
It's tricky to spot errors in a language you don't fully master. That's why it's so important to visit the sentence page before translating it and check whether any problems have been reported. It is also essential to assess the reliability of the authors you are translating.
To determine whether a wording is commonly used, I recommend taking a look at the number of exact matches in Google Books.
Many thanks to the corpus maintainers who volunteer their time to correct all these errors.
Monthly updates adopt a weekly schedule ✨
Every Saturday morning, all my lists are now automatically updated from the cloud 🤖
- Tatominer https://tatominer.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Pruned English ✂️ https://tatoeba.org/en/sentences_lists/show/171182
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
- Tatolead https://tatolead.netlify.app
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
No doubt it will join your 61,000 sentences that are already on the list :)
https://tatoeba.org/en/sentence...rd_count_min=1
No, because #12172884 is derived from one of mhr's German sentences, and then post-linked to your sentence.
"Spread by Tatoebans" is a multilingual list of sentences that have already significantly spread on Tatoeba.
The sentences of this subset tend to be more universal and to have a more dependable wording and spelling than other sentences on Tatoeba.
To enter the list, a sentence must have at least two links to sentences in other languages and from other contributors. Orphan sentences and post-linking are not taken into account.
This means that an original sentence must appeal to at least two speakers of two different languages to be selected. On the other hand, a derived sentence only needs to be retranslated once (by a third member into a third language).
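As a rough illustration of the entry rule, the check below counts only direct links to owned sentences in another language by another contributor. The `Sentence` and `Link` types are hypothetical, and requiring two distinct languages and two distinct contributors is my reading of the rule above, not a confirmed implementation detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sentence:
    lang: str
    owner: Optional[str]  # None marks an orphan sentence


@dataclass(frozen=True)
class Link:
    a: int
    b: int
    post_linked: bool = False  # True if the link was added after the fact


def is_spread(sid, sentences, links):
    """Return True if sentence `sid` meets the (assumed) entry criteria."""
    me = sentences[sid]
    if me.owner is None:  # assumption: orphans never enter the list
        return False
    langs, owners = set(), set()
    for link in links:
        if link.post_linked or sid not in (link.a, link.b):
            continue
        other = sentences[link.b if link.a == sid else link.a]
        # Ignore orphans, same-owner links and same-language links.
        if other.owner is None or other.owner == me.owner or other.lang == me.lang:
            continue
        langs.add(other.lang)
        owners.add(other.owner)
    return len(langs) >= 2 and len(owners) >= 2
```

For a derived sentence, the link back to its source already counts as one qualifying link, which is why a single retranslation by a third member into a third language is enough.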
** December 2023 Updates **
- Tatominer https://tatominer.netlify.app
- Tatolead https://tatolead.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
> What's the number of times a word should appear in the corpus for it to be filtered out by this algorithm?
If he weren't blocked, this question might have interested Amastan. About two-thirds of his English sentences (142,564 to be exact) have been "pruned" to build this filter list. It might be tempting to stop injecting his favorite pervasive words just below thresholds...
But wait, it seems that Amastan is contributing as always:
- account created recently
- native speaker of Kabyle/Berber/French/English/Arabic
- translates almost exclusively Amastan's English sentences
Please admins, don't let Amastan keep up his vandalism under a new identity!
# Small tip
Note that thanks to gillux's work, you can download any list of sentences (translated into any language) directly from tatoeba.org: https://tatoeba.org/en/sentence...wnload/171446. Go to the page of the list of your choice and click on the "Download this list" icon.
You can then import this list into Anki, keeping just one translation per sentence. And if you also want audios, you can generate them automatically using the HyperTTS add-on: https://ankiweb.net/shared/info/111623432.
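Keeping just one translation per sentence can be done with a few lines of Python before importing into Anki. This sketch assumes the downloaded file is tab-separated with the sentence id in the first column; check the actual layout of your export, since that format is an assumption here, and the function name is hypothetical.

```python
import csv


def one_translation_per_sentence(lines):
    """Keep the first row seen for each sentence id (first column, assumed)."""
    seen = set()
    rows = []
    # QUOTE_NONE keeps quotation marks inside sentences untouched.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0] in seen:
            continue
        seen.add(row[0])
        rows.append(row)
    return rows
```

`lines` can be an open file object or any list of strings, so the result can be written back out with `csv.writer` and imported as a tab-separated deck.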
The "Pruned English" list provides a more varied and balanced view of Tatoeba. It gathers all sentences of at most 15 words that contain none of the words or word sequences algorithmically classified as "pervasive". The pervasiveness of a text fragment is a function of its frequency, overrepresentation and informativeness. This simple filter eliminates almost half of the English sentences.
The pervasive words detected are: tom, mary, ziri, sami, yanni, rima, layla, skura, mennad, algeria, boston, berber, french, fadil, algiers, kabylie, kabyle, boldi, tatoeba, baya, algerian, benedito, edmundo, flavio, damiano, nuja, kalman, swim, fyodor, janos, leonid, adriano, miroslav, gabor, dmitri, gustavo, martino, gunter, esperanto, walid, algerians, rodrigo, oleg, lukas, tobias, bicycle, elias, igor, claudio, isabella, lorenzo, alberto, boris, santiago, amelia, ivan, yuri, karl, mosque, vladimir, farid, chess, medlars, bejaia, pietro, quran, windshield, yidir, lajos, bouteflika, giraffes, tebboune, silya, mina, coronavirus, couscous, taninna, taller, hijab, pona, sahara, figs, jayjay, salima, fluently, heathers, kabyles, hurried, berbers, suitcases, fluent, yazid, bakir, shahada, ewe, hyena, toki, homesick, swam, dania, dung, ticklish, centipede, sociopaths, marika, punctual, tagalog, saxophone, raining, stefan, eaten, o'clock, carlos, giraffe, yen, rained, lojban, jugurtha, kyoto, snowing, daphnis, barking, fuji, snowed, hokkaido, yiddish, uranus, islam, maltese.
The pervasive sequences of 3 non-pervasive words are: 'to do that', 'said that he', 'do that by', "don't think that", 'told me that', 'that he thought', 'able to do', 'that by himself', 'go to australia', 'he thought that', 'said that they', 'do that today', 'i wonder whether', 'need to do', 'me that he', 'that by herself', 'said that she', 'needed to do', 'do that again', 'told me he', 'said he thought', "that he didn't", 'do that for', "didn't know that", "didn't do that", 'should do that', 'do that anymore', 'they said that', "said he didn't", 'could do that', 'needs to do', 'me that they', 'told me they', "didn't think that", 'they said they', 'that they were', 'told me she', "won't do that", 'me that she', 'thought that you', 'not to do', "didn't seem to", "didn't need to", 'seemed to be'.
Feel free to browse this list at https://tatoeba.org/en/sentence...&unapproved=no
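Once the pervasive words and sequences are known, applying the filter is simple. The sketch below uses only a small sample of the two lists above, and the tokenisation (lowercase, letters, digits and straight apostrophes) is an assumption; how the pervasive items are detected in the first place is not shown.

```python
import re

MAX_WORDS = 15
# Small samples taken from the lists above, not the full sets.
PERVASIVE_WORDS = {"tom", "mary", "ziri", "boston", "berber"}
PERVASIVE_TRIGRAMS = {
    ("to", "do", "that"),
    ("told", "me", "that"),
    ("seemed", "to", "be"),
}


def tokenize(text):
    """Lowercase and split into word tokens (assumed tokenisation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def is_kept(text):
    """Return True if the sentence would stay in the pruned list."""
    tokens = tokenize(text)
    if len(tokens) > MAX_WORDS:
        return False
    if any(token in PERVASIVE_WORDS for token in tokens):
        return False
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    return not any(trigram in PERVASIVE_TRIGRAMS for trigram in trigrams)
```

For example, `is_kept("I need to do that now.")` is False because of the sequence 'to do that', even though none of its individual words is pervasive.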
** November 2023 Updates **
- Tatominer https://tatominer.netlify.app
- Tatolead https://tatolead.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- Pruned English ✂️ https://tatoeba.org/en/sentences_lists/show/171182
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
** October 2023 Updates **
- Tatominer https://tatominer.netlify.app
- Tatolead https://tatolead.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- Pruned English ✂️ https://tatoeba.org/en/sentences_lists/show/171182
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
** September 2023 Updates **
- Tatominer https://tatominer.netlify.app
- Tatolead https://tatolead.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- Pruned English ✂️ https://tatoeba.org/en/sentences_lists/show/171182
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
I'm delighted to be of service to you. #8708717
** August 2023 Updates **
I've just updated a few things that I built for Tatoeba:
- Tatominer https://tatominer.netlify.app
- Tatolead https://tatolead.netlify.app
- Spread by Tatoebans ✨ https://tatoeba.org/en/sentences_lists/show/170280
- Rated as 'not OK' 🔴 https://tatoeba.org/en/sentences_lists/show/170380
- Rated as 'unsure' 🟠 https://tatoeba.org/en/sentences_lists/show/170383
- Pruned English ✂️ https://tatoeba.org/en/sentences_lists/show/171182
- JMdict - Japanese 🇯🇵 https://tatoeba.org/en/sentences_lists/show/171073
- JMdict - English 🇬🇧 https://tatoeba.org/en/sentences_lists/show/171072
More information about these tools at my profile page: https://tatoeba.org/en/user/profile/lbdx
On Tatoeba, the vandalism of a few slowly outweighs the genuine efforts of the many 😢