Wall (7,000 threads)
Tips
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
urro
7 days ago
Augustus
8 days ago
coinxee
10 days ago
sharptoothed
12 days ago
LanguageExpert
15 days ago
changkuoth
15 days ago
Igider
15 days ago
samir_t
19 days ago
doemaar14
19 days ago
Warwari
19 days ago
The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.
![coinxee](/img/profiles_36/unknown-avatar.png?1713117292)
Is there an open-source English sentence database similar to Tatoeba?
![Augustus](/img/profiles_36/unknown-avatar.png?1713117292)
Mozilla's Common Voice is similar in collecting sentences and recordings thereof. It does not have the translation aspect of Tatoeba.
See https://commonvoice.mozilla.org/
![urro](/img/profiles_36/131422.png?1721252089)
If you just need English sentences, there are a few. However, I have looked myself, and found Tatoeba to be of the best quality, especially for English.
English-only:
• English Penn Treebank (University of Pennsylvania)
... is not something I know much about.
• English Web Treebank (Universal Dependencies)
... is mostly composed of biased sentence picks, but each has a grammatical breakdown. Stanford's NLP project Stanza uses it.
• Common Voice (Mozilla Foundation)
... as Augustus said!
With translation:
• OpenSubtitles2018 Corpus (OpenSubtitles)
... isn't very good for high-fidelity translation, but is rather natural, apart from its dramatizations.
Honorable mentions:
• Google Books Ngram Dataset (Google)
... only has a few languages. For example, their Japanese dataset is old and can only be accessed via purchase in yen.
• Wikipedia and Wiktionary (Wikimedia Foundation)
• Any other English (meta)corpora out there
https://www.google.com/search?q...s"%7C"dataset"
It really depends on your intentions and usage, as all corpora have their biases, unfortunately.
![sharptoothed](/img/profiles_36/73770df9e34f8021a0cebe3a71c1e2ef.png?1536268829)
✹✹ Stats & Graphs ✹✹
Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/
![changkuoth](/img/profiles_36/131338.png?1720704157)
I recently discovered that the Nuer language has been added to Google Translate, and I was thrilled to contribute, as I have always looked forward to this astonishing news. I have a passion for the Nuer language and have spoken it since birth. I can write, read, and speak Nuer fluently.
![LanguageExpert](/img/profiles_36/unknown-avatar.png?1713117292)
Yes, I've noticed that too! Also, I appreciate all the contributions that you've made on Tatoeba. I love languages, and I just thought I'd show you my appreciation for contributing sentences in Nuer. :) I enjoyed reading your story about your passion for Nuer.
![Igider](/img/profiles_36/80411.png?1606848742)
Azul, Hi,
(Kab) Wikigzawal - (Eng) Wiktionary*
But
(Kab) Wikipedia - (Eng) Wikipedia
These English words should be rendered into Kabyle, with a Kabyle meaning and a Kabyle form.
*Takadimit taqvaylit (Kabyle academy)
![doemaar14](/img/profiles_36/69502.png?1632668099)
I recently discovered that Tamazight has been added to Google Translate. I'm wondering: did they use the incredible number of Tamazight sentences here on Tatoeba?
If that's the case, I congratulate all of the Tamazight contributors on here.
If not, I wonder what data they trained their algorithm on.
![Yettutlay](/img/profiles_36/unknown-avatar.png?1713117292)
Thank you ❤️ Tanemmirt ❤️
![samir_t](/img/profiles_36/80480.png?1639767099)
Yes, Google Translate added what it calls "Tamazight" and used both Tatoeba corpora (ber and kab), but in reality the result is 99% Kabyle, which is a problem for its use. Moreover, Moroccans have already complained to Google, because the word Tamazight normally covers several North African languages, including Algerian and Moroccan ones, and not just Kabyle.
![doemaar14](/img/profiles_36/69502.png?1632668099)
@samir_t Interesting. If it's not too much work, could you give me an example where Google Translate gives you a Kabyle sentence instead of "standard Tamazight"?
![samir_t](/img/profiles_36/80480.png?1639767099)
Yes, here are examples:
https://tatoeba.org/fr/sentences/show/12559102
https://tatoeba.org/fr/sentences/show/12559103
Look at the Kabyle translation of each English sentence, then enter that English sentence into Google Translate for translation into "Tamazight", and compare: it gives the same Kabyle sentence. It actually only translates into Kabyle, even though the name of the language is "Tamazight".
![Warwari](/img/profiles_36/52ed7cfaffdccf620337d14925bd630f.png?1542801944)
What Google has just done is a revolution for our Amazigh language. The project is open to everyone, including Moroccans. In fact, several Moroccan Amazigh speakers have taken part in it (and take part here on Tatoeba), and everyone is welcome to contribute to the development of Amazigh machine translation, on Tatoeba as well. Tanemmirt (thank you), doemaar14, for your message of congratulations.
![CK](/img/profiles_36/1396.png?1619855846)
🍎 Screenshots 2014 to 2024
All times are in GMT+9 (Japan Time)
Most of these are from July 4th.
Most of these are of the "Number of sentences per language" page.
► Number of sentences per language 2024-07-04 at 08:58
https://imgur.com/a/1gv0Zt2
12,125,173
► Number of sentences per language 2023-07-05 at 8:31
https://imgur.com/a/zTbcNPJ
11,484,149
► Number of sentences per language 2022-07-04 at 16:16
https://imgur.com/a/AAAeYHH
10,542,864
► Number of sentences per language 2021-07-04 at 10:59
https://imgur.com/a/sVQtM2S
9,733,562
► Number of sentences per language 2020-07-04 at 12:14
https://imgur.com/a/FQ2naCf
8,487,237
► Number of sentences per language 2019-07-11 at 11:28
https://imgur.com/a/JylbPlA
7,653,357
► Number of sentences per language 2018-07-04 at 9:52
https://imgur.com/a/suukPes
6,603,517
► Number of sentences per language 2017-10-12 at 12:01
https://imgur.com/a/FpJT9CN
6,021,256
► Number of sentences with audio by language 2016-10-20 at 11:09
https://imgur.com/a/thn6WPj
The only screenshot I have from 2016 is for the audio.
► Number of sentences with audio by language 2015-01-22 at 20:18
https://imgur.com/a/bfL97Fm
The only screenshot I have from 2015 is for the audio.
► Number of sentences per language 2014-07-11 at 22:56
https://imgur.com/a/7amGeSw
3,244,250
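For anyone who wants the growth between these screenshots as numbers, here is a minimal Python sketch. The totals are copied from the list above; the dictionary layout and the `growth` helper are just illustrative names, not anything Tatoeba provides. Note that 2015 and 2016 are missing (only audio screenshots exist for those years) and that the screenshots are not all taken on the same date, so these are differences between screenshots rather than exact calendar-year figures.

```python
# Sentence totals copied from the screenshots listed above (year -> total).
# 2015 and 2016 are absent because only audio screenshots exist for them.
totals = {
    2014: 3_244_250,
    2017: 6_021_256,
    2018: 6_603_517,
    2019: 7_653_357,
    2020: 8_487_237,
    2021: 9_733_562,
    2022: 10_542_864,
    2023: 11_484_149,
    2024: 12_125_173,
}


def growth(totals):
    """Return (start_year, end_year, sentences_added) for consecutive screenshots."""
    years = sorted(totals)
    return [(a, b, totals[b] - totals[a]) for a, b in zip(years, years[1:])]


for start, end, added in growth(totals):
    print(f"{start} -> {end}: +{added:,} sentences")
```

The last line printed is `2023 -> 2024: +641,024 sentences`.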