Wall (6,933 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
🍎 Are you looking for English sentences to translate into your own language?
Here are some sentences that have audio that do not yet have translations into any language.
ddnktr's Sentences (732)
shekitten's Sentences (328)
Miktsoanit's Sentences (140)
AlanF_US's Sentences (30)
sundown's Sentences (8)
** Pruned/Rebalanced Lists **
Rebalanced lists are lexical filters that provide a more varied and balanced view of the Tatoeba Corpus. They prohibit a word from occurring more than 10 times as often as in a reference corpus. Long sentences of more than 15 words have little success with translators and are therefore systematically pruned. The most recent sentences are pruned before older ones. The words targeted are usually pervasive named entities that are used extensively by a few Tatoebans, and not relevant across languages.
10 major languages on Tatoeba are currently supported:
- English: https://tatoeba.org/en/sentence...=1&orphans=any
- French: https://tatoeba.org/en/sentence...=1&orphans=any
- German: https://tatoeba.org/en/sentence...=1&orphans=any
- Italian: https://tatoeba.org/en/sentence...=1&orphans=any
- Japanese: https://tatoeba.org/en/sentence...=1&orphans=any
- Mandarin Chinese: https://tatoeba.org/en/sentence...=1&orphans=any
- Portuguese: https://tatoeba.org/en/sentence...=1&orphans=any
- Russian: https://tatoeba.org/en/sentence...=1&orphans=any
- Spanish: https://tatoeba.org/en/sentence...=1&orphans=any
- Turkish: https://tatoeba.org/en/sentence...=1&orphans=any
All rebalanced lists are updated automatically every Saturday.
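For readers curious how such a filter could work, here is a minimal sketch of the rules described above: sentences over 15 words are dropped, and the newest sentences are pruned first until no word occurs more than 10 times as often as in the reference corpus. This is not the script actually used to build the lists. The function name `rebalance`, the input shapes, and the choice to leave words missing from the reference unconstrained are all assumptions made for illustration.

```python
from collections import Counter

MAX_WORDS = 15    # sentences longer than this are pruned outright
MAX_RATIO = 10.0  # a word may be at most 10x as frequent as in the reference


def rebalance(sentences, reference_freq):
    """Hypothetical sketch of a rebalancing filter.

    sentences: list of (date_added, text) tuples with sortable dates.
    reference_freq: dict mapping a word to its relative frequency in a
        reference corpus. Words absent from it are not constrained
        (an assumption of this sketch).
    Returns the kept sentences, oldest first.
    """
    # Rule 1: drop sentences of more than MAX_WORDS words.
    candidates = [(d, t) for d, t in sentences
                  if len(t.lower().split()) <= MAX_WORDS]

    # Word counts over everything that is still a candidate.
    counts = Counter(w for _, t in candidates for w in t.lower().split())
    total = sum(counts.values())

    kept = []
    # Rule 2: walk newest first, so recent sentences are pruned before older ones.
    for date, text in sorted(candidates, reverse=True):
        words = text.lower().split()
        over = any(
            w in reference_freq
            and counts[w] / total > MAX_RATIO * reference_freq[w]
            for w in set(words)
        )
        if over:
            # Pruning lowers the counts, so older sentences may then pass.
            counts.subtract(words)
            total -= len(words)
        else:
            kept.append((date, text))
    return sorted(kept)
```

Tokenizing with `split()` is a simplification; a real filter would need proper tokenization per language.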
@lbdx Reading your description of the list in your profile, it's interesting to me that you date the imbalance of the English corpus to 2017: that's when I joined. Sharptooth's graphs show a massive increase of English sentences at that time. Until 2020 (or thereabouts), I myself had only added about 1,500 sentences.
In 2017 and 2018, Tatoeba's main English-speaking contributor added hundreds of thousands of sentences in bulk. These sentences were mostly built according to syntactic patterns and used wildcards to avoid creating paraphrases that differ only in their named entities. These massive additions have greatly reduced the lexical diversity of the English corpus and increased the proportion of sentences containing pervasive words from 20% to 40%. This sudden change coincides with a sharp drop in the number of active contributors to Tatoeba.
The introduction of rate limits for sentence additions would prevent such a flood from happening again.
Thanks, @lbdx. It's good to have some numbers to back up what should be obvious to anyone who cares to look a bit at the English corpus and who has shaped it. Could you give us some more detail about the sharp drop in the number of active contributors? For example, has it been across all languages and countries?
The number of monthly sentence owners fell from 350-400 between 2012 and 2016 to 250-300 between 2017 and 2023. I don't have the details by language.
Do you have any recommendation on what might be a good rate limit?
I did find a post where you suggested 3000 original sentences per month in one language:
Just wondering if you still think it's a good rate limit, or do you perhaps have another opinion now?
Any particular reason why you suggested a monthly rather than a daily or weekly cap?
If I may share some additional insight: perhaps some people don't know this, but Tatoeba used to have a mass import feature. Only admins could access it, but you could send a list of sentences to an admin and ask for them to be imported. The feature was disabled in January 2019 because we migrated CakePHP from v2 to v3, and we didn't feel it was urgent to migrate the mass import feature. It was for the best, I guess. I would say this feature was the main cause of the reduced lexical diversity of the English corpus.
Trang, thank you for reopening the debate on this important issue.
My view on this has evolved slightly. I now think it would be simpler and more understandable to also include derived sentences in this rate limit of 3,000 sentences per language per month. Sentence counts would be reset at the beginning of each month. Once the limit has been reached, the user would not be allowed to add any more sentences until the following month. I prefer a monthly rate limit because it doesn't penalise users who don't contribute every day or every week.
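To make the proposal concrete, here is a minimal sketch of a monthly cap per user and per language. It is only an illustration, not Tatoeba's code: the class name `MonthlySentenceQuota` and its method are hypothetical, and counts are kept in memory rather than in a database.

```python
from collections import defaultdict
from datetime import date

MONTHLY_LIMIT = 3000  # the figure proposed in this thread


class MonthlySentenceQuota:
    """Hypothetical in-memory counter of sentences added per user, language and month."""

    def __init__(self, limit=MONTHLY_LIMIT):
        self.limit = limit
        # Key: (user, lang, year, month) -> number of sentences added.
        self._counts = defaultdict(int)

    def try_add(self, user, lang, today):
        """Record one added sentence; return False if the monthly cap is reached."""
        # Keying on year and month means a new month starts from zero,
        # so no explicit reset job is needed.
        key = (user, lang, today.year, today.month)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True
```

Since original and derived sentences count alike, both would go through the same `try_add` call. A user who adds nothing for three weeks can still use the full allowance in the last week, which is the advantage of a monthly cap over a daily or weekly one.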
Note that I'm not against the occasional import of other corpora into Tatoeba as long as they are lexically balanced and composed of sentences that are useful for language learners.
I don't see the problem with a native speaker contributing useful sentences in great volumes; I'd rather have that than non-native speakers contributing questionable sentences, even in small volumes. But then, I don't see a big problem with Tom and Mary either.
I used to be a proponent of limits, but the more I think about it, the more it seems to me that any workable limits would have to be set so high as to make them more or less meaningless.
> I don't see the problem with a native speaker contributing useful sentences in great volumes; I'd rather have that than non-native speakers contributing questionable sentences, even in small volumes.
This is too simple. A native speaker contributing 1000 auto-generated sentences isn't necessarily more valuable than a non-native speaker contributing one correct sentence.
Obviously, we don't want any auto-generated content at all. But even with limits per hour, day, week and month, or a combination thereof, it's trivial to adjust a script to conform with that, and still be able to upload thousands of sentences we don't want.
Yes, an automated script can be slowed down arbitrarily to conform with any given limit, until it no longer has a quantitative advantage over someone adding sentences manually.
Uploading thousands of sentences we don't want is only possible if you're able to upload thousands of sentences in the first place.
Is there a more user-friendly way to perform an advanced search, restricted to a list?
In advanced search (https://tatoeba.org/en/sentences/advanced_search), there is a drop-down menu, "Belong to List", from which one can choose a specific list. But this drop-down menu is very cumbersome, and choosing a specific list is very difficult.
For example, to restrict the advanced search to the list "Spread by Tatoebans", one has to scroll the menu hundreds of times to reach "Spread by Tatoebans".
(It would be easier if this menu could be partially searched to find a specific list more quickly, for example typing "Spread" lists all lists containing the word "Spread" and then choosing the proper one.)
For me, using Google Chrome on a Mac, I can just click the select option and start typing and it jumps to that list.
Remember that you can also save a template, then bookmark it.
Here is a template with that list selected.
You can make additional presets for your searches, too, in templates.
Has audio: Yes
List: Spread by Tatoebans
Without entering any search query, you can just click the "Search" button to get a random selection of sentences.
If you are looking for English sentences to translate into Persian with the above criteria, add the "Exclude sentences already translated into Persian" part.
Here is that template already created for you.
Thank you for the templates. You are right: on a PC, typing just works. On touch-screen devices, however, the keyboard does not open, and the only way seems to be scrolling through the menu.
Over the years I have witnessed, as many of you undoubtedly have, too, the hard work done and being done by all the Tamazight contributors on Tatoeba, and I commend them for it, ...but I have to ask this question:
Where in the real world can we actually find written content in Standard Berber/Tamazight?
I don't mean websites that just explain the grammar, but rather: monolingual websites that are constantly updated, textbooks, novels (fiction), wikipedia articles, news websites, blogs, real Berber-language content in the wild. That sort of thing.
(Of course the Berber dialects do have a rich history of spoken content, especially when it comes to music and movies, much of which can be found on YouTube).
I also know that some of you have tried to get Wikipedia to accept the language code for Berber, but they declined the request, didn't they? It seems to me the language merely exists as a spoken one.
I take it the Berber contributors have no answer to my question?
If so, what is the point of adding all these thousands of sentences, when people interested in these languages can't even find real-world, frequently updated content in said languages? I have scoured the web and found nothing: almost nothing in the way of fiction, news websites, science, technology, just zilch, except for a Bible translation and some random PDFs posted on French websites.
> What’s the point?
Berber is a family of languages; we have been raising this with Tatoeba for several years now.
That is why Tatoeba has a Kabyle language, for example, which is a real language with rules, courses, books, articles, websites, songs, poems, etc.
Most of the ber sentences are Kabyle ones.
The flag is also a Berber one. We asked the admins to replace the Kabyle one with the correct Kabyle flag, but they refused and used the ber one...
That makes sense. ''Berber'' is a unified, artificial language, after all.
To prove my assumptions wrong, could you point to any frequently updated websites that exist exclusively in Kabyle?
There are several sites devoted to the Kabyle language, starting with Wikipedia. Almost 7000 articles.
Wonderful. Kabyle in the real world! Tanemmirt! Hopefully the amount of Kabyle content keeps growing.
Tanemmirt. It's very kind of you.
There are several Kabyle websites that have articles in Kabyle and in French, like:
According to linguists, Berber is not a language but a group of languages. Consequently, "ber" is an ISO 639-5 language code but not an ISO 639-3 language code. That is probably why Berber has been declined by Wikipedia.
Tatoeba also does not accept languages that do not have an ISO 639-3 code, but an exception was made for Berber. In hindsight, this was probably not a good idea. It creates overlap and harmful competition with other Berber languages' corpora such as Kabyle.
I've also been wondering. Why is there such a large Kabyle speaking community on Tatoeba? I haven't seen Kabyle anywhere else on the internet really, even on linguistic-based websites. What about Tatoeba has such a draw for Kabyle speakers? (The same goes for other Berber languages, but Kabyle is the one I see the most.)
Tatoeba is a website based on volunteering. In practice, if someone gets excited about it and actively recruits others, a virtuous cycle can arise in which that language or culture seems overrepresented. Conversely, many other languages may be heavily underrepresented simply because nobody happened to get excited about the website, or because the enthusiasm has faded.
But this is entirely natural, because the numbers of participants are small. Random variation is then proportionally large.
You created your profile very recently.
They found this place to conquer, while people let it happen.
Tatoeba can't be conquered. Sentences in another language don't affect your life in any way if you don't want them to.
On the Advanced Search page, there's an option that allows you to limit results to a single user. That allows you to only see sentences that belong to your friends.
I know what the search query can do.
And you gave false information.
"...limit results to a single user."
as you said, to a SINGLE USER
"That allows you to only see sentences that belong to your friends. "
That doesn't allow me to see ONLY THOSE sentences.
And who talked about a single user???
I talked about a whole group of users.
And what's more, ignoring something that made the corpus less balanced, as lbdx said: https://tatoeba.org/en/wall/sho...#message_40485
"The years 2017 and 2018 were years in which Tatoeba's main English-speaking contributor added hundreds of thousands of sentences in bulk. These sentences were mostly built according to syntactic patterns and used wildcards to avoid creating paraphrases that differ only in their named entities. These massive additions have greatly reduced the lexical diversity of the English corpus and increased the proportion of sentences containing pervasive words from 20% to 40%. This sudden change coincides with a sharp drop in the number of active contributors to Tatoeba."
... doesn't help.
You can do multiple searches if you want multiple users; and Tatoeba could improve the search engine in the future to allow restricting results to given groups of users or to allow excluding groups of users. Don’t you usually translate Pfirsichbaeumchen’s sentences?
I think any imbalance in the corpus can be fixed with improved search features and using lists. There’s room for everyone.
> I think any imbalance in the corpus can be fixed with improved search features and using lists. There’s room for everyone.
The idea that improved search will sort out the imbalance in the corpus I find overoptimistic, to say the least.
What about lbdx’s solution, or hand-crafted lists?
I agree with lbdx that a sentence limit should be introduced. It's the least that should be done.
If you don't want to see it, it's still there.
Well, just wear blinders then.
"Don’t you usually translate Pfirsichbaeumchen’s sentences?"
Huh? So you think I'm maaster now. I'm Cabo; it's written at the top of the message block. And what if someone has a particular user whose sentences he/she likes to translate? Does he/she have no right to an opinion? Just translate and care about nothing else?
And thanks to whoever blocked my message?
Was it the word shite? Or were you just not happy with what I had to say?
> So you think I'm maaster now?
I did confuse you with maaster.
> And what if someone has a dedicated user whose sentences he/she likes to translate?
I wasn’t criticizing maaster. I was just saying that you can already restrict the sentences you see to those belonging to any one user you want.
✹✹ Stats & Graphs ✹✹
Tatoeba Stats, Graphs & Charts have been updated:
🍎 Stats : An attempt at counting the number of active contributing usernames per year
2024 (268) * in progress
2017 (1414) * the year the SSD died. https://blog.tatoeba.org/2017/
2013 (2289) * the peak
Here is the number of usernames owning sentences without valid dates (early entries in the database).
This is based on data harvested from the 2024-02-10 sentences_detailed.csv file.
Note that this is not actually the number of active usernames from each year.
The usernames counted are the ones who currently "own" sentences added in those years, not really the contributors, since "orphan" sentences can be adopted. This means that the year the Tanaka Corpus sentences were imported shows a lot more contributors than there actually were. Many of those who have adopted these sentences joined the project much later.
If you want to count contributors active in a given year, you should probably analyze the contributions.csv file instead.
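For anyone who wants to reproduce the owner-based count described above, here is a minimal sketch. It is not the script behind the published stats. It assumes that sentences_detailed.csv is tab-separated with the columns id, language, text, username, date added, date modified, and that missing values appear as `\N` or as a zero date; please check this against the downloads page before relying on it. The function name `owners_per_year` is made up for this example.

```python
import csv
from collections import defaultdict


def owners_per_year(lines):
    """Count distinct usernames that currently own sentences added in each year.

    lines: an iterable of tab-separated rows (an open file works).
    Assumed columns: id, lang, text, username, date_added, date_modified.
    """
    owners = defaultdict(set)
    # QUOTE_NONE: quotation marks inside sentences are ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip malformed rows
        username, date_added = row[3], row[4]
        if username in ("", "\\N"):
            continue  # orphan sentence: nobody owns it
        year = date_added[:4]
        if not year.isdigit() or year == "0000":
            year = "unknown"  # early entries without a valid date
        owners[year].add(username)
    return {year: len(names) for year, names in owners.items()}
```

To run it on the real export, open the file with `open("sentences_detailed.csv", encoding="utf-8", newline="")` and pass the file object to `owners_per_year`. As noted above, this counts current owners, not the people who originally added the sentences.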
You've raised a very good question. By the way, the Berber language as defined in Tatoeba is a catch-all, because 80% of its phrases are Kabyle phrases but with a mixture of other Berber words. This mixture is not based on any linguistic reality, only ideology. The "Kabyle Berberists" who created this ideology think that by imposing their Kabyle language and mixing it with 20% of other languages, they will be able to unite all Berbers.
By the way, of everything you can find in the way of novels, poetry, theatre, websites, music and so on, 70% is Kabyle, because the Kabyle language has been written in the Latin script since the 18th century. And there are 12 million Kabyles. Even the flag that Tatoeba's admins have imposed on the Kabyle language is not the right one, but rather the flag of all Berbers. The Kabyle flag has since been removed from Tatoeba. So one ideology feeds on another. But the Kabyle language will progress, that's for sure. You only have to look at the digital fields in which the Kabyle language is used for localisation, learning and so on.