Wall (6,767 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
2 days ago
Could someone tell @amastan to stop adding so many sentences about Algeria?
14,031 occurrences in English-language sentences, compared to:
'France', 1,001 occurrences,
'China', 1,296,
'America', 1,098,
'United States', 1,112.
All these countries have bigger populations, and are, dare I say, significantly more recognizable than Algeria.
(While ''French'' does have 13k+ occurrences, almost all of them are about the language (as opposed to the culture/country), which is spoken by about 100 million people worldwide. ''Spanish'', with 486 million speakers, yields only 870 results.)
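For anyone who wants to reproduce this kind of comparison offline, here is a minimal sketch. It assumes you already have the English sentences as a list of strings; the function name `count_term_sentences` and the sample sentences are made up for illustration, and the figures quoted above came from Tatoeba's own search, not from this code.

```python
import re
from collections import Counter

def count_term_sentences(sentences, terms):
    """Count how many sentences mention each term as a whole word or phrase."""
    counts = Counter()
    for term in terms:
        # \b keeps "France" from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        counts[term] = sum(1 for s in sentences if pattern.search(s))
    return counts

# Made-up sample, only to show the call.
sample = [
    "Algeria is in North Africa.",
    "I went to France and then to Algeria.",
    "She speaks French.",
    "The United States is large.",
]
counts = count_term_sentences(sample, ["Algeria", "France", "United States"])
print(counts)
```

Note that this counts sentences, not total word occurrences, and that "French" is deliberately not counted as "France".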
He also seems obsessed with transgender people.
Does being obsessed with something go against Tatoeba's rules?
Excuse me, I was just passing by, but does writing sentences about countries other than “recognized” really go against the rules of this website?
As @Nuel pointed out, Tatoeba's own guideline: ''Avoid using the same words, names, topics or patterns over and over again.''
And it's definitely not coincidence: almost all of these 14,000 sentences are from one single user.
Mere happenstance could never result in that many sentences about Algeria. This project should be representative of the entire world, or at the very least of places most people will be familiar with or actually live in (India, USA, China).
Translating on here gets boring/annoying pretty fast when every other sentence is about Algeria, not to mention the fact you have to filter out sentences containing ''Algeria'' each time you download sentences. We could start spamming different countries as a knee-jerk reaction, but that'd be just as bad.
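The filtering step mentioned above can be sketched roughly like this. It assumes a tab-separated export with three columns (sentence id, language code, text); the helper name `drop_sentences_with` and the sample rows are hypothetical, and a real file would be opened from disk instead of the in-memory stand-in used here.

```python
import csv
import io

def drop_sentences_with(rows, banned_words):
    """Yield (id, lang, text) rows whose text contains none of the banned words."""
    banned = [w.lower() for w in banned_words]
    for row in rows:
        sentence_id, lang, text = row
        lowered = text.lower()
        # Plain substring match, so "Algeria" also filters "Algerian".
        if not any(word in lowered for word in banned):
            yield row

# Stand-in for a downloaded export (assumed layout: id, language, text).
fake_export = io.StringIO(
    "1\teng\tAlgeria is a big country.\n"
    "2\teng\tTom likes tea.\n"
    "3\teng\tThe Algerian coast is long.\n"
)
reader = csv.reader(fake_export, delimiter="\t", quoting=csv.QUOTE_NONE)
kept = list(drop_sentences_with(reader, ["Algeria"]))
print(kept)
```

The same function works for any overused word or name, since the banned list is just a parameter.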
How is Algeria worse than Tom and Mary?
Everyone here has the same means of contacting Amastan, which is to send a private message. So you can in fact tell him to stop adding so many sentences about Algeria, as can anyone else. Doing so directly in private is also more likely to reach the intended recipient than posting on the wall. (I'll send him a link to this wall post so he has the opportunity to respond.)
But if you're actually hoping for an admin to order Amastan to stop, that's unlikely to happen, since the last time a similar issue came up the verdict was that
> As a general rule, no action will be taken against a contributor based on the sole fact that they are creating new sentences with a name that has been overused.
If you want to avoid English sentences which overuse certain words, you might want to restrict your searches to lbdx's "Pruned English Corpus" list, i.e. https://tatoeba.org/en/sentence...ny&sort=random
It's also worth pondering what Amastan is supposed to do instead. After all, telling someone to stop doing something they think is the right thing to do is unlikely to be effective, but convincing them that there's something else they could do that would be even better might work.
As far as I know, Amastan is Algerian, so it is indeed not mere happenstance that he added so many sentences about Algeria, but also not terribly surprising. However, it does seem like many of these sentences could be equally said about pretty much any country, which might be why you consider them boring.
But surely Amastan doesn't want us to think Algeria is boring! So maybe there is room for mutually beneficial cooperation here. Do you think some of the sentences are less boring than others? For example, I think that sentences that are less generic and more specifically about Algeria, like #7811221 "Algeria and Morocco are the only North African nations that recognize Berber as an official language." are a bit more interesting. What do you think?
Maybe Amastan would also be up for creating more sentences that are about Algeria without explicitly stating as much. (In keeping with the old adage that writers should "show, not tell.")
If everyone stopped adding almost useless, empty short sentences with the same words over and over again, perhaps Amastan would stop doing so as well.
I think that may be his reaction against Tom- and/or Australia-sentences.
(I still think these sentences are the main reason most contributors stop using Tatoeba after translating three (or many more) sentences.)
(It may be just a double standard against Amastan. Am I wrong?)
> It may be just a double standard against Amastan. Am I wrong?
You're right, Amastan isn't the first to stuff his sentences with pervasive words. But he's by far the one who's been producing the most of them lately. Last month, he added 16,000 original English sentences. The other two main English contributors added only 1,000. The figures are available at https://colab.research.google.c...=5&uniqifier=1
Unfortunately, the people who seem to be in charge of the project refuse to see this as a problem. How about a monthly cap on the number of original sentences in one language? 3,000 seems like a reasonable number to me...
As this issue comes up regularly on the wall, I would like the Tatoeba community to comment on this measure. If you think such a cap is appropriate, please post a plus sign in the comments of this post. If you want to show your disapproval, please post a minus.
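To make the proposal concrete, here is a rough sketch of how such a cap could be checked. It is not how Tatoeba works today; the record layout (user, language, date added), the function name `over_cap`, and the sample data are all assumptions for illustration. Only the 3,000 figure comes from the post above.

```python
from collections import Counter
from datetime import date

MONTHLY_CAP = 3000  # figure proposed in the post above

def over_cap(contributions, cap=MONTHLY_CAP):
    """Return {(user, lang, year, month): count} for every bucket above the cap.

    contributions: iterable of (user, lang, date_added) for original sentences.
    """
    buckets = Counter(
        (user, lang, added.year, added.month)
        for user, lang, added in contributions
    )
    return {key: n for key, n in buckets.items() if n > cap}

# Made-up log, only to exercise the function.
log = [("user_a", "eng", date(2023, 4, 1))] * 3001
log += [("user_b", "eng", date(2023, 4, 2))] * 1000
flagged = over_cap(log)
print(flagged)
```

Counting per user, per language, per calendar month matches the wording of the proposal; a rolling 30-day window would be a different design.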
When the corpus is dominated by the Western names Tom and Mary, everyone is OK with that. But when a non-Western place name like Algeria gets used, people want some caps.
This is cultural colonialism, plain and simple.
You're wrong. "Everyone" is not OK with Tom and Mary ad infinitum.
I still remember when a certain member here was uploading tens of thousands of sentences in one go once a month (using scripts), which is partly why we're lumbered now with Tom and Mary. Above all, it's that sort of behaviour, now being exhibited by another user, that I'd like to see some "caps" on. I wish it had been done years ago.
Well, if that gets instituted after Algeria and not after Tom and Mary, that still shows the bias of the community.
Maybe. But you're still mischaracterising people here. We're not all the same.
You're right, I guess.
That said, though, I personally don't like some users' use of this site to, as they see it, push their agenda.
However, my point in supporting a cap is about *volume*. As I said, I wish it had been done years before this latest user started the deluge. Tatoeba seems to be at the mercy of anyone motivated enough to dominate its contents.
> some users' use of this site to, as they see it, push their agenda
I don't believe in "not having an agenda". We all have our views, and the sentences we add necessarily reflect our views. Everyone has an agenda.
We all have views. That's stating the obvious.
There is a difference between "having views" and "having an agenda". Anyway, if you will: my "agenda" is that the only accepted agenda on Tatoeba should be the passion for creating (collecting, in some cases) high-quality linguistic content across the planet.
What it looks like to me is that there are two separate issues, and that they've been conflated. I see these two issues as:
1. A single person is adding far more sentences to the English corpus than the next two people combined.
2. People don't like that a lot of this person's sentences are about Algeria. That's too bad. More people should add sentences about their countries and cultures. He's doing nothing wrong by writing about his own, even if a lot of it amounts to political propaganda.
To address 1, I'll echo those who have suggested caps. I think the caps suggested are more than reasonable and would not prevent most users from adding the number of sentences that they are already adding.
I don't object to sentences about any country, just as I don't object to sentences about any city. As far as I'm concerned, some of the best English contributions here are written by non-native speakers. They put to shame my own attempts to write in other languages. I take my hat off to those users. I'm not one of those who discourage non-native speakers from adding English sentences.
What I do object to is sentences being uploaded to the site on an industrial scale. The main priority seems to be to pump out as many sentences as possible – to what purpose, we can only guess – and let the rest of us find and correct the mistakes (a service we provide voluntarily). We all make mistakes, but in this case it's the sheer volume. Whoever you are here – native speaker or non-native speaker, admin, corpus maintainer or whatever – taking this approach is not community-minded.
Did you still have a tab open from three days ago? I got rid of the native part of that post within an hour of making it (and my original comment said non-native contributions were fine, but that the volume of non-native contributions was the problem)
At any rate, the answer seems to be a cap on daily contributions.
There are currently 17,183 occurrences of the "wildcard" country "Australia" in the corpus, and 14,651 occurrences of "Algeria." Australia is the best thing to compare Algeria to, rather than Tom and Mary, which are names (and Amastan uses many names in his sentences).
If the cap is instituted once the number of occurrences of Algeria comes to roughly equal that of Australia, will that allay your concerns? It will mean both Algeria and Australia have equally benefited from the pre-cap situation. No one could say that Algeria has been disadvantaged; in fact, one could say that instituting this cap at any point (even now) makes it harder for any country to play catch-up to Algeria.
Disclaimer: I am not an administrator and do not have the power to make offers.
It's not the first talk about wildcard words.
The point is Algeria is already almost as well represented in the corpus as Australia - better represented even, proportionally to population. And that it will soon catch up to Australia in raw numbers.
Before Amastan was adding 16,000 new sentences in a month, someone else apparently used to add a similarly high amount, but their contributions were never capped. If we did the cap after Algeria caught up to Australia, you could say that both Algeria and Australia had benefitted equally from the situation before the cap was instituted. I don't know if this is less prejudicial or not; it's an idea.
We have been talking about a cap for years now. By the time we finally have one, that long list will contain 160 thousand sentences, not just 16 thousand.
A cap is a limit. I'm talking about a limit on the number of daily contributions.
16,000 sentences by one user in a day is too many. It's not natural. I'm disabled and I don't contribute anywhere near that much, nor has anyone contributed anywhere near that much before without using scripts.
Assuming a 16-hour workday, that is 1,000 sentences per hour, or about 17 per minute, or roughly one every 4 seconds. Just for reference. Or roughly one every 2 seconds if only working 8 hours on Tatoeba that day.
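The arithmetic is easy to check. A minimal sketch (the function name `seconds_per_sentence` is just for illustration) gives 3.6 seconds per sentence for a 16-hour day and 1.8 seconds for an 8-hour day, which round to the 4 and 2 seconds mentioned above.

```python
def seconds_per_sentence(sentences, hours_worked):
    """Average seconds available per sentence at a steady pace."""
    return hours_worked * 3600 / sentences

for hours in (16, 8):
    pace = seconds_per_sentence(16_000, hours)
    per_minute = 60 / pace
    print(f"{hours} h: {per_minute:.1f} sentences/min, one every {pace:.1f} s")
```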
My apologies. It's 16,000 a month, not 16,000 a day. I misremembered what I read.
Either way, that is 16 times as much as the people who add the second-most and third-most sentences. That's something we have good reason to want to prevent, for the sake of the future of this site.
But having this discussion in the midst of a discussion asking to limit sentences about Algeria specifically taints this.
About 500 a day is much more manageable, even by hand. Thank you for correcting.
What 16 thousand sentences a day? Where? I don't see any such day.
I misremembered - it was 16,000 a month.
However, I don't like such a large number of sentences with the word "Algeria", or with any other country name, either.
> I still think these sentences are the main reason most contributors stop using Tatoeba after translating three (or many more) sentences.
What is the presumption and what is your statement? Do we have the means to talk about worrying tendencies within contributions to Tatoeba? If so, do you have any evidence supporting this idea that trivial sentences, even in excessive amounts, somehow pose a threat to Tatoeba?
My personal impression is that it is *easy* to find useful sentences and a whole lot of mental cycles are wasted, with little constructivity, over something that isn't even consensual within the community. I can easily agree with a sort of "rate limit" on contributions as a broader measure against bruteforce takeover.
Last but not least... I have no intention to rank the reasons why somebody stops using Tatoeba but you yourself are a dubious example in my opinion, and you should at least think about that before coming up with theories. You seem to fancy far-fetched or downright undecipherable translations and you are very vocal not only against near-literal translations but upon your personal ad-hoc prescriptivism as well.
Since I was rather involved in Tatoeba via Clozemaster, I definitely see a bigger problem in coming across "maasterisms" or your annoying insistence to change translations that I consider plausible and practical because they are apparently not idiomatic enough for you.
That isn't just a presumption.
Clozemaster is not my business.
It's a voluntary job. I don't do it for other people's pleasure.
Add many more translations on Tatoeba and read them on Clozemaster.
Perhaps, there're not enough cynical sentences on Tatoeba. You can add some.
If it's not a presumption, then maybe argue for it, or show evidence.
Also, it's kind of a strawman to talk about someone's pleasure when I said you are actively causing DISpleasure, both to users with your trademark undecipherable sentences, and to contributors with the trademark prescriptivist gatekeeping over absolutely non-representative personal fixations. This is something you ought to think about, not to deflect.
Using Clozemaster after the change (as a free user you can now only do 30 sentences a day per language), my biggest concern is neither "maasterisms" nor "amastanisms" but the mood the sentences create: there are soooo many bleak views of the world, sooo many pessimistic sentences, sooo many about death and dying.
I know, I contributed to those. But in the big picture, there are too many of them. It makes me sad. It's a sad collection of sad sentences.
What we created reflects the sadness and anger that was inside us.
People probably stop using the site because they are not full of hatred or sadness.
I also stopped using Clozemaster after that ridiculous restriction but honestly, I can't recall the sentences to be particularly bitter or depressive.
Anyway, it doesn't literally have to be Clozemaster. If you want to learn about a language by taking sentences and their translations, it's an unwanted challenge that somebody makes up odd artsy translations on a regular basis, while also telling others off for translating too literally, or using certain colloquialisms that go against his peculiar view of language protectionism. I have more confidence in my language use than to fall victim to the latter attempts, but I think even the attempt is completely off for a project like Tatoeba. And the translations that barely resemble the original sentence have dubious value, to say the least.
It became small pockets of sadness. When you read something about death on a daily basis, it feels depressing (not only death, but suicide, suicidal thoughts, marital misbehavior, etc.). When I did 400 sentences a day I didn't notice this, but this way I picked up on it.
I wrote "I think".
Nevertheless, if you had been here in 2016, you could have read what was supposedly the last comment and opinion of freddy1 about "empty" sentences, written before he left the project.
> Users who participated in the last 200 contributions.
85% of the last 200, he's adding a sentence every 6 seconds, all of them are "Algeria" or "Antonio"…
To me the answer to "disproportionate number of sentences about Algeria" is to add more sentences about France, India, China, the U.S., Japan, Germany, Kenya, etc..
Granted, this is hard when someone adds such a volume of sentences that it dwarfs all others, so caps are a good idea.
At first, there were a lot of sentences about Japan on Tatoeba.
At some point the Tom and Mary wave began, as did the French-language one.
Likewise Ziri, Mennad, Sami and whatever else there is.
And now Algeria.
On the one hand: this too will pass, and new waves will come.
On the other hand: apparently this actively annoys people.
✹✹ Stats & Graphs ✹✹
Tatoeba Stats, Graphs & Charts have been updated:
Thanks, as always.
Is it okay to write 600 sentences in 56 minutes? ("That's a big number, how can I do that?" some may ask.) You just copy and paste sentences and fill in all the verb forms...>> (sorry English, Swedish...)
(And, is it useful?) A bit yes but a big no, because we use some conjugated forms of a verb more often, and some conjugated forms just 'sound stupid' (as we Hungarians say) in the same setting.
>> ... or use a bot. (Do not use bots, it's prohibited.)
If you want a mass import, contact the admins.
Translating a sentence in many different ways is valuable.
Adding many similar sentences is more valuable than adding just one sentence, but less valuable than adding many different sentences.
So, if you add new sentences, prefer adding ones that differ from each other over ones that resemble each other.
This person also uses bots. Block the profile.
@Cabo this profile is not a bot
The contributors on the kab corpora are identified. We are a team working together on different types of kab sentences. We are creating and sharing content before putting it on Tatoeba. Along with Tatoeba, we are also working on different open projects.
"@Cabo this profile is not a bot"
Where did I write that the profile is a bot?
From my message:"This person also uses bots."
He/she used bots ... or it used bots?
If you really are a computer engineer, you should know the difference between information and data.
"We are creating and sharing content before putting them on Tatoeba. "
Then leave it there, don't put your "content" on Tatoeba, it is not the right place for that data.
@Cabo You seem very angry. Please keep this space quiet. I'm leaving the discussion.
Near-duplicate generating, bot using, attempts to pollute the corpus in other languages, like Spanish, Hungarian.
This place was quiet before these things.
Actually, it seems like you can do it, you can write 16 thousand sentences a day without any consequences, so get into it!
There are now over 900,000 sentences on my list of proofread English sentences.
About 85% of these have audio files.
Many of them are yours, and sometimes you miss your own mistakes. That's why I would rather see sentences proofread by someone other than the owner.
🍎 I made a video with 100 recently-added sentences.
English Sentences 021 from the Tatoeba Corpus
9 minutes - 100 Listen and Repeat Sentences
If you want to translate these sentences, you can also find all these sentences on this list.
🍎 Tatoeba.org Native Speakers with Native Language Sentences
Find native speakers of languages you are studying and get links to their native language sentences.
I also created a cut-down version that may work better on devices that can't handle the full version.
I cut out all the lines for contributors with less than 50 native-speaker contributions.
🍎 Stats - 2023-05-20 - Numbers of Native-speaker Sentences & Native-speaker Contributors
Is there a way to do a search for X and bring back results that do NOT contain Y?
I don't want results that contain the sub-string "ihtiyacın". Is that possible?
I think putting a minus before the word does the trick, or do you mean not containing a word in the translation?
The minus sign before the word works only if the unwanted word is in the same language as the "from" language. I should revise my question:
Is there a way to do a search for X in language P and bring back results in language Q that do not contain Y in language Q?
In my example, I am searching for "if you need" (quotes included) from English to Turkish, but I don't want any results that contain the word "ihtiyacın" which is a Turkish word.
I don’t think it’s possible to do that within Tatoeba, but, in theory, it should be possible to write a script to do that, or to use the API and a text editor to do that.
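As a rough idea of what such a script could look like, here is a minimal sketch. It assumes you have already loaded the sentences into a dictionary and the translation links into a list of id pairs (for example from a downloaded export); the function name `search_excluding`, the data layout, and the sample sentences are all made up for illustration.

```python
def search_excluding(sentences, links, query, src_lang, tgt_lang, unwanted):
    """Find src_lang sentences containing `query` and return the tgt_lang
    translations that do NOT contain `unwanted`.

    sentences: {id: (lang, text)}
    links: iterable of (sentence_id, translation_id)
    """
    # Simple lowercasing; Turkish dotted/dotless i is locale-sensitive,
    # so treat this as an approximation.
    q, bad = query.lower(), unwanted.lower()
    results = []
    for src_id, tgt_id in links:
        if src_id not in sentences or tgt_id not in sentences:
            continue
        src_code, src_text = sentences[src_id]
        tgt_code, tgt_text = sentences[tgt_id]
        if src_code != src_lang or tgt_code != tgt_lang:
            continue
        if q in src_text.lower() and bad not in tgt_text.lower():
            results.append((src_text, tgt_text))
    return results

# Made-up sample data, only to show the call.
sentences = {
    1: ("eng", "If you need help, call me."),
    2: ("tur", "Yardıma ihtiyacın olursa beni ara."),
    3: ("eng", "If you need money, I'll lend you some."),
    4: ("tur", "Paran lazımsa sana borç veririm."),
}
links = [(1, 2), (3, 4)]
results = search_excluding(sentences, links, "if you need", "eng", "tur", "ihtiyacın")
print(results)
```

With this sample, only the second pair survives, because the first Turkish translation contains the unwanted word.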
Sadly it's not possible to do this at the moment. It has been mentioned several times in the past. The issue is being tracked here: https://github.com/Tatoeba/tatoeba2/issues/1576
Code or design contributions are welcome!
Have you tried asking on Stack Overflow? I know nothing about that technology...
The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.
✹✹ Stats & Graphs ✹✹
Tatoeba Stats, Graphs & Charts have been updated:
🍎 Milestone: 777,777
On the right side of this page, at the top of "Number of sentences with audio by language," you can see 777,777 for English.
2023-05-12 2:30 a.m. UTC
If you would like to translate some of these into your own language, try the following page.
Dashboard for Translating English Sentences with Audio
If you use the search tool:
you get a different number: 777,243
I wonder why
That bug was first reported in 2018.
777,777 on https://tatoeba.org/en/audio/index/eng
I paused uploading new audio files long enough to grab this screenshot. 😀