Wall (6,079 threads)
** Annotation: **
You may have noticed that a number of us have been using "Annotation:" as a keyword in comments, so that if we ever get more than one kind of comment, someone can quickly grab all of these and put them into an "annotation" comment type.
I grabbed all such comments from last Saturday's exported data if you are interested in looking through what we have so far.
This is a large page, so be patient when it's downloading. Perhaps some less powerful devices will not be able to properly display this.
Related GitHub Issue: https://github.com/Tatoeba/tatoeba2/issues/1830
** Stats & Graphs **
Tatoeba Stats, Graphs & Charts have been updated:
Here are a few numbers taken from sharptoothed's stats, sorted a different way, inspired by a recent discussion on the Wall.
Stats 2021-01-16 - Languages Sorted by the Number of Newly Translated Sentences
Many thanks @CK.
Thank you too @Sharptoothed
I would like to draw the attention of new Kabyle-language contributors to two things:
- Do not copy proverbs and other sentences from the web. Many are badly written, and sometimes the rules forbid it because of copyright.
- If you are using words clearly taken from other Berber languages that are not really used in Kabyle, it is better to contribute to the Berber corpus. That is what it is for.
They should also be told not to add incomplete words or sentences without context, such as
Or sentences that do not seem clear even to them and that are then translated word for word, for example
Nothing to add, because it is crystal clear.
Tatoeba will soon be the collection of untranslated sentences.
Nice collective work.
Add your own interesting sentences and translate the sentences of others as well.
The current percentage of untranslated sentences is 13.29%, so we're still quite far off from them becoming the majority. Your advice is still good, of course. Wehret den Anfängen! #1972832
Top 10 languages by percentage untranslated:
Top 10 languages by number of untranslated sentences:
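For anyone who wants to reproduce figures like the percentage and the two rankings mentioned above, here is a minimal sketch. It assumes the exported data has already been loaded as `(sentence_id, lang)` pairs and `(sentence_id, translation_id)` link pairs. The function names and the sample data are hypothetical, and this is not how sharptoothed's stats are actually computed.

```python
from collections import Counter

def untranslated_stats(sentences, links):
    """Return {lang: (untranslated, total, percent_untranslated)}.

    sentences: iterable of (sentence_id, lang) pairs (assumed layout)
    links:     iterable of (sentence_id, translation_id) pairs (assumed layout)
    """
    # A sentence counts as translated if it appears on either side of a link.
    linked = set()
    for source_id, target_id in links:
        linked.add(source_id)
        linked.add(target_id)

    total = Counter()
    untranslated = Counter()
    for sentence_id, lang in sentences:
        total[lang] += 1
        if sentence_id not in linked:
            untranslated[lang] += 1

    return {
        lang: (untranslated[lang], total[lang],
               100.0 * untranslated[lang] / total[lang])
        for lang in total
    }

def top_by_percentage(stats, n=10):
    """Languages with the highest share of untranslated sentences."""
    return sorted(stats, key=lambda lang: stats[lang][2], reverse=True)[:n]

def top_by_count(stats, n=10):
    """Languages with the largest number of untranslated sentences."""
    return sorted(stats, key=lambda lang: stats[lang][0], reverse=True)[:n]

# Made-up sample data, for illustration only.
sample_sentences = [(1, "eng"), (2, "eng"), (3, "fra"),
                    (4, "kab"), (5, "kab"), (6, "kab")]
sample_links = [(1, 3), (3, 1)]
sample_stats = untranslated_stats(sample_sentences, sample_links)
```

A real run would also want a minimum corpus size for the percentage ranking, since a language with a handful of sentences can easily reach 100%.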
A high proportion of untranslated sentences is neither an entirely good nor an entirely bad thing. It depends on the way the sentences are contributed.
If the collection of original sentences were assembled in a top-down manner, with a high degree of collaboration, in order to achieve diversity and extensive coverage of vocabulary, grammatical features, degrees of formality, and so on, then I would say a large number of untranslated sentences would be a bad thing, because it would lead to gaps. However, if everyone contributing original sentences were just competing to write as many as possible, with little variation, then a high proportion of untranslated sentences could be a good thing, because it could mean that people were deciding not to translate near-duplicates.
Our current Tatoeba corpus is a lot closer to the second extreme than I would like. When I do a search to find how a particular word is used, it is often possible to find at least one sentence that exercises it, but I usually have to page through many sentences with only superficial differences, while important senses of the word go uncovered. And of course we have vocabulary that is not yet covered at all.
Given that many sentences are near-duplicates that add little to the collection beyond what has already been added, I'm glad to see that many of them are not translated. People should not just seek out the sentences that are easiest to translate, but the ones that are most worthy given what has already been translated. And as maaster says, they should also add interesting sentences of their own.
I think the answer to this problem (or non-problem) is the addition of a semi-public API for external translation outsourcing.
I see many people like myself wanting to develop third-party apps that could help shorten the gap.
Translation is one of Tatoeba's core missions. Unless I'm misunderstanding what you mean by "outsourcing", I don't understand why we would want to outsource it. If people want to join our effort, they should join our community. Similarly, an attempt to make it easier from a technical point of view to participate in Tatoeba tasks should start by determining what can be done inside Tatoeba.
Note that sharptoothed's stats can be sorted.
Here they are sorted by "translated"
Sorted by "translations"
Sorted by "rated"
Just click any of the column titles that say "total." Click the title again to reverse the sort. You can click the other column titles, too.
> Top 10 languages by percentage untranslated:
> Top 10 languages by number of untranslated sentences:
A considerable number of those sentences are not only untranslated, but also near-duplicates.
One account, for example, seems to have been created to systematically produce untranslated, near-duplicate sentences with different place names. There are more than 60 thousand such sentences.
The pool of names also seems to be widening. There were 32 sentences with the pattern below at the time of the discussion.
There are more than 70 now.
Here's another striking pattern with nearly 15 thousand sentences.
They roughly mean: "How do you say/write/translate X (word or phrase) in/into Y (language name)?"
With a large word list and a language name list, one can easily produce countless sentences in any language.
There are probably other patterns too. Since Google Translate doesn't support that language, and they're being added by multiple accounts in a shuffled manner, it's not very easy to notice.
Still, I congratulate them for their Herculean efforts.
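Patterns like the ones described above can be surfaced without understanding the language. The sketch below is only an illustration under simple assumptions: sentences are split on whitespace, and only one word varies per pattern, so multi-word place names would be missed. The function name `find_templates` and the sample sentences are hypothetical.

```python
from collections import defaultdict

def find_templates(sentences, min_fillers=3):
    """Group sentences that differ in exactly one whitespace-separated word.

    Returns {template: sorted list of words seen in the "_" slot} for every
    template that has at least `min_fillers` distinct fillers.
    """
    fillers = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            # Replace one position at a time with a placeholder.
            template = " ".join(tokens[:i] + ["_"] + tokens[i + 1:])
            fillers[template].add(token)
    return {
        template: sorted(words)
        for template, words in fillers.items()
        if len(words) >= min_fillers
    }

# Made-up sample sentences, for illustration only.
sample = [
    "I live in Paris",
    "I live in Rome",
    "I live in Oslo",
    "The cat sleeps",
]
templates = find_templates(sample)
```

Sorting the result by the number of fillers would put the largest pattern families at the top, whichever accounts added them.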
You should probably remove that last imgur link. It detracts from your otherwise good arguments.
One of the links that you provided points to a very long thread. If you step through the first 42 comments (and I can understand why many people would not), you find a comment explaining that these sentences are being collected to generate a Kabyle voice assistant model for an open-source GPS:
So there is a specific purpose for them. By the way, it strikes me that giving these sentences a specific tag would make it easier for people to exclude them from searches where they're not desired.
> these sentences are being collected to generate a Kabyle voice assistant model for an open-source GPS
I see. This could be an excuse for every language, but once a place's name is recorded, I guess the same audio can be used with many different patterns. That's how it is even in some subway announcements. Using each name with every pattern and recording audio for each name hundreds of times doesn't seem to me the most efficient way to achieve that. Also, one would expect to see the major cities in a GPS program, but that pool doesn't even include the capital. Place names aren't the only source of near-duplicates anyway. My recommendation is that the rankings on the stats page be based on translated sentence counts and not on the total numbers of sentences. The latter can still be placed in parentheses next to the former without affecting the rankings. It may have some drawbacks and seem a bit childish, but I believe that this simple change will help improve the quality of the corpus by discouraging users from mass-producing sentences with the same patterns just to boost their language to the top of the rankings. If they are really doing it for some other project like GPS, NLP, a space program or whatever, then that change shouldn't affect their motivation.
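The ranking change proposed above could look roughly like this. It is a sketch only: the language codes and counts are made up, `rank_by_translated` is a hypothetical name, and the real stats page may be built quite differently.

```python
def rank_by_translated(counts):
    """Rank languages by translated-sentence count, totals shown in parentheses.

    counts: dict mapping language code -> (translated, total)
    Ties are broken alphabetically by language code.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1][0], item[0]))
    return [
        f"{rank}. {lang}: {translated} ({total})"
        for rank, (lang, (translated, total)) in enumerate(ordered, start=1)
    ]

# Made-up numbers: "aaa" has the most sentences overall,
# but the fewest translated ones, so it drops to last place.
sample_counts = {
    "aaa": (100, 1000),
    "bbb": (400, 450),
    "ccc": (250, 300),
}
ranking = rank_by_translated(sample_counts)
```

With these numbers the total count is still visible for every language, but it no longer decides the order.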
Berber is officially recognized in Algeria and this is not at the expense of its own dialects like Kabyle and Shawi.
This is what the 2020 Algerian constitution states about the Amazigh/Tamazight/Berber language:
Tamazight shall also be a national and an official language.
THE STATE SHALL ENDEAVOUR TO PROMOTE AND DEVELOP IT IN ALL ITS LINGUISTIC VARIETIES THROUGHOUT THE NATIONAL TERRITORY.
An Algerian academy for the Tamazight language shall be established under the authority of the President of the Republic.
It shall be supported by the work of the experts and assigned the task of providing the necessary requirements to develop the Tamazight language in order to integrate it as an official language in the future.
The modalities of implementing this Article shall be stipulated by an organic law
Therefore, Kabyle, as a variety of the Amazigh language, is part of this endeavor.
> So there is a specific purpose for them.
There is a specific purpose for everything. If there is a specific purpose, is it okay to transform Tatoeba into yet another soulless infinitely-declined compilation of sentences for machine learning? I can't see how tagging would help people avoid such sentences, except if there is a tag "robot-generated boring stuff for NLP - please ignore".
I've argued before, and others have too, that the source of data shouldn't be altered by the tool that uses it, especially if there are several tools, if people want the source of data to stay collaborative and fruitful. It seems that the battle is lost, so be it. But then we cannot be surprised if we see more and more users contributing uninteresting sentences/translations or untranslated sentences. Maybe we should remove the "human-made" in the presentation of Tatoeba.
Some people whined about NLP. They are professionals, bla bla bla. That's horsecrap. I've done NLP. I see some of my colleagues do NLP weekly. None of them has ever polluted a corpus that people spent time building in order to fit their NLP tool. (To be completely honest, the fact that most of these corpora aren't open to public contributions helps :P )
Then again, if the battle is lost, it's fine. I won't argue about variety, declensions in other languages, and all the other smoke-screen arguments. The reality is that selfish people aren't good contributors, period. If I want declensions, I open a book about declensions, take the afternoon to code the logic and press the "Run" button. That saves time and community peace.
If I start teaching French to 3-year-olds, I'll make my own private corpus of "Je suis *" in Tatoeba.
Thanks for help.
I'm wondering: you just created your account today, with no contributions yet, and you are able to analyze stats. It's amazing!!!
The team is going to translate all of the sentences and you are welcome to help.
Again, the proper names we are using (toponyms: names of towns, cities, villages and rivers in Kabylia) are for an ongoing project on GPS in Kabyle.
As you know, in Algeria, for political reasons, Kabyle is not taught at school. Instead, they are working on a "newspeak" named Tamazight that no one speaks, just to make popular languages disappear, the Kabyle language among them. Tatoeba is a quiet place for us to develop our language (Kabyle) and we want to keep it quiet, so please, if you can help translate and add better sentences, you are welcome to take part.
By the way, we have a page on FB and YouTube. You can identify yourself and take part in our activities on Kabyle linguistic corpora.
Transparency is our motto.
Lang Kab Activist
"I'm wondering: you just created your account today, with no contributions yet, and you are able to analyze stats. It's amazing!!!"
Is it bothering you? Then you may understand that something is not normal.
Why Tatoeba? You are a computer engineer. Create a platform!
Or just do not write so many near-duplicates here. Tatoeba is for language learners, not for GPS or NLP. It is for PEOPLE from PEOPLE, NOT for COMPUTERS from COMPUTERS!
Sometimes I'm not heard:
But in Hungarian, I do not stop when things get harsh.
I hope that what happened does not happen again. ;)
As an introduction, let me state that I disagree with your vision of how to contribute on Tatoeba (any corpus, not restricted to the kab), although I agree with your vision about your language. That being said, I would like to ask you one question:
If the members of your team are identified Kabyle speakers, what about your translators? Are they identified **competent** writers?
I want to stress that this is not a provocation but a real concern. We've already seen Kabyle contributors contributing many bad translations of their own sentences into French or Spanish. That is very worrying for proofreaders and maintainers of these corpora, as it introduces an unnecessarily heavy workload. I would like to remind everyone that the policy of Tatoeba is to translate only what you can really translate, not what you think you can translate.
I've already suggested to a couple of users that they partner up with native speakers, like I have, who can help clear up misunderstandings in translation, but they insist their sentences are natural.
They(?) don't want to hear you, and I won't supervise thousands of sentences on a weekly basis, and neither will others.
I hope that you understand.
I have to say that in Spanish the quality of those translations is quite poor, in some cases nobody even knows what they're supposed to mean, and they add so many sentences that I simply started ignoring all of them and not proofreading them at all, unless I'm tagged in the comments and asked for my opinion. I used to check other people's sentences in Spanish mainly to see if I could link them with some other language I speak or understand but now I just link my own most of the time.
For wrong translations (mainly syntax and grammar), two solutions are possible: 1- If the translator makes a lot of errors, we can change the owner and give the sentences to a native speaker. 2- Add a functionality to block users from translating into languages other than their native one.
We suggested some toponyms and proper nouns (not all toponyms or all proper nouns) for several reasons, among them NLP, quizzes and learning material for Kabyle kids. We are not suggesting changing Boston, Tom, October, Mary... as some languages did. We keep translating mainly from English, but we have inserted some sentences for other purposes.
Your two solutions are rather forceful and mean more work for the admins and developers. I think it would be better if users could be convinced to make high-quality contributions voluntarily. After all, it's in your (and presumably their) best interest to have good translations, and one careful, well-done translation is more helpful than ten hastily-written bad ones.
Maybe you could contact other Kabyle contributors (in your Facebook group?) and ask them to be mindful of the other participants in the project, to avoid adding so many sentences that proofreaders are overwhelmed, to focus on quality over quantity, and to proofread each other's sentences themselves.
> 1- If the translator makes a lot of errors, we can change the owner and give the sentences to a native speaker.
There are a number of problems with this approach. You may not even be able to find a native who also speaks the language from which the sentence was translated. And even if you can, it's not right to force this native to take on work that could have been avoided by preventing the creation of the bad sentences in the first place.
> 2- Add a functionality to block users from translating into languages other than their native one.
I think you're saying that the site should automatically prevent people from creating sentences in languages in which they have self-reported their skill as below-native (four stars or fewer). The thing is that four-star level is sufficient for contributing sentences given certain limitations (a low error rate, a manageable number of contributions, demonstrated ability to understand and respond to corrections, etc.). If people are determined to contribute sentences in languages that they speak well, but not quite at native level, they may start to self-report as being natives. That would be a problem, since it is helpful for us to know which sentences are contributed by natives and which are not.
We can't rely on software to do everything for us. There are some tasks that we need to do as humans. One of them is contacting people who consistently add bad sentences.
Many of the untranslated sentences posted by those users are untranslatable and hardly make any sense. I invite the authors of those sentences to translate them into languages that non-Berber speaking members of Tatoeba could understand (Arabic, French, English, etc.).
I want to clarify a statement I made in an earlier thread that begins here: https://tatoeba.org/eng/wall/sh...message_36385. passerby mentioned sets of large numbers of near-duplicate sentences differing only in place name. I commented that this was due to a specific purpose, namely for use with an open-source Kabyle GPS. My point was that it was not done simply to win a competition or out of spite. However, to build on a point that Aiji later made, the fact that it's being done for a concrete real-world purpose does not mean that it's good for a corpus designed for humans (our core customers).
Presumably, the GPS project has some schedule. Has the Kabyle team considered collecting the existing sentences, then perhaps deleting them, or at least bringing the contribution of such sentences to a close?
Contributors should only be writing sentences in languages that they know very well:
If you see a contributor writing sentences in a language that they clearly do not know very well, take action:
**A Brief History of the Tatoeba Project **
Perhaps some of our newer members might be interested in this.
It hasn't been recently updated. Perhaps someone would volunteer to maintain this.
HAPPY NEW AMAZIGH YEAR TO TATOEBA
ASEGGAS AMAZIƔ AMEGGAZ I TATOEBA
On the occasion of the New Amazigh Year (Yennayer, Iɣef n Useggas Amaziɣ), I'd love to say Happy New Amazigh Year to everyone on Tatoeba :-)
The New Amazigh Year is traditionally celebrated by all Amazigh speakers and North Africans in general. It's also recognized as an official holiday in Algeria (January 12).
It was recognized in 2018:
We corrected some errors on Transifex (website UI localization) a few days ago, and I wonder whether synchronization happens at regular intervals.
I updated the translations a few hours ago.
We usually update the translations every week together with code updates.
Is it okay to write 600 sentences in 56 minutes? ("That's a big number, how can I do that?" some may ask.) You just copy and paste sentences and fill in all the verb forms... >> (sorry, English, Swedish...)
(And, is it useful?) A bit yes but a big no, because we use some conjugated forms of a verb more often, and some conjugated forms just 'sound stupid' (as we Hungarians say) in the same setting.
>> ... or use a bot. (Do not use bots, it's prohibited.)
If you want a mass import, contact the admins.
Translating a sentence in many different ways is valuable.
Adding many similar sentences is more valuable than adding just one sentence, but less valuable than adding many different sentences.
So, if you add new sentences, prefer ones that differ from each other over ones that resemble each other.
@Cabo this profile is not a bot
The contributors on the kab corpora are identified. We are a team working together on different types of kab sentences. We are creating and sharing content before putting them on Tatoeba. Along with Tatoeba, we are also working on different open projects.
"@Cabo this profile is not a bot"
Where did I write that the profile is a bot?
From my message:"This person also uses bots."
He/she used bots ... or it used bots?
If you really are a computer engineer, you need to know the difference between information and data.
"We are creating and sharing content before putting them on Tatoeba. "
Then leave it there. Don't put your "content" on Tatoeba; it is not the right place for that data.
@Cabo You seem very angry. Please leave this space quiet. I am leaving the discussion.
Near-duplicate generating, bot using, attempts to pollute the corpus in other languages, like Spanish, Hungarian.
This place was quiet before these things.