CK CK January 29, 2016, edited October 30, 2019 January 29, 2016 at 1:38:04 AM UTC, edited October 30, 2019 at 10:42:07 AM UTC link Permalink

[not needed anymore- removed by CK]

pullnosemans pullnosemans January 29, 2016, edited January 29, 2016 January 29, 2016 at 5:23:45 AM UTC, edited January 29, 2016 at 5:32:21 AM UTC link Permalink

an important wall post, I think. let's see if we can actually get something going.

I think the main problem is that tatoeba right now has no clear identity as to where it wants to go (this is only my impression, no implication on trang's vision or anything like that). I have the impression that the project has no real means to actually ensure any quality, its openness is as much a curse right now as it is a blessing.

maybe through a clearer division of roles among contributors things could become more stable. my ideas for features that would reduce the openness of the site, but might nonetheless be beneficial in the long run are:

- increasing the number of corpus maintainers and encouraging them to take more radical action if they think sentences need to be changed or deleted
- forming bodies of trusted contributors who are known to be able to create good sentences and translations and emphasizing their importance to the project
- introducing a feature where contributors can be labeled responsible for creating translations from one specific language into one or maybe several specific language(s)
- creating a forum open only to key members (advanced/trusted contributors, corpus maintainers, etc.) with features for discussion and creating an overview over stuff that needs to be done (evaluating sentences, etc.)

the effects of these features could be increased by limiting the contributing rights of unknown members (e.g. introducing a feature where their sentences have to be evaluated by a corpus maintainer or trusted contributor before being displayed for everyone to see) and advertising on other sites for people to contribute to tatoeba (e.g., for japanese speakers to take on the tanaka corpus and change all the sentences into good, natural japanese).

the second point leads to the question of money. is tatoeba right now creating any money that could be used for ads, or even small monetary rewards for contributors?

all these ideas are based on the notion to increase the motivation for competent people to invest time into making tatoeba into something more stable. I don't know the details about how e.g. wikipedia manages the cleaning up of their articles, but I just generally feel that this site could be much more than it already is, but is right now in a sort of identity crisis as it has become big enough for its ambitions to grow higher.

pullnosemans pullnosemans January 29, 2016, edited January 29, 2016 January 29, 2016 at 5:44:14 AM UTC, edited January 29, 2016 at 5:48:58 AM UTC link Permalink

one more thing about tatoeba that right now paralyzes the possibility for change is the fact that many bad sentences have links to good ones in so many different languages that it is impossible for a single person to change them while verifying that the new, better sentence still is a fitting translation for all the sentences it is linked to. therefore, no bad sentence like this can be changed without either having all linked sentences verified and if necessary changed by various contributors (which would be bullshit) or losing links to any language that the respective contributor does not know well.

sacredceltic sacredceltic January 29, 2016, edited January 29, 2016 January 29, 2016 at 8:35:07 AM UTC, edited January 29, 2016 at 8:36:22 AM UTC link Permalink

In my view, the main issue is to avoid deterring new quality contributors by exposing the bad quality of some of the sentences that are shown on the service's main page, because this is the first contact with the service, and it should be immaculate.

So the random sentence that is shown on Tatoeba's main screen should only be an owned sentence, and possibly owned by a self-proclaimed native (although I'm afraid this would entice more contributors to lie about their native language...), or an OK-tagged sentence (tagged OK by a self-proclaimed native).

I would not change the retrieval rules that apply to searches, so even unowned sentences would be retrieved, so they can be adopted and improved.

Once quality contributors have joined, they progressively understand the issues and work toward a better quality rather than be deterred by the corpus' current state.
But they must first be made to join.

Hybrid Hybrid January 29, 2016 January 29, 2016 at 8:08:50 PM UTC link Permalink

"the random sentence that is shown on Tatoeba's main screen should only be an owned sentence"

I agree.

Hybrid Hybrid January 29, 2016, edited January 29, 2016 January 29, 2016 at 8:11:48 PM UTC, edited January 29, 2016 at 8:43:26 PM UTC link Permalink

"Do you have any other suggestions on how we can improve the quality of the Tatoeba Corpus?"

There's also the rating system that can be activated in the settings. Maybe this could be used to improve the quality of sentences in the future.

Edit: Also I think that we shouldn't be allowed to rate our own sentences.

It might also be a good idea to only allow native speakers, or very proficient non-native speakers, to rate sentences.

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 11:08:12 AM UTC link Permalink

Well, I'm deeply opposed to this rating. Over time, you will realise that the so-called "wisdom of the crowd" is nothing other than the dumbness of the crowd.
If you want to convince yourself, visit the http://www.urbandictionary.com/ which is absolutely full of crappy definitions coined by people thinking they're smart and funny.

A language is not the result of democracy. A language is LEARNT.
Otherwise, schools and teachers would serve no purpose.

CK CK January 30, 2016, edited January 30, 2016 January 30, 2016 at 11:20:19 AM UTC, edited January 30, 2016 at 1:23:06 PM UTC link Permalink

...

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 11:38:30 AM UTC link Permalink

hey, my opinions are not subject to voting, either...

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 1:18:33 PM UTC link Permalink

hey, I was joking. It's not because I don't use emoticons to evidence it that I don't crack jokes...

wells wells January 30, 2016 January 30, 2016 at 11:58:38 AM UTC link Permalink

I generally only use the rating system to warn users that there's something odd or unnatural about a sentence when I can't see a way to improve it that is concrete enough to leave a comment. Not that it acts much as a warning if it's hidden by default.

I'm not sure why you think its use will somehow develop into an undesirable consensus. Can you elaborate on that?

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 1:59:33 PM UTC link Permalink

>I'm not sure why you think its use will somehow develop into an undesirable consensus. Can you elaborate on that?

The proof of the undesirable consensus is in the pudding.

If you watch urban dictionary, for instance, you'll see that whenever a "smart" contributor coins a silly definition for a word, it's immediately voted up by many others, who think they're funny, and that is the way crappy definitions end up having a high rank.

But there are also countless examples on Internet (and also in the media, alas) of wrong syntaxes or spellings that are progressively becoming dominant since uneducated users or non-natives come to outnumber educated natives, and their belief that what they write is correct is reinforced by what they see on Internet.

Famous French soccer players, for instance, who are notoriously illiterate, draw thousands of fans on their twitter accounts, ready to use and republish any ineptitude they write.

I know what you will retort : that's how languages evolve (or so people believe...), but it isn't true, otherwise schools and teachers would not even know what spelling or syntax to instruct.

Internet has changed it all, because now, illiterate people publish the most.

I know English may evolve a lot under this pressure, especially since English has few rules. But that is not the same in other languages to which far more rules apply, and where mistakes are more obvious in their regard.

In France and the UK, countries I know well (but it must be the same elsewhere, as far as I know), the way the language is used, relative to education, actually serves to screen people socially. Mistakes in language use are spotted immediately by upper-class people (those that are to grant jobs...)
So you may argue that all language uses are equal as long as they're popular, but that is just not real.
A parallel example in French is the following : A majority of French people mispronounce « Les haricots ». At first, you may say : "So what ? Then their wrong pronunciation is the correct one". But once you've said that, you haven't helped much the "mispronouncers" getting a job, because this mispronunciation actually works as a social/educational marker for French educated people. It tells them immediately what is the educational level of the person saying it...

A voting system would actually strengthen people in their mistakes, to their own detriment, creating havoc in language rules that would end up being impossible to teach.

PS: what would you say about voting for mathematical results ? Let's make a test.

2+2 = ?

a) 4
b) 2
c) 0
d) 5

I vote for c)

al_ex_an_der al_ex_an_der January 30, 2016, edited January 30, 2016 January 30, 2016 at 2:57:48 PM UTC, edited January 30, 2016 at 3:10:54 PM UTC link Permalink

2 + 2 = 0 ? I'm not quite sure.
Two plus two is many, I guess. And many may be not exactly zero. ;-)
But besides that, you are right; the crowd isn't always right.
Unfortunately, by far not always. :(

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 3:12:56 PM UTC link Permalink

"many" is not an option. You'll have to set your own concurrent poll, to make things worse...

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 3:21:15 PM UTC link Permalink

for me, anything + anything = 0, because I'm a nihilist. I strongly believe all things emerge from nothingness and return to it. My theory is actually backed by most astro-physicists, nowadays.
c) option should definitely win.

wells wells January 30, 2016 January 30, 2016 at 6:33:14 PM UTC link Permalink

Well, that was a wall of text. I frankly don't see much relation between what I asked about ("how come an OK / Not OK system will lead to bad sentences") and what you replied. But I'll reply to a bunch of irrelevant things first.

> But there are also countless examples on Internet (and also in the media, alas) of
> wrong syntaxes or spellings that are progressively becoming dominant

I'm not quite sure what you are talking about. I don't know French much but I mostly see 'quel' / 'quelle' changed to 'kel', kind of like the English 'your' becomes 'ur'. Nobody in their right mind would consider those to be correct. Well, nobody past primary school age.

There was someone earlier today commenting on French sentences with colloquial contractions, asking them to be changed to use "proper" language, by the way.

> I know what you will retort : that's how languages evolve (or so people believe...)

No, I was not going to retort that. I've seen that argument get used and think it's stupid.

> In France and the UK, countries I know well (but it must be the same elsewhere, as
> far as I know), the way the language is used, relative to education, actually serves
> to screen people socially. Mistakes in language use are spotted immediately by
> upper-class people (those that are to grant jobs...)

Same here. People who never learnt the most basic compound word rules are numerous. Not that I would advocate using the rules to learn compound words -- you seem to either subconsciously know them or you don't. You can probably practise with the rules to get your subconscious up to speed, or so I think. The part about employment I find odd though -- there are plenty of employers (white, blue, and pink collars) who don't themselves know the rules, so how would they judge the applicant based on that?

> A majority of French people mispronounce « Les haricots »

"Aspirated h", or so Wiktionary tells me. Pronounced [le a.ʁi.ko], and mispronunciation would then be [lez a.ʁi.ko] I guess?

Now, finally the meat of the question.

> If you watch urban dictionary, for instance, you'll see that whenever a "smart"
> contributor coins a silly definition for a word, it's immediately voted up by many others,
> who think they're funny, and that is the way crappy definitions end up having a high rank.

No, I see thousands upon thousands of words with a single definition without a single person ever having voted on it. Usually someone coining a pair of words with the intent to insult their friend.

I do not think it applies to Tatoeba at all. Either a sentence is good or it is not good. There is no popularity contest between sentences. Or are you going to argue that « J'ai eu un mal de tête. » is somehow competing with « J'ai eu mal au crâne. » ? That one of them is wittier than the other?

There are about 300 000 French sentences on Tatoeba currently. You'd need an audience of at least three million French speakers all rating sentences to cover a significant part of the corpus, not a single one commenting that they think something about a sentence is wrong.

I'm much more concerned about many sentences turning out somewhat unnatural by virtue of being translated with the idea that each word of the original must be represented in the target sentence. "Word by word", if you will. It happens all the time here in my experience, especially on more complex sentences. I think learners should be steered clear of these sentences whenever possible. (Yet somehow the professional translators manage to produce fine sentences one after another. I applaud them -- translating is a tough job.)

sacredceltic sacredceltic January 30, 2016 January 30, 2016 at 6:55:27 PM UTC link Permalink

>I do not think it applies to Tatoeba at all. Either a sentence is good or it is not good.

Good, so I misunderstood your initial comment and we agree on that. But don't go believing that we're the majority here. Far from it. I have had countless debates with contributors here, including many students in linguistics, who advocate freedom to write as they wish and democracy as a way to assess the validity of their sentences, the "Facebook"-like ruling of languages, as against Academies that they all consider to be bunches of useless old nutters (age being an OK criterion for disparaging people, if I understand them well...)

pullnosemans pullnosemans January 31, 2016, edited January 31, 2016 January 31, 2016 at 11:42:59 AM UTC, edited January 31, 2016 at 12:07:01 PM UTC link Permalink

"If you watch urban dictionary, for instance, you'll see that whenever a "smart" contributor coins a silly definition for a word, it's immediately voted up by many others, who think they're funny, and that is the way crappy definitions end up having a high rank."

that's unfortunate. why exactly are you so convinced that tatoeba will show the same result, especially seeing as tatoeba is not a site with a humorous component like urban dictionary is?


"But there are also countless examples on Internet (and also in the media, alas) of wrong syntaxes or spellings that are progressively becoming dominant since uneducated users or non-natives come to outnumber educated natives, and their belief that what they write is correct is reinforced by what they see on Internet."

so you are saying language use should generally be dictated by a small elite upper class?


"I know what you will retort : that's how languages evolve (or so people believe...), but it isn't true, otherwise schools and teachers would not even know what spelling or syntax to instruct."

I don't think I really understand this conclusion. what relevance do teachers play for a phenomenon such as language, which is acquired naturally by most humans?


"I know English may evolve a lot under this pressure, especially since English has few rules. But that is not the same in other languages to which far more rules apply, and where mistakes are more obvious in their regard."

have you ever counted the "rules" in english? do you have a statistic with an average of rules per language on the globe?
edit - added later: and in the framework of which grammatical theory are you making this quite bold claim, which even many reputed professors of comparative linguistics would never claim to have a deep enough understanding of the way language works in our brains to be able to make?


"A parallel example in French is the following : A majority of French people mispronounce « Les haricots ». At first, you may say : "So what ? Then their wrong pronunciation is the correct one"."

no, no, nobody says that, you are the one speaking about there being "the" correct one.


"But once you've said that, you haven't helped much the "mispronouncers" getting a job, because this mispronunciation actually works as a social/educational marker for French educated people. It tells them immediately what is the educational level of the person saying it..."

I don't know if this applies to all "educated people", or only to those who take delight in regarding their idiolectal variety of their native language as the only one with a right to exist.


"A voting system would actually strengthen people in their mistakes, to their own detriment, creating havoc in language rules that would end up being impossible to teach."

I would say languages are per se impossible to teach. they can only be learned, through careful observation. all teachers ever could do for me at least was give me access to material (including their own native output) for me to work with.

sacredceltic sacredceltic January 31, 2016 January 31, 2016 at 3:23:04 PM UTC link Permalink

>so you are saying language use should generally be dictated by a small elite upper class?

I'm not saying it should. I'm saying it actually is. Upper classes will always define ways of differentiating themselves from the rest of the people. This is their very way to exist. And language is the prime differentiator. If you don't realise that, you don't know your own society.

>what relevance do teachers play for a phenomenon such as language, which is acquired naturally by most humans?

A pity. Teachers are supposed to teach language to your children. Maybe you escaped school. I agree that their relevance is more and more debatable at Internet age. However, they're still key to language education for most, and language education is the very base of education in general. Developed countries spend an awful lot of money on teachers. I hope they actually serve some kind of purpose...
I, for one (along with millions of other French children), learnt spelling and grammar at school.

>have you ever counted the "rules" in english? do you have a statistic with an average of rules per language on the globe?

No, and I don't need it. English has no rules for vowel pronunciation, for example, whereas Dutch, German, French or Spanish do. That's fewer rules.

>you are the one speaking about there being "the" correct one.

No, society does. The elite defines how things should be pronounced (usually, with good logical grounds...), not me. I'm just immersed in the society and I merely observe the phenomenon. But, yes I do think phonetic rules are handy, because they enable people to pronounce words they've never seen or heard before (which is not possible in English, but is in Spanish, Dutch, French...) and I agree to their imposition by the elite, because once you know the rules, disrespect of them sounds ugly.

A language is a protocol. It's subject to informal acceptance. It works exactly like politeness : Some people enter a restaurant smiling and saying Good day, others don't. You can't prevent those present from judging which way is more appropriate and civil, and from acting accordingly.

>I would say languages are per se impossible to teach.

Then all these useless teachers should be made redundant and the building of schools should be stopped at once. In France, this is the state's number one spending. Every taxpayer would subsequently pay 30% less tax.

pullnosemans pullnosemans February 1, 2016, edited February 1, 2016 February 1, 2016 at 2:33:00 AM UTC, edited February 1, 2016 at 2:34:08 AM UTC link Permalink

ah, I see. so the central misunderstanding here is that you were actually mostly speaking of written language, when I thought you were talking about language in general. under this premise I can understand much better why you are saying what you are saying, because written language is largely a human construct and thus indeed primarily taught, not acquired naturally.
in this case, I can also agree that the system of english orthography is among the most erratic and inconsequent alphabet-based systems in the world, if not the most erratic (I assume this is what you meant when talking about english having fewer rules than many other languages). however, I still don't see how this would lead to english writing being more prone to language change.

and so you suggest that since we want tatoeba to serve an educational purpose and linguistic prescription by influential people is simply real, the site should have an orientation toward high-prestige conservative language varieties. I guess that's a relatable opinion, even if I ideologically oppose language prescription on an a priori basis.

glad we worked this out. I ask you to be more explicit about your focus on written language when making statements about the character or profile of languages, or talking about language education. I think it would make your point of view easier to understand and help create dialogue, especially for those whose perspective differs from yours.

sacredceltic sacredceltic February 1, 2016 February 1, 2016 at 10:42:46 AM UTC link Permalink

>however, I still don't see how this would lead to english writing being more prone to language change.

Prescription frames and slows change. When there is no prescription, changes occur more frequently and erratically.

>and so you suggest that since we want tatoeba to serve an educational purpose and linguistic prescription by influential people is simply real, the site should have an orientation toward high-prestige conservative language varieties. I guess that's a relatable opinion, even if I ideologically oppose language prescription on an a priori basis.

No I don't. I think you misread me.
I'm just saying languages are not democratic. Sentences are either correct or not, by the standards of a given period of time and space, and these standards happen to be more applied (and enforced) by upper-class people who tend to have a longer education, enabling them to better master these standards.

This works the same way in large developed societies and small primitive tribes.

If, for instance, you were an ethno-linguist and you would want to learn and conserve a recently discovered language from a tribe in Central Papua, would you rather learn it from the young from this tribe, asking them to vote for words meanings and correct syntax, although these young started speaking pidgin in the last 20 years, through contact with other tribes, because they find it cool to go global, or would you rather try to learn it from old tribe members who know the myths, and tell their stories to children ?

So if voting is no good for ethno-linguists, as a tool to comprehend a language, why should it be good for us ?

And by the way, my purpose on Tatoeba is not educational, but rather conservatorial (which is not contradictory with an educational use). I coin sentences from all registers of society and I store, on Tatoeba, sentences I heard or read and that are sometimes nowhere else on Internet.

I realised - funnily through a Google search that retrieved Tatoeba sentences - that my native language was mis-represented on Internet and this came as a shock to me and that is why I've been so involved.

This misrepresentation is caused, not only by the distortion brought by the fact that more uneducated or non-native people write on Internet than educated native ones, but also because only certain things are written (and sometimes only certain things are written specifically on Internet) while others are not, for instance, sentences from the intimate or childhood register, or local turns of phrases that people deem good enough to say but not to write, and that interests me much.

Internet is the global scene on which everybody wants to act, talented or not, and it sums up to a representation that is neither very natural nor very rich or diverse, let alone beautiful.

User55521 User55521 February 5, 2016 February 5, 2016 at 9:45:41 AM UTC link Permalink

> If you want to convince yourself, visit the
> http://www.urbandictionary.com/ which
> is absolutely full of crappy definitions coined
> by people thinking they're smart and funny.

I find urbandictionary.com extremely helpful when dealing with English slang expressions.

If urbandictionary.com is supposed to be an example of how 'wisdom of the crowd' doesn't work for linguistic data, it's not a convincing one.

sacredceltic sacredceltic February 5, 2016, edited February 5, 2016 February 5, 2016 at 9:55:30 AM UTC, edited February 5, 2016 at 10:31:58 AM UTC link Permalink

Sure, here is a typical example http://www.urbandictionary.com/...hp?term=snatch

You'll notice that the correct definition is only the 3rd, with only 1419 votes up, when the 1st got 4066.
That means that 4066 idiots took the pain to do this.
It's a clear proof that so-called "crowd-wisdom" is just crowd-idiocy. And it's massive.
I can't see why Tatoeba would be different, since the same people use both services...

cueyayotl cueyayotl February 5, 2016, edited February 5, 2016 February 5, 2016 at 10:35:46 AM UTC, edited February 5, 2016 at 10:36:40 AM UTC link Permalink

>> Sure, here is a typical example http://www.urbandictionary.com/...hp?term=snatch

It refers you to another word ('cunt') and gives you an example sentence (which should give one the idea of what the word means already). When you click on the word 'cunt' it sends you to its page, and after seeing that the definition 'woman' doesn't make sense in the original example sentence, we have the definition: "A synonym for a woman's genitalia, vagina, pussy, etc." Now we have learned the definition of the word, as we wanted.

'Wisdom of the crowd' works fine here. Language is not an exact science, nor does it necessarily follow logic; a sentence being correct in a language cannot be compared to "1+1=2" or "1+1≠2".

sacredceltic sacredceltic February 5, 2016 February 5, 2016 at 11:23:01 AM UTC link Permalink

I don't know what you're talking about with your "logic".
The topic here is "crowd wisdom" to be applied to Tatoeba sentences.
I just provided a good example of crowd idiocy, in an Internet linguistic service, parallel to what Tatoeba does.

cueyayotl cueyayotl February 5, 2016 February 5, 2016 at 4:56:34 PM UTC link Permalink

-- I don't know what you're talking about with your "logic".

>> PS: what would you say about voting for mathematical results ? Let's make a test.

>> 2+2 = ?

>> a) 4
>> b) 2
>> c) 0
>> d) 5

>> I vote for c)

-- I just provided a good example of crowd idiocy

>> It [Urbandictionary] refers you to another word ('cunt') and gives you an example sentence (which should give one the idea of what the word means already). When you click on the word 'cunt' it sends you to its page, and after seeing that the definition 'woman' doesn't make sense in the original example sentence, we have the definition: "A synonym for a woman's genitalia, vagina, pussy, etc." Now we have learned the definition of the word, as we wanted.

Maybe there are examples of crowd idiocy on the site, but there just aren't any in this particular entry. Could you please provide another example? Thank you in advance.

sacredceltic sacredceltic February 5, 2016 February 5, 2016 at 5:20:09 PM UTC link Permalink

Crowd idiocy is when 4066 people vote up the definition that is not the most complete and accurate. Only 1419 voted up the complete and accurate definition, which ranks only 3rd.

That proves brilliantly that the public, on linguistic Internet sites, is not wise, and prefers to crack bad jokes and ridicule the service that is supposed to be rendered.

Hence my conclusion : crowd wisdom is crowd idiocy. QED

Only to you doesn't it look obvious, bizarrely... Logic, you were saying ??

wells wells February 5, 2016, edited February 5, 2016 February 5, 2016 at 6:18:30 PM UTC, edited February 5, 2016 at 6:19:22 PM UTC link Permalink

May I point out that the first definition had a four-year head start?

And that the place is called *Urban* dictionary, which aims to define slang and colloquialisms? Those senses that are not found in ordinary dictionaries.

And that none of the verb senses are slang, and thus not really at home on the site? "to snatch" has been in English for about a thousand years, if not more.

I think you are reading too much into this.

sacredceltic sacredceltic February 5, 2016, edited February 5, 2016 February 5, 2016 at 7:48:24 PM UTC, edited February 5, 2016 at 8:07:38 PM UTC link Permalink

>"to snatch" has been in English for about a thousand years, if not more.

English is not that old...

>I think you are reading too much into this.

Precisely not.

First, "Urban" doesn't equate to colloquialisms in my dictionary.
It means "from the city", as its Latin root "urbs" indicates.
I don't think "vagina" comes to mind first for "snatch" any more in cities than in the countryside. Or maybe you think rednecks are ignorant?

The fact remains, nevertheless, that 4000+ people took the trouble, on purpose, to vote up a definition in a dictionary "from the city" that is provocative and very narrow-minded.

There is no reason to think that people would proceed differently on Tatoeba. I know millions of teenagers who would laugh their heads off voting up the most ridiculous sentences, and that would result in a ridiculous exhibition of counter-example sentences, precisely the opposite of what Tatoeba is trying to promote.

Not everybody is happy that "snatch" comes up with "cunt" as its only definition in a popular online dictionary on the Internet.
But okay, you might find it funny (if you're under 25...)
But do you realise that search engines actually take it at face value?

Well, Tatoeba is the same. When a sentence is wrong, it shows up in Google searches anyway, and people subsequently take it as is.
And Tatoeba is very well indexed by search-engines, since Tatoeba has a serious and consistent purpose that makes it much favoured by these engines (search engines are optimised to downrate inconsistent websites, that are usually redirecting to unwelcome advertising...)

Yesterday, people used to learn languages from books that were edited (i.e., corrected). Today, they learn them from the Internet, where anything is and will be written, without any correction (just read Wikipedia or Twitter if in doubt...)
I'm not saying we can prevent the Internet from being full of crap; that is already hopeless. But at least we can make sure that the results most favoured by smart search engines are not only crap. Otherwise, I can't tell you at what rate our languages are going to degrade.

TRANG, February 1, 2016 at 9:03:09 PM UTC

The problem of quality is a long-standing one, and many things have already been suggested and discussed in the past. I'll try to cover the various points that have been mentioned in this thread.


1. Unadopted sentences

Unadopted sentences are a bit difficult to deal with. They are usually not bad to the point that it's clear they should be deleted, otherwise they would simply be deleted, but they are often bad enough that nobody feels like taking care of them.

Removing them from the corpus is an option, but doesn't sound like the best option.
I am not against the idea though. I had discussed, a while ago, the case of the unadopted Japanese sentences and asked what if we simply delete them? The answer I got was basically that they are not harmful to the point that they should be deleted. Therefore in the case of Japanese, we will keep them.
But if, for another language, the community considers that the unadopted sentences are basically useless and should all be deleted, I would not necessarily reject the idea.

That being said, I agree that we should make them less accessible to ordinary users. We already do so by displaying unadopted sentences after owned sentences in the search results.
But we should also, as was suggested, not display them among the random sentences.
We should probably also prevent users from translating unadopted sentences, since there are many owned sentences waiting to be translated.
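
As a rough illustration of the three measures above, here is a minimal sketch. It is not Tatoeba's actual code: the `Sentence` class, its `owner` field and the function names are all hypothetical, and an unadopted sentence is simply assumed to be one without an owner.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    # Hypothetical model: owner is None for an unadopted sentence.
    id: int
    text: str
    owner: Optional[str] = None


def order_search_results(results):
    """Put owned sentences first and unadopted ones last.

    sorted() is stable, so the original ranking is kept within each group.
    """
    return sorted(results, key=lambda s: s.owner is None)


def pick_random_sentence(corpus, rng=random):
    """Pick a random sentence among the owned ones only."""
    owned = [s for s in corpus if s.owner is not None]
    return rng.choice(owned) if owned else None


def can_be_translated(sentence):
    """Offer only owned sentences for translation."""
    return sentence.owner is not None
```

Keeping the three checks as separate functions means each measure could be switched on independently, per language if needed.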


2. Sentences contributed by non-native members

The idea of limiting contributors to contribute only in their native language cannot be applied to all languages in my opinion. It cannot be applied to all contributors either.

There are languages for which we have almost no chance to find native contributors, but for which we have passionate non-native contributors who want to do what they can to document these languages. I don't see a good reason to discourage that. Sure, the sentences may suffer in terms of quality, but does quality matter so much in this case?

There are users who are capable of creating correct sentences in foreign languages, even though they are not native speakers of the language. There are users who have connections with native speakers, and have the possibility to ask these native speakers to correct their sentences before they submit them into Tatoeba. As long as a contributor is careful enough, I wouldn't stop them from contributing in a foreign language.

Also being a native speaker does not ensure good quality. Some native speakers can contribute pretty bad sentences. And even good contributors can make mistakes when adding new sentences or translating sentences.

So I don't think we could make it a general rule, to limit members to contribute only in their native language. This is something that should rather be decided case by case, for each user.

One thing we might need is to set up some better, more scalable, mechanism to temporarily ban contributors from adding sentences in a certain language because the average quality of their contributions in that language is below standard.

At the moment the only thing we have is a functionality that is available only to admins: admins can set the "level" of a user to "-1". As a result, the user is not able to add or translate any sentences anymore. They can only edit their current sentences.
The fact that only admins can access this functionality, and that we don't have enough admins, and that there's no way to customize the restriction to specific languages, makes this functionality suboptimal.
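
To make the idea of a language-specific restriction more concrete, here is a minimal sketch. It is not Tatoeba's code: the `Member` class, the `blocked_languages` field and `can_add_sentence` are hypothetical names, and only the "-1" level comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    # Hypothetical model. level == -1 mirrors the existing global restriction;
    # blocked_languages is an assumed addition for per-language restrictions.
    username: str
    level: int = 0
    blocked_languages: set = field(default_factory=set)


def can_add_sentence(member, lang):
    """Return True if the member may add or translate a sentence in `lang`."""
    if member.level == -1:
        # Global restriction: no new sentences in any language.
        return False
    return lang not in member.blocked_languages
```

With something like this, a member could be blocked from adding French sentences while still contributing freely in their native language.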


3. Corpus maintainers and trusted users

Pullnosemans suggested that a possible way to improve quality is to increase the number of corpus maintainers and form bodies of trusted contributors. I agree that the number of corpus maintainers and trusted contributors has (or at least should have) a positive influence on the quality of the corpus. I'm all for having more corpus maintainers and advanced contributors. But how do we achieve this? And how do we make sure that we are promoting the right people?

At the moment we lack people who want to or can invest time and effort into building a stronger community of contributors. We probably also lack people who are motivated to take new responsibilities, or who have the right mindset, knowledge and skills to take these responsibilities.
We can't just accept anybody for the role of corpus maintainer, and we can't force contributors to be corpus maintainers either. There are people who want to become corpus maintainers but are not ready for it. There are people who would be great corpus maintainers but don't want to take the responsibilities.

Maybe there are certain things that we're doing wrong and we could fix, to create a more engaged community. But I can't see this happening if we don't have very dedicated and active community admins.

On a side note, I want to clarify that corpus maintainers actually do have the possibility to take radical action if they think a sentence needs to be changed or deleted. Our guideline for corpus maintainers is that they first post a comment, then wait two weeks in order to give the sentence's owner, as well as other members, the time to react to the comment. If there is no reaction after these two weeks, then the corpus maintainer can freely change the sentence or delete it.
This is only a guideline though. Corpus maintainers can skip the two week delay, and delete or edit a sentence right away, if they consider that an action needs to be done urgently.


4. Money

When it comes to quality, I'm not convinced money can do a lot for the project. I don't reject the idea of rewarding competent contributors, or of having people paid to maintain the corpus as a full-time or part-time job. But there are several problems.

Obviously there is the problem of funding: where do we find the money? Can we even gather enough funding to reach the level of quality that we're dreaming of? How much would we have to invest? How much would you like to be paid, to correct sentences in the corpus?

Now even with all the money in the world, how would we evaluate that someone is actually competent? How do we evaluate that someone is doing a good job at improving the corpus' quality? How do we know that we're not just wasting money?

There will also be problems of prioritization: what will justify investing more money in one language than in another?

We do have some funds, from donations, that we could use for something other than paying for the server. But nothing substantial. We can of course try to run a donation campaign, if we have a clear plan of what to do with the money that we would raise. But right now I don't see any clear plan.


5. Sentence ratings/collections

The sentence ratings/collections feature (I'll call it ratings for the rest of this message) is, for me, an important feature to improve quality, and it has been implemented with the problem of quality in mind. This is not about implementing a voting system. This is about providing an infrastructure for contributors to evaluate the quality of the sentences in the corpus, and for users/learners to have a better indicator to decide if they can rely on a sentence or not.

One feature that was suggested years ago was to have some sort of secondary corpus. The secondary corpus would basically be a space where users would be more "free" to add poor quality sentences. This would be the place where sentences from new users would be stored, as well as sentences from users who are contributing in their non-native languages. And if, after verification, a sentence from the secondary corpus is considered to be actually good, it can be moved to the main corpus.

For me, the current Tatoeba corpus is that "secondary corpus". We don't have the "main corpus" yet but we should find a way to build it. For that, we would need to define criteria that would help us decide what is worthy to go into the main corpus.
For instance we could start with this criterion: all the sentences from corpus maintainers that are written in the native language of the corpus maintainer are worthy to be in the main corpus. We could extend the criterion to advanced contributors.
But then what about the rest of the contributors? And what about sentences from advanced contributors or corpus maintainers that have mistakes?

This is where the ratings system comes into play. It should provide a more accessible and standard way for contributors to mark sentences as they explore the corpus, to express their opinion about the quality/correctness/reliability (whatever you want to call it) of the sentences that they see. These opinions would serve as an extra criterion to decide whether or not a sentence can go into the "main corpus".
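
Purely as an illustration of how such criteria might be combined, here is a minimal sketch. Everything in it is an assumption made for the example: the role names, the rating labels, and especially the `min_ok` threshold, which is an arbitrary value and not a proposed rule.

```python
# Hypothetical role names for contributors whose native-language
# sentences would be accepted directly.
TRUSTED_ROLES = {"corpus_maintainer", "advanced_contributor"}


def qualifies_for_main_corpus(sentence_lang, owner_role, owner_native_langs,
                              ratings, min_ok=2):
    """Decide whether a sentence could go into the "main corpus".

    ratings is a list of "ok", "unsure" or "not_ok" strings given by
    other contributors (assumed labels).
    """
    # Any "not OK" rating keeps the sentence out, whoever owns it.
    if "not_ok" in ratings:
        return False
    # First criterion: trusted owner writing in a native language.
    if owner_role in TRUSTED_ROLES and sentence_lang in owner_native_langs:
        return True
    # Extra criterion: enough "OK" ratings (arbitrary threshold).
    return ratings.count("ok") >= min_ok
```

The point of the sketch is only that ownership and ratings can be separate inputs to one decision; which ratings count, and from whom, would still have to be defined.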

The rating system is currently still in the experimental phase. And unfortunately, it probably will remain in an experimental phase for another year or so because I personally won't have a lot of time to dedicate to it and there are other topics that I consider higher priority. But who knows, maybe this year we will have new developers joining the team who will be motivated and inspired to work on this problem.


So that was a long rant... There is a lot more to say on the topic but I won't have time for it. I hope nonetheless that it gives everyone a better vision of where Tatoeba stands when it comes to quality.

sacredceltic, February 1, 2016 at 9:46:44 PM UTC

I am still totally opposed to your pseudo-democratic system for assessing the quality of sentences.
Being a corpus administrator or a corpus maintainer is no guarantee of quality, and neither is the fact that several contributors, who may themselves be ignorant, approve a sentence.

A sentence is either correct or incorrect. That doesn't depend on anyone's assessment. It's just a fact.

tommy_san, February 2, 2016 at 2:12:38 AM UTC

> I had discussed, a while ago, the case of the unadopted Japanese sentences and asked what if we simply delete them? The answer I got was basically that they are not harmful to the point that they should be deleted. Therefore in the case of Japanese, we will keep them.

There are actually some (though not many) sentences that I find plainly wrong or clearly unnatural, and thus harmful. I rate them "not OK" and "unsure" (even though I'm not unsure about anything) respectively to warn other users. However, most people don't see my ratings, so these sentences keep getting translated, especially often by new members.

If the community thinks it's better for me to delete these sentences, I can do so. In that case, you'd need to excuse me for accidentally deleting sentences that are correct in some variety of Japanese I'm not familiar with, or even ones that are correct in standard Japanese that include a word or phrase I don't know. You'd also need to excuse me for deleting sentences that could be turned into good example sentences with some changes. I don't have the time or ability to improve them and make sure the new sentences match all the translations, and there are many sentences that, in my opinion, wouldn't make good standalone example sentences anyway.

By the way, I think it's really important for us to tell new members what to translate and what not to translate. Since we have both good and bad sentences, they should translate only when they're sure it's a good sentence. If they cannot judge the quality of sentences themselves (which is the case for many non-native speakers), it's better to choose sentences owned or tagged/marked OK by a self-identified native speaker. Whenever I notice, I tell this to members who translate bad sentences, but it's something every contributor should keep in mind.

I also wonder if we could develop a set of good sentences that contributors of every language could consider translating. The set would surely include sentences like "Hello" and "Thank you", but it doesn't have to be phrasebook-like. It could include any sentence that you find good and real and that makes good sense out of context (such as "You made the mistake on purpose, didn't you?" and "Does this dress make me look fat?"). It's not that all contributors should translate from this set, but if they don't have a particular preference, it might be better for them (especially contributors to languages with few sentences) to translate sentences from such a set than to simply translate recent or random sentences, which are often not very good.

pullnosemans, February 2, 2016 at 7:25:11 AM UTC (edited February 2, 2016 at 7:27:12 AM UTC)

"There are actually some (though not many) sentences that I find plainly wrong or clearly unnatural, and thus harmful. I rate them "not OK" and "unsure" (even though I'm not unsure about anything) respectively to warn other users. However, most people don't see my ratings, so these sentences keep getting translated, especially often by new members.
If the community thinks it's better for me to delete these sentences, I can do so. In that case, you'd need to excuse me for accidentally deleting sentences that are correct in some variety of Japanese I'm not familiar with, or even ones that are correct in standard Japanese that include a word or phrase I don't know. You'd also need to excuse me for deleting sentences that could be turned into good example sentences with some changes. I don't have the time or ability to improve them and make sure the new sentences match all the translations, and there are many sentences that, in my opinion, wouldn't make good standalone example sentences anyway."
(tommy_san)


as a non-native speaker both of japanese and english, I can say that I would absolutely be in favour of your doing so. I have the impression that tatoeba lacks japanese speaking members who are reliable and willing to take action, and that this is the main reason why the japanese corpus is still so corrupted, so I think it would absolutely be worth paying the price of having some potentially salvageable sentences be lost if we can get a start on the cleaning up of the japanese corpus for it.

and if you ask me, any english tanaka sentence not yet adopted can be deleted right along with the japanese ones. "The outcries of the angels go unheard by ordinary human ears." may be a sentence you could potentially come across in poetry or lord of the rings style fiction, but without context on a site such as this, it just sounds silly to me.

I generally think we need to become bolder in cleaning up this page, even if that means that we lose some material that could maybe eventually at some point in time be useful. this goes for any language with a great number of bad sentences right now.

I'd rather have a change now, and then be able to build up from that with revised concepts.

wells, February 2, 2016 at 10:00:51 AM UTC

> "The outcries of the angels go unheard by ordinary human ears."

Not a great example though. There was a Tanaka contributor or three who input short phrases he/she had written in his/her notebook, along with their translations. None of them were full sentences, or were even conjugated to form a sentence. 天使の叫び → "Angel's cry out" (#125025, #278968) was one of those pairs. Someone here tried to salvage the Japanese to make a full sentence, which eventually led to the English translation that you quoted.

I'd prefer that the Japanese, if odd, be fixed or new translations supplied, but I understand that our most prolific contributor is rather antagonistic towards the idea. I don't blame the contributor -- it's a lot of sentences and a lot of work. We'd need a hundred active contributors working on the sentences to make a dent.

Anyway, anyone proposing the deletion of old unadopted Japanese sentences would first have to ask JimBreen well before enacting a plan of that sort. His dictionary is dependent on the indexing stored on Tatoeba and if you delete a sentence, the indexing goes as well, if I'm not mistaken.

Also, any deletion of a well-known language such as English will likely disperse the other translations based on it. For example, if you delete #63882, the Italian, Hebrew and Macedonian sentences will have no links and will no longer be grouped together.

pullnosemans, February 2, 2016 at 10:54:51 AM UTC

everything you say is true, and the problem of losing links might be the biggest issue to take care of if we want significant change, but as I said, I think we should suck it up and deal with these issues as best as we can rather than being like, "yeah, there's too many problems, we can't really do anything right now, let's talk about it again next year".

TRANG, February 6, 2016 at 12:01:01 PM UTC

> There are actually some (though not many) sentences that I find plainly wrong
> or clearly unnatural, and thus harmful.
> [...]
> If the community thinks it's better for me to delete these sentences, I can do so.

As wells pointed out, we need to take into consideration if the deletion of the sentence will affect Jim Breen's dictionary. My suggestion would be to send him your list of "not OK" and "unsure" sentences and let him delete them or change them.


> I also wonder if we could develop a set of good sentences that contributors of
> every language could consider translating.

As CK pointed out, we can probably make lists for this.

We could establish some convention for the names of the lists. For instance you could name your list "[translate:jpn] Translate Japanese sentences". Then on the "Translate sentences" page, we could add a section "Translate sentences from lists" which would be a search for all lists whose name contains "[translate:<lang>]".

We could add some more restrictions to make sure that only lists from trusted contributors are displayed.
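
A minimal sketch of how such a search could work, assuming the naming convention above. The function name, the `(name, owner)` tuples and the `trusted_owners` set are hypothetical, and the pattern assumes three-letter language codes such as "jpn".

```python
import re

# Assumed convention: a list name contains "[translate:<lang>]",
# where <lang> is a three-letter language code such as "jpn".
TRANSLATE_TAG = re.compile(r"\[translate:([a-z]{3})\]")


def translation_lists(lists, lang, trusted_owners):
    """Return the names of lists tagged for `lang` and owned by a trusted member.

    lists is an iterable of (name, owner) tuples.
    """
    selected = []
    for name, owner in lists:
        match = TRANSLATE_TAG.search(name)
        if match and match.group(1) == lang and owner in trusted_owners:
            selected.append(name)
    return selected
```

Filtering on the owner is one simple way to apply the restriction to trusted contributors; how "trusted" is determined is left open here.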

pullnosemans, February 2, 2016 at 7:49:07 AM UTC (edited February 2, 2016 at 7:54:33 AM UTC)

"I had discussed, a while ago, the case of the unadopted Japanese sentences and asked what if we simply delete them? The answer I got was basically that they are not harmful to the point that they should be deleted. Therefore in the case of Japanese, we will keep them.
But if, for another language, the community considers that the unadopted sentences are basically useless should be all deleted, I would not necessarily reject the idea."
(TRANG)

as for this, see my reply to tommy_san's post above. I generally think it's better to lose some not-completely-downright-useless material than to stay inactive and get no improvement to the current problematic situation in the corpora of some languages (english, french, japanese, korean come to my mind).


"So I don't think we could make it a general rule, to limit members to contribute only in their native language. This is something that should rather be decided case by case, for each user."
(TRANG)

I agree. I also think this could well be combined with the creation of a more clearly defined body of trusted/responsible users.


"I'm all up for having more corpus maintainers and advanced contributors. But how do we achieve this? And how do we make sure that we are promoting the right people?
At the moment we lack people who want to or can invest time and effort into building a stronger community of contributors. We probably also lack people who are motivated to take new responsibilities, or who have the right mindset, knowledge and skills to take these responsibilities.
[...] There are people who want to become corpus maintainers but are not ready for it. There are people who would be great corpus maintainers but don't want to take the responsibilities.
Maybe there are certain things that we're doing wrong and we could fix, to create a more engaged community. But I can't see this happening if we don't have very dedicated and active community admins."
(TRANG)

we could have a one-week long poll that is advertised on the front page where users can state that they are willing to become part of an engaged, clearly defined community of people who work to systematically improve this site, by clear guidelines and clearly divided responsibilities. anyone who wants to be a part of it and doesn't seem untrustworthy picks or is assigned a certain job (maybe more important jobs for people already known for their good work on the project), and their activity is watched by the other members of the team. if anyone performs badly or there are problems, it is discussed with them in a group, and if no consensus is found, they can lose their responsibility. this doesn't mean, however, that they cannot regain it or get another one. I think the best way to verify that you're "promoting" the right people is to have a certain fluidity in the system, and have everyone be aware that they are collaborating in a team.
I, for one, would be very happy to be a corpus maintainer for german, since improving the sentences already existing interests me more than creating new ones. I once asked for this, but was told that german already had enough maintainers, which was fine with me (though I thought, "can you have too many maintainers?").


"4. Money"
(TRANG)

for now, I would say that recruiting new contributors via ads (possibly in combination with the poll for creating a dedicated core of contributors as I described above) would be more beneficial than paying people for contributing. this site does run on an open source concept, after all. everything else can be pondered over when there actually are immediate prospects for getting a significant amount of funding.

TRANG, February 6, 2016 at 12:29:42 PM UTC

> we could have a one-week long poll that is advertised on the front page where
> users can state that they are willing to become part of an engaged, clearly
> defined community of people who work to systematically improve this site,
> by clear guidelines and clearly divided responsibilities.

I'm fine with the idea of having a poll but, from experience, I can say this is more of an ongoing task. We can of course definitely have some sort of "recruitment campaign", but it means someone has to invest time in organizing it.

With your idea for the poll,
- someone needs to create the poll
- someone needs to write the necessary documentation/guidelines
- someone needs to take care of welcoming and guiding all the people who will want to join

It's actually a lot more work than it looks, and it's definitely not something I can invest too much time on. If you want this to happen and feel you can organize this, you are more than welcome to do so :) Let me know.


> I, for one, would be very happy to be a corpus maintainer for german, since
> improving the sentences already existing interests me more than creating new
> ones. I once asked for this, but was told that german already had enough
> maintainers, which was fine with me (though I thought, "can you have too many
> maintainers?").

I haven't been monitoring how maintenance goes in the German corpus, but it seemed to me that German doesn't suffer too much from quality problems and there isn't a huge number of sentences to fix.

You can probably participate in improving existing sentences without being a corpus maintainer. What would you do as a corpus maintainer, that you cannot do right now, as an advanced contributor?

If it's only about getting things corrected faster, then we'll have to see how much there is to correct, and how slow we are. Having a hundred sentences that are waiting to be corrected for a month would still be rather fine to me. Having thousands of sentences that are waiting to be corrected for years is something else.