AlanF_US, January 3, 2023 at 11:53:21 PM UTC

I'm starting a new thread with reference to this one:

https://tatoeba.org/en/wall/sho...#message_39361

in order to focus on my point:

Adding large quantities of near-duplicate sentences degrades the quality of the Tatoeba corpus.

Issues not relevant to that point:

(1) whether those near-duplicate sentences are translated, and by whom (note that translating a large quantity of near-duplicate sentences would produce a large number of near-duplicate translations)
(2) any purpose outside Tatoeba (such as a GPS) for which those sentences are intended
(3) how many people speak the language in question
(4) whether near-duplicate sentences exist in other languages
(5) whether sentences exist that violate our civility guidelines
(6) the chain of events that led to Talwit leaving the project
(7) what has happened to Talwit's sentences so far
(8) what should happen to Talwit's sentences in the future
(9) the flag used for the Kabyle language
(10) whether Berber and Kabyle are separate languages
(11) whether members of Tatoeba think of them as separate languages
(12) whether we should have a cap on the rate of sentences added by a member
(13) what that cap should be
(14) how difficult it would be to code a capping mechanism

If anyone wants to explain how adding large quantities of near-duplicate sentences enhances the Tatoeba corpus, this is your opportunity.

Cangarejo, January 4, 2023 at 12:58:55 PM UTC

What’s your process for coming up with sentences?

AlanF_US, January 4, 2023 at 2:16:21 PM UTC

Personally speaking, I mostly write translations, but when I come up with original sentences, I usually write them around a vocabulary item from either Tatominer ( https://tatominer.netlify.app/eng.html ) or vocabulary requests ( https://tatoeba.org/en/vocabulary/add_sentences ).

brauchinet, January 4, 2023 at 6:57:59 PM UTC, edited January 4, 2023 at 7:00:50 PM UTC

> Adding large quantities of near-duplicate sentences degrades the quality of the Tatoeba corpus.
(12) "whether we should have a cap on the rate of sentences added by a member" is relevant to that point. It prevents users from adding large quantities of sentences whatsoever.
A limit is just a way of making clear that mass production of sentences is unwanted.
Yes, people can create multiple accounts, but this argument would be presupposing some "criminal energy" on their part.

AlanF_US, January 5, 2023 at 3:50:45 PM UTC, edited January 11, 2023 at 4:18:37 PM UTC

Caps belong to "solution space" -- the discussion of how to address a problem. I wanted to stick to a more fundamental question: Do people agree that adding large quantities of near-duplicate sentences is bad for the corpus? Discussion of whether we should institute a cap, and what it should be, and how it should work, is irrelevant to answering that basic question. And I think it's important to start with the basics, because if you agree that what you're doing is bad for the corpus, you need to reevaluate what you're doing. It's extremely easy to refuse to face the consequences, or to rationalize them away. And when the discussion turns to technical means of discouraging bad behavior, it's easy for people to slip into the mode of "Well, I'll just keep doing it until the site stops me." The fact is that we have few developers and a lack of infrastructure for deciding on whether and how to implement a change, so chances are that it will occur a long time from now, if ever. And as you and others have pointed out, if such a change were instituted, people could use multiple accounts to get around it, and then all that discussion and development time would have been thrown away.

People don't need "criminal energy" to sabotage the site. Ideology, or self-deception, or self-aggrandizement, or hypocrisy will do just fine. My belief is that it is important for people to confront the consequences of their actions, and the sooner they recognize whether or not they're following their own principles or achieving their goals in the best way, the better. And those aren't just lofty principles, either. For instance, people should realize that while adding near-duplicate sentences to Tatoeba might be a shortcut to collecting the data they need for a GPS, it's not the best way, and meanwhile the mind-numbing repetitiveness of their sentences will make people hope to never see another Kabyle sentence in their lives.

imalaqvayli, January 5, 2023 at 5:30:25 PM UTC

Hi @AlanF_US

I completely agree that "near-duplicate sentences" are not useful for the Tatoeba corpus. By the way, we have stopped adding "near-duplicate sentences", and Igider has also started working on the ones already integrated. Fixing the old duplicates for kab will take some time, but it's worth it.

Beyond this, I have another idea for improvement. A user can add a "near-duplicate sentence" without knowing it, since they can't check all the existing sentences. Do you know if it would be possible to run a "near-duplicate sentences" check when a user adds a new sentence? If the user were shown the possible "near-duplicate sentences", they could decide whether the new sentence is worth adding.
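
A minimal sketch of what such a check might look like, assuming the existing sentences are available as a plain list of strings. The function name `find_near_duplicates` and the similarity thresholds are hypothetical and chosen for illustration; nothing here reflects how Tatoeba itself is implemented. It compares sentences word by word using `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def find_near_duplicates(new_sentence, existing_sentences, threshold=0.8):
    """Return (sentence, score) pairs whose word-level similarity
    to new_sentence is at least `threshold` (hypothetical default)."""
    new_words = new_sentence.lower().split()
    matches = []
    for sentence in existing_sentences:
        words = sentence.lower().split()
        # ratio() is 2 * (matching words) / (total words in both sentences)
        score = SequenceMatcher(None, new_words, words).ratio()
        if score >= threshold:
            matches.append((sentence, score))
    # Most similar candidates first
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


existing = [
    "Tom denied it.",
    "Sami denied it.",
    "Boston is a beautiful city.",
]
# A lower threshold is used here because the example sentences are very short.
for sentence, score in find_near_duplicates("Ziri denied it.", existing, threshold=0.6):
    print(f"{score:.2f}  {sentence}")
```

A linear scan like this would probably be too slow for a corpus of millions of sentences, so a real check would likely need some kind of index. The sketch only illustrates the idea of showing candidates to the contributor and leaving the decision to them.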

AlanF_US, January 11, 2023 at 4:23:02 PM UTC

I'm glad to hear that you have stopped including near-duplicate sentences.

The rule of thumb is that whatever functionality does not currently exist on Tatoeba will either take a very long time to be implemented or never appear. The number of developers is small, the number of things that need to be fixed is huge, and the number of divergent opinions on the advisability of any particular measure is also large, meaning that discussion will take a long time and often not lead to a result.

The best way to avoid adding near-duplicate sentences is to make an attempt to include variety. I give an example here:

https://tatoeba.org/en/wall/sho...#message_39414

If your sentences are sufficiently varied, and not extremely simple, they probably won't be near-duplicates of existing sentences.

CK, January 13, 2023 at 3:22:06 AM UTC

A lot of us do the following, while there are several members who don't like this idea.

Wildcards Used to Help Avoid Too Many Near Duplicates
http://bit.ly/tatoebawildcards

Here are a few examples of the kinds of near duplicates that could be avoided using this method.

These are limited to a few kinds of proper nouns: people's names, city names, country names and language names.


** Proper Noun: Person's Name

[#9171210] Tom adopted a puppy. (CK)
[#11392303] Ziri adopted a puppy. (Amastan)

[#7785065] Tom became exhausted. (shekitten)
[#8830384] Skura became exhausted. (Amastan)

[#3150589] Tom didn't buy anything. (CK)
[#7268137] Sami didn't buy anything. (OsoHombre)

[#5062121] Tom doesn't think that's a good idea. (AlanF_US)
[#7135177] Sami doesn't think that's a good idea. (OsoHombre)

[#10119304] Tom ate the bug. (ddnktr)
[#7147707] Sami ate the bug. (OsoHombre)

[#2966807] Tom gave up smoking. (AlanF_US)
[#8794262] Skura gave up smoking. (Amastan)

[#8899999] Tom asked to speak to the manager. (shekitten)
[#7287227] Sami asked to speak to the manager. (OsoHombre)

[#7919112] Tom has a criminal history. (AlanF_US)
[#9388153] Yanni has a criminal history. (Amastan)

[#10553053] Tom was a very nice guy. (AlanF_US)
[#8003579] Mennad was a very nice guy. (OsoHombre)

[#1955120] Tom can't wait. (CK)
[#7156636] Sami can't wait. (OsoHombre)

[#2458647] When is Tom's birthday? (Hybrid)
[#7121120] When is Sami's birthday? (OsoHombre)

[#10247773] Tom accidentally set himself on fire. (ddnktr)
[#11190504] Ziri accidentally set himself on fire. (Amastan)

[#4382032] Tom cut the apple in two. (CK)
[#11266646] Ziri cut the apple in two. (Agestur)

[#6945900] Tom broke everything. (shekitten)
[#7245667] Sami broke everything. (OsoHombre)

[#10802492] Tom clung to a branch. (ddnktr)
[#6713011] Sami clung to a branch. (OsoHombre)

[#10109939] Tom didn't fight back. (ddnktr)
[#11177147] Ziri didn't fight back. (Amastan)

[#3448539] Tom works as an announcer on television. (AlanF_US)
[#9023943] Skura works as an announcer on television. (Amastan)

[#8235173] Tom bought something. (shekitten)
[#7199508] Sami bought something. (OsoHombre)
[#8054241] Mennad bought something. (OsoHombre)
[#10234984] Ziri bought something. (Amastan)
[#11422340] Rima bought something. (Agestur)

[#2236196] Tom denied it. (CK)
[#6698344] Sami denied it. (OsoHombre)
[#9797071] Yanni denied it. (Amastan)
[#10222066] Ziri denied it. (Amastan)
[#11385747] Rima denied it. (Agestur)

[#5149973] Tom denied this. (CK)
[#8795379] Skura denied this. (Amastan)
[#10222064] Ziri denied this. (Amastan)
[#11385729] Rima denied this. (Agestur)

[#2863468] Tom deserves it. (Amastan)
[#7148955] Sami deserves it. (OsoHombre)
[#10897449] Ziri deserves it. (Amastan)
[#11429964] Rima deserves it. (Agestur)

[#2236201] Tom did that. (CK)
[#6806928] Sami did that. (OsoHombre)
[#10026274] Yanni did that. (Amastan)
[#10208200] Ziri did that. (Amastan)
[#11287796] Rima did that. (Agestur)

[#2549619] Tom dialed 911. (CK)
[#10207372] Ziri dialed 911. (Amastan)
[#6462067] Sami dialled 911. (OsoHombre)
[#11413579] Rima dialled 911. (Agestur)
[#11413576] He dialled 911. (Agestur)
[#11413577] She dialled 911. (Agestur)
[#11413583] They dialled 911. (Agestur)


** Proper Noun: City Name

[#2806248] Tom was raised in Boston. (AlanF_US)
[#10222042] Ziri was raised in Algiers. (Amastan)

[#2045784] Boston is a beautiful city. (CK)
[#8104935] Algiers is a beautiful city. (Amastan)

[#4811756] Boston is a big city. (CK)
[#8396376] Algiers is a big city. (Amastan)

[#9817056] Boston is a fascinating city. (CK)
[#8580221] Algiers is a fascinating city. (Amastan)


** Proper Noun: Country Name

[#7192180] Do you feel safe in Australia? (CK)
[#8567684] Do you feel safe in Algeria? (Amastan)

[#7137416] Do you like Australia? (CK)
[#8512926] Do you like Algeria? (Amastan)

[#7192158] Do you live in Australia? (CK)
[#8417301] Do you live in Algeria? (Amastan)

[#7192156] Do you miss Australia? (CK)
[#8313843] Do you miss Algeria? (Amastan)


** Proper Noun: Language Name

[#9266614] I want to speak French fluently. (shekitten)
[#10132507] I want to speak Spanish fluently. (Ricardo14)

[#9284403] It's important to study French. (shekitten)
[#9284497] It's important to study Russian. (shekitten)

[#2451464] Do you have a French dictionary? (CK)
[#8314009] Do you have a Berber dictionary? (Amastan)

[#8410238] Does anybody speak French here? (CK)
[#7865035] Does anybody speak Berber here? (Amastan)

[#2451515] Does anyone here speak French? (CK)
[#6317360] Does anyone here speak Russian? (carlosalberto)
[#11066331] Does anyone here speak Portuguese? (sundown)

Thanuir, January 13, 2023 at 7:35:36 AM UTC

And as mentioned earlier, this is a bad idea: the names of cities, countries and languages vary from one language to another, different names change in different ways from one alphabet to another, and different names inflect differently in different languages. In addition, it is good if each language also contains the proper names that are typical for it, including from the point of view of language learners.

Yorwba, January 14, 2023 at 12:06:49 PM UTC

How often do you get the message that the sentence you were trying to add already existed? In other words, how often does always using the same names prevent you from adding a near-duplicate? And what do you do when that happens?

I only get that message when I'm adding a translation to an existing sentence that was already indirectly linked via a few corners to the same translation that I came up with; I don't think I've encountered it with an original sentence I thought up myself, but maybe that's because I haven't added all that many.

shekitten, January 5, 2023 at 5:11:45 PM UTC, edited January 5, 2023 at 5:14:54 PM UTC

> Adding large quantities of near-duplicate sentences degrades the quality of the Tatoeba corpus.

What standard is being applied here to determine the quality of the Tatoeba corpus? What are you basing this judgment of quality on? Is this just your opinion?

Polgar1, January 9, 2023 at 12:16:25 PM UTC

To my understanding, all of this is up for discussion, preferably based on reasoning that others can agree with, or at least understand.

shekitten, January 11, 2023 at 5:12:48 PM UTC, edited January 11, 2023 at 5:15:39 PM UTC

I'll take your lack of a response as evidence that this is purely your opinion.

So why do people have to take the opportunity to explain to you that near-duplicate sentences enhance the quality of the corpus, when you have no basis (other than your own beliefs) for claiming the opposite? And when you set preconditions excluding almost every argument for why they enhance the quality?

Selena777, January 14, 2023 at 7:22:08 PM UTC

I agree with you and I would like to support name diversity and free creation of sentences, including so-called "near-duplicates".

Here are the reasons: I use sentences from open-source projects to make audio files and text tables. Listening to and reading them helps me learn languages without digging into grammar rules. When you learn a word, it's important to learn its different forms, not only its main form. Introducing "I learnt Berber", "You learnt Berber", "She learnt Berber", etc. helps you understand the basics of grammar of your new language. That's natural for a human and helps you learn proper endings and inflection.

As for name diversity: finding just "Tom" and "Ziri" here and there while reading or listening to sentences is simply boring.

Yorwba, January 14, 2023 at 10:14:24 PM UTC

> Introducing "I learnt Berber", "You learnt Berber", "She learnt Berber", etc. helps you understand the basics of grammar of your new language.

Do you prefer a series of sentences that differ in only one word, or would

I learnt Berber.
You learnt that at school, I hope.
She learnt how to spell her name.

also be acceptable?

Selena777, January 15, 2023 at 3:26:34 PM UTC

>>Do you prefer a series of sentences that differ in only one word

It depends. For a completely new language with difficult grammar, a mix of sentences like "I learnt Berber", "I learnt English", "I learnt French", "He learnt Berber", "He learnt French", "She learnt German", "Ivan learnt Berber" would be my choice for understanding the inflection and learning the names of languages.

Of course, your sentences are also valuable and useful. Tatoeba is a collection, so it may consist of sentences of different sorts serving different aims. It's not a dictionary, whose main purpose is to illustrate the different meanings of words.

Thanuir, January 5, 2023 at 7:30:42 PM UTC

Translations of a single sentence have value, even if they resemble one another.

1. They show at a glance in how many different ways a given sentence can be translated.
2. They support diversity, for example with respect to the gender and the singular or plural number of personal pronouns.

In addition, they have both a positive and a negative effect on linking. Negative in that if someone translates one sentence, they will not necessarily link the similar sentences, since the user interface does not offer a good way to do so. Positive in that if, say, one person has translated a Swedish sentence into Finnish and another person a French one, and the two could share a common translation, that translation is more likely to be hit upon if both sentences have several translations.

Cangarejo, January 6, 2023 at 8:35:59 PM UTC

@Thanuir, the sentences being added at a large scale don’t have any translations.

Thanuir, January 7, 2023 at 8:16:29 AM UTC

I know. However, I understood this question as a general one, not as one concerning only those sentences.

Polgar1, January 9, 2023 at 12:26:27 PM UTC

I think this is agreeable, once we clarify two terms: "large quantity" and "near-duplicate".

If by "near-duplicate", we mean sentences that have basically the same meaning with basically the same grammar (e.g. you replace a name in an English sentence), I'd say basically any quantity of such "near-duplicates" is just noise. Even a third sentence that just substitutes one undeclined proper name for another is too much.

However, if it's a translatable word that is being substituted in, or there is some grammatical diversity between the words substituted in, I would be more lenient and therefore would rather focus on the "large quantity" part. It's okay if there is a sentence that illustrates the meaning and usage of several words in a certain context. The important thing is that 1. it shouldn't try to cover "all words" in the given context (because then the corpus is mixed with a boring dictionary) 2. if something can reasonably be deduced from an example, don't create analogous examples for other sentences.

Loosely, I could say that for me "large amount" means generating sentences based on a logic, either manually or programmatically.

lbdx, January 9, 2023 at 1:13:07 PM UTC

> generating sentences based on a logic, either manually or programmatically.

For my part, I call the "original" sentences added using this method "patterned sentences". Here are some of the many examples added today by Amastan/Agestur:

They learnt some Berber.
We learnt some Berber.
Rima and Skura learnt some Berber.
Ziri and Rima learnt some Berber.
Rima learnt some Berber.
She learnt some Berber.
He learnt some Berber.
I learnt some Berber.

Unfortunately, this prolific contributor seems to refuse to communicate with us on this Wall 😞

AlanF_US, January 11, 2023 at 3:54:26 PM UTC, edited January 11, 2023 at 4:11:53 PM UTC

My hope is that rather than adding these sentences:

They learnt some Berber.
We learnt some Berber.
Rima and Skura learnt some Berber.

a contributor would add these:

They learnt some Greek while they were on vacation.
We learnt some cooking skills by watching YouTube videos.
Rima and Skura learnt valuable life lessons the hard way.
Everything they learnt was wrong.
Our mother, we learnt later, had been saving money in a separate account.

Regarding your last remark: If you have trouble getting a contributor to communicate on the Wall, try sending them a private message. In many situations, private messages are better in the first place.

morbrorper, January 12, 2023 at 10:11:06 AM UTC

I think the opposition to near-duplicate sentences can be explained by the term "noise-to-signal ratio"; I hope we can agree that we want sentences that increase the signal level and decrease the level of "noise".

Using this criterion, I would say that the "some Berber" sentences are increasing the noise level, as opposed to AlanF_US's suggestions. Imagine a learning tool that takes its corpus from Tatoeba; it will have to filter out all these repetitive sentences as "noise" in order to be useful.
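
As a rough illustration of the kind of filtering such a tool might need, here is a minimal sketch. It assumes the tool sees the sentences as a plain list of strings, and it uses one simple, hypothetical rule: a sentence is dropped if it has the same number of words as an already kept sentence and differs from it in at most one word. The function name `filter_patterned` is invented for this example.

```python
def filter_patterned(sentences):
    """Keep a sentence only if no already kept sentence matches it
    once a single word position is masked out."""
    seen_keys = set()
    kept = []
    for sentence in sentences:
        words = sentence.lower().rstrip(".?!").split()
        # One key per word position, with that position replaced by "_"
        keys = {
            " ".join(words[:i] + ["_"] + words[i + 1:])
            for i in range(len(words))
        }
        if keys & seen_keys:
            continue  # same length as a kept sentence, at most one word differs
        seen_keys |= keys
        kept.append(sentence)
    return kept


sample = [
    "They learnt some Berber.",
    "We learnt some Berber.",
    "Rima and Skura learnt some Berber.",
    "She learnt some Berber.",
    "Everything they learnt was wrong.",
]
for sentence in filter_patterned(sample):
    print(sentence)
```

With this rule, "We learnt some Berber." and "She learnt some Berber." are dropped as variants of the first sentence, while the longer "Rima and Skura" sentence survives because it has a different word count. A real tool would probably want a more forgiving comparison, which is part of the cost that repetitive sentences impose on downstream users.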

lbdx, January 10, 2023 at 8:00:04 PM UTC, edited January 11, 2023 at 4:28:59 PM UTC

** off-topic post deleted by the author **

Cabo, January 10, 2023 at 8:15:58 PM UTC

Ask it again, but before that ask it to define the difference between data and information.
And then also ask it to tell you what those near-duplicates are: data or information?

lbdx, January 10, 2023 at 8:48:11 PM UTC, edited January 11, 2023 at 4:29:12 PM UTC

** off-topic post deleted by the author **

AlanF_US, January 11, 2023 at 3:47:34 PM UTC, edited January 11, 2023 at 3:48:08 PM UTC

Entertaining though it might be to see what AI-generated text has to say, I hope this is not the wave of the future. It's hard enough to find time to participate in a conversation with humans. Sorting through AI-generated text makes things even more difficult. Using an AI engine to come up with ideas on your side is one thing, but asking a reader to guess which parts of an automatically generated wall of text you actually stand behind, or have written yourself, is another.

Often, AI-generated text doesn't apply to the specifics of the situation. For instance, item 2 ("Variety of context: Even though the sentences are similar, they could have different context and this way, it can give users a better understanding of how the language is used in different situations and situations") doesn't apply to Tatoeba. The whole point is that sentences here appear without external context, and a set of near-duplicate sentences fails to provide any more internal context beyond what exists in any one of those sentences. Thus, people are unable to reason about what variety of contexts might be valid for a particular usage.

lbdx, January 11, 2023 at 4:04:04 PM UTC

@AlanF_US Sorry, but I was very impressed with the result and thought other members would be curious as well. I will delete everything.

AlanF_US, January 11, 2023 at 4:13:59 PM UTC, edited January 11, 2023 at 4:14:59 PM UTC

I didn't expect you to delete your posts, but I appreciate your openness to and respect for my viewpoint.

Cangarejo, January 12, 2023 at 1:58:26 PM UTC

How is Tatoeba doing in terms of storage space?

DJ_Saidez, January 13, 2023 at 5:54:29 AM UTC

I wonder too. Possibly this might have an impact on how Tatoeba has been crashing a lot recently for me.

Yorwba, January 14, 2023 at 12:28:54 PM UTC

We actually ran out of disk space on 2022-12-10 and had to delete some files that were no longer needed, but that was due to inefficient usage of the space we did have available. We're currently using about 80 GiB of storage, so there's lots of headroom for growth before we even get close to the limit of what can fit into a single server.

When you see the "Tatoeba is currently unavailable" page intermittently, that's more likely due to rate limiting (if you request more than one page per second, the server will try to slow you down) or heavy load on the server when many users try to access it simultaneously.
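
To make the "one page per second" rule concrete, here is a minimal sketch of a per-client limiter that delays requests instead of rejecting them. It is an illustration only: the class name `OnePerSecondLimiter`, the client identifier, and the delaying strategy are assumptions, and the thread does not say how the Tatoeba server actually implements its rate limiting.

```python
class OnePerSecondLimiter:
    """Spaces out each client's requests by at least `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        # client id -> earliest time at which that client's next request may run
        self.next_allowed = {}

    def delay_for(self, client_id, now):
        """Return how many seconds the request arriving at `now` should wait."""
        start = max(now, self.next_allowed.get(client_id, now))
        self.next_allowed[client_id] = start + self.min_interval
        return start - now


limiter = OnePerSecondLimiter()
# Three quick requests from the same (hypothetical) client, then one from another.
print(limiter.delay_for("client-a", now=0.0))
print(limiter.delay_for("client-a", now=0.5))
print(limiter.delay_for("client-a", now=0.5))
print(limiter.delay_for("client-b", now=0.5))
```

The first request from a client is served immediately, and each further request inside the same second is pushed back so that requests run one second apart. A client that waits between page loads never sees a delay.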