** Near-duplicate sentences **
In Tatoeba we generally don't mind having near-duplicate sentences. We understand that in some circumstances, they are quite useful. For instance, "She loves listening to music" and "He loves listening to music" are near-duplicate sentences, but they are useful in case they are translations of a sentence that doesn't have a clear indication of the gender of the subject.
However, that doesn't mean we want to have too many of them. While it is difficult to define what exactly is "too many", I've been concerned about the growth of near-duplicates in our corpus.
I'm wondering: are there a lot of people who have been annoyed, or are starting to be annoyed, by near-duplicate sentences in Tatoeba and feel that they are a nuisance to the corpus? If you are one of these people, I'm curious to know how exactly they impact you.
I'm also curious to know what people feel are bad or good practices when it comes to creating near-duplicate sentences.
- Have you ever found yourself in situations where you felt that creating near-duplicates was the right thing to do? How did you decide it was the right thing to do?
- Have you ever seen people create near-duplicates and you felt that it was wrong? Why did you feel it was wrong?
Please share your thoughts :) And feel free to send me a private message in case you're feeling a bit too shy for the Wall.
When translating from Spanish to English, I get plenty of ambiguous sentences where I could translate with either a he/him subject or a she/her subject.
At first in keeping with CK’s practice of using Tom, a male character, by default, and later merely for my own convenience, I translate these ambiguous sentences only with a he/him version unless it makes sense to do otherwise, although both versions would be correct translations.
I’m not particularly bothered by near-duplicate sentences honestly. If it’s done maliciously, that’s annoying, but if it’s unintentional or serves an actual purpose, that’s fine for me.
Also, something I want to mention:
The argument that near-duplicates showing different grammatical constructs of the same message (i.e. She instead of He) are unnecessary because students will already know enough to convert for themselves is a narrow-minded perspective in my opinion, because it caters to one very specific audience. That audience will be able to use the sentences that way, but what about all the other users who aren’t in the same class?
I know this system isn’t really hard to adjust to by any means, I just don’t see the need to tailor your activity on this project only to the standards of a sub-project, and constantly encourage/suggest that others follow the same standards of this sub-project.
Edit: change of wording
> The argument that near-duplicates showing different grammatical constructs of the same message (i.e. She instead of He) are unnecessary because students will already know enough to convert for themselves is a narrow-minded perspective in my opinion, because it caters to one very specific audience.
It's also very English-centric. It's trivial to convert He -> She in English, but not in Hebrew or Arabic where even verbs take different forms depending on gender.
> I know this system isn’t really hard to adjust to by any means, I just don’t see the need to tailor your activity on this project only to the standards of a sub-project, and insist that others follow the same standards of this sub-project.
> It's also very English-centric. It's trivial to convert He -> She in English, but not in Hebrew or Arabic where even verbs take different forms depending on gender.
Converting sentences that use "he" to ones that use "she" may be nontrivial in Hebrew, but it's not particularly complicated, either. Learning a handful of forms takes a finite, relatively small amount of time compared to learning vocabulary, where words number in the tens or hundreds of thousands. That's where Tatoeba could shine, and could serve a purpose that other sites don't. Expanding all sentences according to simple paradigms (person, number, gender) caters to absolute beginners, who would learn better at a place that offers a guided path, while making it harder for everyone else to find the variety they need. Sure, there's a place for it, but it doesn't need to be done everywhere, all the time. And once a person learns that "anyone" is synonymous with "anybody", the hundreds of pairs of sentences that differ only in whether they use one word or the other (clearly added to pad the owner's number of total sentences) subtract value rather than adding it. The same goes for pairs that differ only in whether "that" is included or omitted when it introduces a clause.
Just to be clear, we're talking about sets of sentences like these:
I can't remember all my passwords, so I write them all down in a notebook that I keep hidden where I don't think that anybody can find it.
I can't remember all my passwords, so I write them all down in a notebook I keep hidden where I don't think anyone can find it.
I can't remember all my passwords, so I write them all down in a notebook I keep hidden where I don't think anybody can find it.
I can't remember all my passwords, so I write them all down in a notebook I keep hidden where I don't think that anyone can find it.
I can't remember all my passwords, so I write them all down in a notebook that I keep hidden where I don't think anybody can find it.
I can't remember all my passwords, so I write them all down in a notebook that I keep hidden where I don't think that anyone can find it.
I can't remember all my passwords, so I write them all down in a notebook that I keep hidden where I don't think anyone can find it.
I don't think that there's anybody in the world who's done what I've done.
I don't think there's anyone in the world who's done what I've done.
I don't think that there's anyone in the world who's done what I've done.
I don't think there's anybody in the world who's done what I've done.
I don't think that anybody really thought that Tom was as rich as he said he was.
I don't think anybody really thought that Tom was as rich as he said he was.
I don't think anyone really thought that Tom was as rich as he said he was.
I don't think that anyone really thought that Tom was as rich as he said he was.
I don't think anyone really thought Tom was as rich as he said he was.
I don't think that anyone really thought Tom was as rich as he said he was.
> The argument that near-duplicates showing different grammatical
> constructs of the same message (i.e. She instead of He) are unnecessary
> because students will already know enough to convert for themselves, is a
> narrow-minded perspective
I've never heard that argument :) Just to be clear, when I gave the example of having near-duplicates with He/She, it was just to illustrate a case where it's been long agreed to be useful. It was not to try negating arguments from anyone who might think such near-duplicates are unnecessary.
The issue is not about the near-duplicates themselves but the quantity of near-duplicates.
As a thought experiment, let's imagine that someone takes every existing sentence containing "he", copies it and replaces "he" with "she" (you can also imagine other pairs of words such as this/that or everyone/everybody). Or let's imagine that someone uses the pattern "I live in XXX" and creates tens of thousands of sentences where they just replace XXX with every possible place name they can think of (countries, cities, regions, districts, etc). I don't expect the community to be very thrilled about that.
No one went to such lengths, thankfully, but there have been patterns of contributions that have been heading towards such extremes. What I'm trying to figure out is: where should we draw the line? How do we figure out what's a healthy amount of near-duplicates for Tatoeba? What can we do to maintain a healthy amount of near-duplicates?
Yeah, I don’t think that we should encourage an infinite number of near-duplicates (like using “I live in XXX” and replacing XXX with every possible place you can think of, which could just as well be generated by a robot), since it reduces the variety of sentence structures and possible messages.
One exception IMO is if it would be to fulfill an actual need, for example the developers of a Kabyle GPS app who need audio for sentences talking about directions and information across countless places throughout the entire region, although it could be argued whether that could be done elsewhere.
But I think it’s important to show the different ways that something could be expressed, because you can’t expect every speaker of a certain language to express a specific idea the exact same way. Take German or Turkish, for example, where word order isn’t very strict. If we only add one or two translations based on an original English sentence, for example, and a learner believes those are the only ones they need, they might be surprised if native speakers they encounter use options they didn’t find on here, and which they might assume to be wrong or just feel discouraged at the confusion.
As shekitten mentioned, this is very English-centric, and the argument that there’s no need to demonstrate every single construct (ie. inflections in other languages based on different terms) here if it doesn’t use the wildcards because learners will learn that elsewhere or would’ve already known them, honestly decreases the potential of the Tatoeba corpus and is possibly even counter-productive to what I think the mission is.
Also, having a FEW places other than Boston used as examples could be helpful to show how different-sounding place names are inflected in different languages, and also to make it less monotonous. But not if it detracts from making new original constructs, or if it leads to its own monotony.
As long as they're really used and they sound natural, I would even encourage them. In some languages the sentence structure is very free, or depending on the word order the stress can be put on one word or another, and I consider showing that important for the learners. The same goes for the fact that "you" as subject has at least 6 possible translations in Spanish, so why choose one and not the other when all would be used differently depending on the context or the user?
I rarely add near-duplicates myself because it's a lot of work to add all possible alternatives but I don't see the problem with them.
I really believe that the so-called "near-duplicate sentences" play a brilliant role in Tatoeba most of the time.
As Shishir said, in Portuguese, for example, "you" can be translated as "você", "vocês", "tu", "te", "lhe", "lhes" and in several other ways.
In Greek, the proper name "Marina" would be transliterated differently from "Betty", which in turn would be transliterated differently from "Rocío", and so on.
One thing that worries me is a legitimate sentence being intentionally swapped for one that has been "standardized". The result is that we see a bunch of "semi-identical" sentences because someone is too afraid of "near-duplicate" sentences.
One point that Shishir also touched on: once a sentence sounds natural, it is already valid on Tatoeba. We should remember that and use it wisely. :)
> so why choose one and not the other when all would be used differently
> depending on the context or the user?
Because if we push this too far, it can have negative impacts. Alan and maaster have illustrated it in their replies. Of course not everyone may be as affected as they are by the current amount of near-duplicates, and perhaps for the large majority of Tatoeba users there's no issue right now. But it can still grow into a wider issue.
If we take the case of "you" that can be translated in 6 different ways in Spanish, do you think it would be a better strategy to alternate the way you translate "you" from one sentence to another, rather than translate in all possible ways on every sentence? I can imagine that's more or less what you are already doing anyway, as you've said that you rarely add near-duplicates.
As far as I'm concerned, if I searched for sentences containing "you are" translated into Spanish, I wouldn't need every sentence to have 6 translations in Spanish to represent the 6 different ways "you" can be translated. It would be enough that each sentence only has one Spanish translation. As long as the results I'm viewing show all the different ways that "you are" can be translated, I would be able to guess that "you are" can be translated in various ways into Spanish. And therefore, if you spent time adding 6 translations to all the sentences with "you are", it would be somewhat wasted time.
Generally speaking, illustrating any feature of a language doesn't need to be done through near-duplicates. It can be done through the repetition of a pattern. Near-duplicates are just the path of least effort when it comes to generating patterns, but they generate extra clutter and inflate the size of the corpus without necessarily adding a whole lot of value.
In that sense, would it be fair to have, in our contribution guidelines, a statement asking contributors to avoid creating near-duplicates intentionally? Which doesn't mean we must stop near-duplicates completely, but just let them grow passively. That is, just wait till they are created unintentionally rather than actively creating them.
Actually what I do and what most people seem to do is take one of the six possibilities and stick to it. In my case, I would translate "you are clever" as "eres listo" (masculine, singular, informal and with omitted subject), and if it's obviously plural I'd use the informal form and also masculine. And this is what I'm afraid of: if we stop people from adding near duplicates, we might have no (or very few) representations of the other possibilities. Of course, things like this might cross the line even for me https://tatoeba.org/es/sentence...moschee%22&to=
Shishir, you've described how you (and other people) currently do things, namely by always translating to a particular form (masculine, singular, informal), which also happens to be the base form selected by dictionaries. While I personally find this preferable to always translating to multiple forms, I agree with you that it can lead to underrepresentation of those other forms.
So Trang has asked whether we can modify our behavior slightly, randomizing things a bit, so that you might translate "you are clever" as "eres listo", but then translate "you are foolish" as "ustedes son tontas" (feminine, plural, subject included). This admittedly takes a bit more mental energy, but only a tiny bit. Then we'd avoid the underrepresentation problem, and also make the corpus more interesting. Would this be something that you would consider?
Trang is not talking about *stopping* people from adding near-duplicates, but about encouraging them to add sentences that differ more substantially. She is definitely not talking about adding 120 sentences along the lines of "We are going to build a mosque in <country>." For instance, one could instead write "We are going to build a mosque in Belgium", "They are going to build a church in Turkey", "My ancestors built a synagogue in India", "The Buddhists rebuilt a temple in China." And then stop there and move to a new pattern. That way, there's a little bit of commonality, which makes the sentences easier to write than starting with a brand-new idea every time, but there's also some variation.
I think you misunderstood my point.
1- I rarely translate from English, but from Turkish, German, Polish and Chinese mainly. That was just an example so that everybody can follow my point.
2- If you search for any sentence in English that starts with “you” and its translations into Spanish, you won’t find many in which it’s not translated with the “tú” version unless it obviously refers to more than one person. So it’s not that *I* should randomize my choice of subjects but that *most Spanish speakers* should.
3- I think this would go against the “If you feel there are several possible translations, you can add several translations in the same language. ” message in the translating page (https://tatoeba.org/en/activiti...te_sentences), since when most new (and not so new) members try to add several possibilities, these tend to be near-duplicates. So that message would have to be changed or removed, too.
4- I think you’re seeing things from an advanced learner point of view and not from a beginner point of view, and I thought Tatoeba was a place for both. For a beginner, as Selena said, the more examples, the better.
5- In some cases the grammatical use in different countries might be different, and what looks completely correct and natural to me might look wrong to someone from Latin America and vice versa, so having only one of these options would make this sentence useless for someone who is learning the other variety. If we follow this “no near duplicates” rule/suggestion, it would all depend on where the most active translators are from.
6- In Spain we have “leísmo” (using indirect object pronouns for direct objects) which, despite being accepted, could be confusing for learners who don’t know whether a verb takes a direct or an indirect object. If they see both options, however, they can figure out that it’s “leísmo” and that the verb takes a direct object. (I saw him – yo le vi / yo lo vi).
7- I didn’t say that Trang wanted those 120 sentences that I sent as an example, I said that in that case I do agree with her. Maybe my point is that we shouldn’t add near duplicates as *original sentences* but I consider it enriching to have several translations.
Thanks for the explanation, Shishir. The “leísmo” example is especially valuable. I have some thoughts in response, but rather than continue to scatter my comments, I think that I'll put them in one place.
I found near-duplicate sentences rather useful for beginners, and would like to see more of them, especially with different country/city names.
Now I'm making audio files based on Tatoeba's sentences, and hearing "Tom" and "Mary" here and there is quite annoying, so I need a programmatic solution to replace them with random names. Fortunately, in German it's trivial (from a grammatical point of view), but if names are declined in a particular language it's trickier to do.
On the other hand, I understand that near-duplicates could be annoying for advanced learners, so they should be able to skip those sentences while searching.
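For the name replacement mentioned above, something along these lines might be a starting point. It is only a rough sketch: the name lists and the `replace_names` function are made up for illustration, and it only covers the easy case where names are not inflected. It deliberately leaves forms like "Toms" alone, since it matches whole words only.

```python
import random
import re

# Hypothetical name pools for illustration; any list of names would do.
MALE_NAMES = ["Lukas", "Jonas", "Felix"]
FEMALE_NAMES = ["Anna", "Lena", "Clara"]


def replace_names(sentence, rng=random):
    """Swap the whole words "Tom" and "Mary" for randomly chosen names.

    One name is picked per sentence, so a sentence that mentions Tom
    twice still refers to the same person after replacement.
    """
    male = rng.choice(MALE_NAMES)
    female = rng.choice(FEMALE_NAMES)
    # \b keeps words like "Tomate" or the genitive "Toms" untouched.
    sentence = re.sub(r"\bTom\b", male, sentence)
    sentence = re.sub(r"\bMary\b", female, sentence)
    return sentence


print(replace_names("Tom und Mary sind hier."))
```

For languages where names take case endings, a plain substitution like this would not be enough, and each inflected form would need its own handling.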
There are many kinds of near-duplicate sentences, and their generation, particularly on an industrial scale, greatly reduces the usefulness of the Tatoeba Corpus. There's nothing wrong with creating some near-duplicates. There are many problems caused by generating near-duplicates at every possible opportunity. They cause issues both here and with the people who use our corpus downstream, who need to work harder to find interesting sentences that are not just near-clones of other sentences, that have unique features of their own.
Near-duplicates:
- are boring
- make it harder to find and link indirect translations, since direct translations are scattered over a greater number of individual nearly identical sentences
- serve to encourage the generation of more near-duplicates, exacerbating whatever problems they cause
- tend not to produce interesting, illuminating links between words in the same sentence, since such sentences cannot be written via mass assembly
And I disagree with the idea that there is anything English-centric about labeling these near-duplicates a problem. They're a problem in every language.
Just to make sure we're talking about a variety of kinds of near-duplicates, I've seen the following:
- near-synonyms: "everybody/everyone", "anybody/anyone"
- pairs where "that" is present in one, but absent in the other ("I know you're mad", "I know that you're mad")
- expansion across person and number ("They see me", "They see you", "They see him", "They see her")
- replacement of place names in otherwise identical sentences ("We went to Boston", "We went to Paris")
- sentences of the form "Tom is <adjective>", which tell the reader nothing about the adjective
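To make the first two kinds in this list concrete, here is a minimal sketch of how such variants could be grouped together. The function names and the synonym table are assumptions made up for illustration, not anything that exists in Tatoeba, and dropping every "that" is a deliberately crude shortcut that would also remove demonstrative uses.

```python
import re
from collections import defaultdict

# Assumed equivalences, taken from the examples in this thread.
SYNONYMS = {"anybody": "anyone", "everybody": "everyone"}


def normalize(sentence):
    """Reduce a sentence to a crude key so that trivial variants collide."""
    words = re.findall(r"[a-z']+", sentence.lower())
    words = [SYNONYMS.get(w, w) for w in words]
    # Crude assumption: drop every "that", even when it is not optional.
    return " ".join(w for w in words if w != "that")


def group_near_duplicates(sentences):
    """Return the groups of sentences that share a normalized key."""
    groups = defaultdict(list)
    for s in sentences:
        groups[normalize(s)].append(s)
    return [g for g in groups.values() if len(g) > 1]


sentences = [
    "I don't think that there's anybody in the world who's done what I've done.",
    "I don't think there's anyone in the world who's done what I've done.",
    "I don't think that there's anyone in the world who's done what I've done.",
    "I don't think there's anybody in the world who's done what I've done.",
    "We went to Boston.",
]
for group in group_near_duplicates(sentences):
    print(len(group), "variants of:", group[0])
```

The other kinds in the list (person and number, place names, "Tom is <adjective>") would need more than word substitution, so they are left out of this sketch.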
The whole thing about Tatoeba is that it is not a textbook, where there's a purpose to showing, say, the full range of pronouns, or the full range of past-tense forms, on a single page, in a single chart. It's a way to find out how language is used, in all its richness. There are plenty of other places where one can find conjugation paradigms. And those are better places for people to go if they want to learn them. We should focus on providing an interesting, varied collection of sentences.
Most likely, near-duplicate sentences are short sentences. (John is a person. Fatima is a person. Jean-Francois is a person.) The question is: Do we still need dozens of short sentences? Yes. We can more easily add five billion sentences in this way; the Tatoeba-Corpus, English-Corpus etc. will be fu*king big.
No. Thousands of those sentences are boring and in fact useless.
Now, I find three sentences in a hundred worth translating. Nonetheless, it takes too much time to find them.
There are plenty of languages where the "trivial" change from he to she (or it!) causes many important grammatical changes in a sentence and practicing all of these is vital to learning a language correctly.
Many people understand that in the Latin languages adjectives and some past participles must match the gender (and number) of the subject.
How many know that in Welsh the mutation applied to the possessed noun depends on the gender of the owner? These so-called trivial changes are very important and need to be practiced.
Dw i'n gweld ei gath - I see his cat
Dw i'n gweld ei chath - I see her cat
A trivial change in English, but not in Welsh.
I don't speak any Slavic languages, but I have been told that there are more than three genders and that these affect more than adjectives, so again the opportunity to practice is necessary.
I'm with Shishir and Saidez on this one. "Near-identical translations are welcome but not welcome if there is a lot" doesn't really sit with the message that members first see: namely that you can add as many translations as you feel like.
Mind you: I personally wouldn't even bother copy-pasting over dozens or even hundreds of sentences, however, I don't agree with the sole idea that correct sentences can make a corpus worse. If there is an issue with those sentences, it's solely a technical issue! We have labelling, we have the count of direct and indirect translations, we have the history of the sentences. Once one has a simple way to organize sentences, they simply cannot make anything worse.
Keep in mind that there are very different kinds of users on Tatoeba. I came from Clozemaster (probably not the only one) and I know someone who started to use the corpus for building a similar app (mostly for personal use). Of course they are facing issues (e.g. because of multi-language sentences and niche stuff like that) but that's their issue to deal with. The good thing about Tatoeba is providing the framework and giving the opportunity to sort out what one possibly doesn't need, not policing the content for some preset goal.
Also, I feel that as much as it is good to have a summary of thoughts here on the forum, we could make use of the new Discord server in these scenarios, to facilitate understanding each other, so that fewer posts go back and forth that only make this wall harder to read. Here is the invite link again: https://discord.gg/y3QwKdZ3PV
Yeah, the chat structure of Discord could possibly speed up these discussions, and then we can post the summary of our discussions here or something.
Not sure about others, but this topic here is typically the kind of topic that I would rather discuss in a forum than in a chat. It's not a discussion that needs speed as much as it needs good insights.
In a chat, I would feel pressured to provide more immediate responses, at the cost of them being incomplete or not very well thought out.
I don't want to discourage anyone from using a more instantaneous form of communication though. If that works better for others, by all means, go for it. But please, do post a summary for those who don't have a Discord account or don't feel like creating one just for Tatoeba :)
Please allow me to give you another perspective :)
> "Near-identical translations are welcome but not welcome if there is a lot"
> doesn't really sit with the message that members first see: namely that
> you can add as many translations as you feel like.
That's true, but this message is not set in stone. We can always readjust it to be more in line with the idea that one shouldn't go overboard with near-duplicates. Something along these lines: "If you hesitate between several translations, know that you can add them all. However, there is no need to add every possible translation you can think of to every sentence you translate. When there are several ways to translate a certain pattern, we generally recommend choosing just one translation but alternating the way you translate this pattern from one sentence to another."
> I don't agree with the sole idea that correct sentences can make a
> corpus worse.
Correct sentences won't make a corpus worse by your standards, but they can still be an issue. If you stay around long enough, you will for sure face a situation one day where you feel that someone is contributing in a way that just doesn't feel right to you.
Typically, we've had instances with people adding large amounts of bot-generated sentences. They were correct, but didn't sound very natural. A lot of members complained about that. We've also had instances where people added personal attacks towards other members, or insulting sentences towards political figures. Again, they were grammatically correct, but they caused issues. Same for pornographic sentences. Same for copyrighted sentences. All correct, but all problematic.
> If there is an issue with those sentences, it's solely a technical issue!
I very much agree that it all boils down to technical issues, but until technical solutions are implemented, massively adding problematic sentences would be quite impolite, to say the least.
The most respectful thing to do would be to first wait for stable technical solutions to be implemented, and only then go ahead and add all the sentences one wants that used to be problematic but no longer are.
> The good thing about Tatoeba is providing the framework and giving the
> opportunity to sort out what one possibly doesn't need, not policing the
> content for some preset goal.
I think it should definitely be Tatoeba's goal to be a platform where anyone can contribute all kinds of sentences with no limitations (or as few as possible), but we're still far from it.
The reality about Tatoeba today is that it doesn't provide a full-fledged set of features for people to sort out what they possibly don't need. For all I know, it could take another ten years till we get there and during this time we cannot operate as if the necessary features were going to be rolled out tomorrow.
That being said, if someone wants to work on a technical solution to help users filter out near-duplicates, I have to remind everyone that Tatoeba is an open source project and we're always more than happy to receive pull requests :)
And I want to emphasize that I'm not trying to push for a policy that would ask people to avoid adding near-duplicates. I'm not saying that such a policy would be a great idea but I'm suggesting it because it is an idea that I think is worth exploring (and to be honest, I don't really have any better idea). If anything, it can help identify more clearly situations where near-duplicates are very much appreciated and situations where they cross the line.
Thank you for your response. There are things we agree about but there are also things where we are set to disagree.
Bot-generated sentences, controversial sentences and copyrighted content are all completely different issues. I don't think any of them should be strawmanned into very similar sentences written in good faith. I personally think even controversial sentences could be allowed to a greater extent and surely, I doubt by creating Tatoeba, your goal was to collect sentences from people to suit your personal needs...
Indeed, I personally hope I can contribute to the project but I'm not quite there. Neither with my understanding of the project, nor with my resources. Still, what one can do with ease, if there is a need, is to filter out extra translations beyond a certain number. Then there can be policies: max n random translations, first max n translations, max n translations with the least common words etc.
I think what we can all agree on is that bot-generated sentences aren't what Tatoeba should be about. Near-identical sentences are likely to remain a complex and divisive topic, on the other hand.
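To illustrate what I mean by such policies, here is a rough sketch. The function name, the policy names and the way "least common words" is measured (average word frequency within the translations of that one sentence, rather than across the whole corpus) are all assumptions for the sake of the example.

```python
import random
from collections import Counter


def cap_translations(translations, n, policy="first", rng=random):
    """Keep at most n translations of one sentence, according to a policy."""
    translations = list(translations)
    if len(translations) <= n:
        return translations
    if policy == "first":
        # Keep the n translations that come first in the list.
        return translations[:n]
    if policy == "random":
        # Keep n translations picked at random.
        return rng.sample(translations, n)
    if policy == "rarest_words":
        # Assumption: rarity is measured only within this set of translations.
        counts = Counter(w for t in translations for w in t.lower().split())

        def commonness(t):
            words = t.lower().split()
            return sum(counts[w] for w in words) / max(len(words), 1)

        # Lowest average frequency first, i.e. the least common words.
        return sorted(translations, key=commonness)[:n]
    raise ValueError(f"unknown policy: {policy}")


spanish = ["eres listo", "sos listo", "usted es listo", "ustedes son listos"]
print(cap_translations(spanish, 2, policy="first"))
print(cap_translations(spanish, 2, policy="rarest_words"))
```

A filter like this would run on the side of whoever consumes the corpus, so nothing would have to be removed from Tatoeba itself.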