I have 2 questions for contributors who translate from English into their languages.
1. Do you think that it isn't useful for me to contribute English sentences with the same meanings and that I should stop doing this?
2. If you like the idea of having various English sentences with the same meaning, do you think that sentences that are interchangeable should be directly linked to each other?
◼ ◼ Question 1 Details
It's been suggested by two people that it isn't useful for me to contribute English sentences with the same meaning.
[#8355426] Tom bought Mary a dress. (CK) *audio*
[#8355425] Tom bought a dress for Mary. (CK) *audio*
[#8589486] Tom has brains. (CK) *audio*
[#1024870] Tom is smart. (CK) *audio*
[#1024985] Tom is intelligent. (CK) *audio*
[#1025488] Tom has a good head on his shoulders. (CK) *audio*
... and many others. Some are word order differences, some are vocabulary differences, and some have additional words for clarity. In many cases, the different versions of a sentence are very close in frequency of use.
◼ I've already asked a few members who have translated such sentences this question. Here are their replies.
I am not offended by that at all and I think that it is entirely ok and even useful.
It is always good for learners to know all the variants of the same sentence, if possible: with or without a conjunction, using different word orders/patterns, etc.
I do the same at translating, albeit I don't add all the possibilities every time.
> I wonder if you, too, think this isn't useful and that I should stop doing this.
On the contrary, I find adding alternative translations useful; not only for learners, but also for native speakers. It's one of my favorite things on Tatoeba.
Well, English has a lot of this: "I think (that)", "I had" / "I would" / "I'd", etc.
I think it's useful to have correct translations, because I never know where to place the object when using those darned phrasal verbs. I can understand however that it's irritating to translate several times the same thing, but I think that's a problem inherent to Tatoeba. Also, maybe they had this reaction because you added them as original sentences, but if they were added as translations their reaction would be different? I don't know. I guess arguments can be given for both sides.
I think that's extremely useful. That's one of the things that distinguishes Tatoeba from other bilingual or multilingual corpora.
Here are a lot of Danish translations of the sentence "Tom was very drunk.":
Two sentences with the same meaning are a kind of paraphrase, I think.
(The question was: I wonder if you, too, think this isn't useful, and that I should stop doing this.)
I don't think so, CK.
I think the best thing to do is to try to steer a middle course. Do neither of the extremes: neither completely forego adding them, nor add them in every possible case.
[#8592554] I think that I can trust you. (CK) *audio*
[#8592555] I think I can trust you. (CK) *audio*
This is a very common thing to say. The word "that" may be omitted by some people while others leave it in. I think it's worth having both, but should we systematically add a variant to every possible sentence of this type? Can we have a clear conscience about leaving it to chance whether someone else will add a possible variant of a sentence we have just written? I think each of us has tried to add a sentence that already existed, or found out that there were variants out there. It comes with a little extra work, but I don't think it matters so much. I think it's OK. We can let it happen. :)
If I was given a choice whether I wanted you to add "I think that I can trust you" to "I think I can trust you" or come up with something new like these ...
[#8594680] Mary usually wears earrings.
[#8594682] Try not to be so pessimistic.
... I would clearly choose the latter. They are simple, but they are fresh and useful. You are one of the people who can really make a difference. :)
◼ ◼ Question 2 Details
There is an issue about this on GitHub.
My suggestion was to allow members to link same language equivalents as we have been doing, using the same linking system, and then put such linked-sentences under the main sentence, labeled as "Same language equivalents."
► Here is a link to see all English-English links. You will have to visit each sentence's page and look at the logs to see who actually did the linking.
► There are, of course, related sentences that are not interchangeable, so I wouldn't be linking these kinds of sentences.
There are many examples in which related English sentences can be translated by just one sentence in another language, but are not interchangeable in English, due to tense differences, pronoun differences, etc.
Here are a couple of examples, but there are many others.
ukr [#8604202] Том дуже гучно розмовляє.
Tom talks very loudly.
Tom is talking very loudly.
jpn [#192448] りんごを食べています。
He's eating an apple.
She's eating an apple.
They're eating apples.
We're eating apples.
I'm eating an apple.
1. Expressing the same thing in several ways is useful. I think the most natural way to do that is to translate the same foreign-language sentence in several ways. If one is adding new sentences (rather than translating), I think it would be better to add varied and diverse sentences. Of course, language-specific proverbs and sayings are good to add even if the same sentence can already be found elsewhere.
ENG summary: It is good to add several similar translations to a given sentence. When adding original sentences, it is good to add idioms and language-specific expressions, but otherwise a greater variety of original sentences trumps a large group of similar or even synonymous original sentences.
2. I am cautious about linking same-language sentences, because even similar sentences often differ in nuance.
1. I think it is useful to have several variations of the same meaning. I actually wish it were the case in the languages I learn.
2. Until a technical solution for displaying same-language links separately is available, I am kind of against it, as they should already appear among the indirect translations. Once it's available, I think it would be a nice addition.
1 - It's really useful to have variants of a sentence and we (I mean people that are studying English) learn a lot. You can express yourself in different ways, it happens to all languages and it's important to know these ways.
2 - Actually, I'm not in favor of it. "I'm here." and "She's here." mean different things. Both use the same structure and verb, but the meaning changes a lot between them. The same goes for "I study English." and "I'm studying English." In the 1st sentence, it's something that I do punctually, but the 2nd one tells what I'm doing now.
> Actually, I'm not in favor of it. "I'm here." and "She's here."
It would definitely be incorrect to link "I'm here" and "She's here".
I believe CK is talking about linking "I'm here" and "I am here", or something like "I think she's cool" with "I think she is cool" and "I think that she's cool", etc.
1. I think it's EXTREMELY useful to have similar sentences in the same language that express the same or fairly similar meaning.
This way, if I add the following sentence in Ukrainian:
Я знаю, що я ідіот.
I would appreciate if I could link it to all of these:
I know I'm an idiot.
I know I am an idiot.
I know that I'm an idiot.
I know that I am an idiot.
It's very useful to a language learner to be able to see all these variants, whenever possible.
2. I think we should carry on linking sentences with the same meaning in the same language together, but only when the difference is really trivial. For example, I'd link "I am a cat" and "I'm a cat", but not "I'm a cat" and "I'm a tomcat".
> 1. Do you think that it isn't useful for me to contribute English sentences with the same meanings and that I should stop doing this?
The question does not mention quantity. It omits the fact that you use automation to generate your sentences, that you add huge quantities of them, and that your sentences in general are very similar to each other in multiple aspects (names, vocabulary, structure, difficulty). So any additional elimination of variety is like flooding a store that already contains thousands of nearly identical items with even more. Not only does it make the corpus boring, it makes good translations harder to find, since they're scattered over near-duplicate sentences.
You're not the only person to add sets of sentences that differ in only small ways from each other (the pronoun, for instance), and while I'm sure that those who do it have good intentions, I wish they would put their energy into other areas. It's easy enough to find elsewhere (Wiktionary, for instance) how to conjugate a verb. It's much harder to find the meaning of a word captured in a realistic sentence. This is Tatoeba's key mission, the thing that makes it unique, and I believe we should focus on it.
I find it interesting that you've argued so hard for drastic reduction of the number of names in sentences as a means of preventing near-duplicates, and yet you're arguing for intentionally adding another kind of near-duplicates. You should think about that inconsistency.
As for whether sentences that are interchangeable should be linked together, I don't have a problem with it.
Here are the sentence and comments that led to this discussion:
And here is a search query that shows a lot of the near-duplicate sentences being added:
It's interesting to note that all of those search results also have the "do that" placeholder for more concrete actions.
It's also interesting to note that some of the sentences are wrong:
Do you think Tom will allow Mary do that? (#6071420)
Do you think that Tom will allow Mary do that? (#8180321)
This happened with another set of CK's sentences, namely these:
Tom said he's glad that that that's going to happen. (#7180272)
Tom said that he's glad that that that's going to happen. (#7180016)
When you write large numbers of near-duplicate sentences, they're boring not only to translate, but to proofread. Therefore, whether or not you're the author, it's likely that mistakes will slip by. CK reviewed both of them as OK, and he tagged one of them (#6071420) "List 907", meaning that he proofread it and thought it was suitable for language learners. I don't think there's any better evidence that quantity of sentences can interfere with quality.
Kohdassa yksi kysyit nimenomaan, että miten sinun tulisi toimia.
Ehdotan, että lisäät kiinnostavia ja muista lauseista poikkeavia ainutkertaisia englanninkielisiä lauseita. Ehdotan, ettet lisää suuria määriä itseään toistavia lauseita, jotka eivät ole vieraskielisen lauseen suoria käännöksiä.
Tämä tekisi englanninkielisistä lauseista ja sitä myötä kaikista lauseista kiinnostavampia ja monipuolisempia. Minä esimerkiksi käännän englannista melko harvoin, koska sen lauseet toistavat itseään niin paljon. Kyllästyn niihin nopeasti ja vaihdan kieltä.
Lisäksi ehdotan, että kun käännät lauseita japanista tai muista kielistä, voit lisätä niin paljon tai niin vähän käännöksiä kuin haluat. (Japaninkielisiä lauseita, joita ei ole käännetty englanniksi ja joiden kirjoittajan äidinkieli on japani, löytyy kyllä. Vielä enemmän käännettävää löytyy, jos jättää äidinkielisyysvaatimuksen pois.)
Thanuir's message in English, via Google Translate:
In section one you specifically asked what you should do.
I suggest you add unique English phrases that are interesting and different from other sentences. I suggest that you do not add large numbers of repetitive phrases that are not direct translations of a foreign phrase.
This would make the English phrases and, consequently, all the phrases more interesting and versatile. For example, I rarely translate from English because its sentences repeat themselves so much. I get tired of them quickly and change languages.
Also, I suggest that when you translate sentences from Japan or other languages, you can add as many or as few translations as you want. (Japanese sentences that have not been translated into English and whose author's mother tongue is Japanese can be found. Even more translations can be found if you omit the native language requirement.)
Although I don't translate English into other languages, I think that they're both useful.
The gist of the issue seems to be about applying moderation (as always, really). It's good to have some interchangeable sentences that only differ in spelling or syntax for learning purposes, such as "I know you don't like Tom" & "I know that you don't like Tom", "I'm hungry" and "I am hungry", "Blue is my favorite color" and "Blue is my favourite colour". Adding tens of thousands of such sentences is overkill, however, as hordes of very similar sentences look monotonous and dull, possibly making new users less excited about the project... and, of course, adding such sentence pairs (triplets, etc.) adds less value to it than adding two different sentences.
> Adding tens of thousands of such sentences is overkill
You don't have to add them, but I don't see how someone adding tens of thousands of similar sentences does any harm. You find it dull, but I don't. It's useful for the learners, it's useful for the AI as a source of data, and we are not afraid the database will take up too much space just because of it, so there seems to be little to discuss.
Let's add more similar sentences! Join the race.
In principle I agree, but I would prefer that similar sentences came about as translations. That way they would have more immediate value and would more easily get linked to other sentences.
It doesn't harm the project all in all, so sure CK can do as he pleases. I'm just saying focusing on more varying sentences would be quite a bit more helpful, imo.
> focusing on more varying sentences would be quite a bit more helpful
I don't necessarily disagree, but I want to point out that we're all volunteers here, we all have our own ideas of how to contribute and be useful, but also have fun. The beauty of this project is that the rules are really loose - basically, contribute complete, well-written sentences and don't be rude - and we all can focus on what we feel is more useful or just more fun. That's why all attempts to ban certain names (be it Sami or Tom), or impose certain names on us, or to ban similar-sounding but valid and natural translations/sentences freak me out a lot. Because if something like this makes it into the rules, I feel like it will make Tatoeba a less welcoming place compared to what it is now.
> I don't see how someone adding tens of thousands of similar sentences does any harm
> It's useful for the learners,
Not really. Learners benefit more from diverse sentences. So, if too much attention is given to translating the same similar sentences about Tom and Mary learning French in Boston, this is a loss for learners.
It *could* be OK to add similar sentences if the Tatoeba interface were changed, e.g. if 'Random sentences' were a weighted random that gave similar sentences less prominence, and so on. Unfortunately, changing the interface this way is difficult. So, for now, when you add too many repetitive sentences, you diminish the chances of other sentences being translated.
> it's useful for the AI as the source of data
Not really. There's a problem of overfitting: if AIs are fed too much similar, automatically generated data, they might end up with wrong conclusions. That's why it's important to have a diverse corpus.
> we are not afraid the database will take up too much space just because of it
Are we? I personally think it's better to keep the database smaller, if possible. It might not be a problem for Tatoeba, but a huge database certainly makes re-using the data more difficult.
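For what it's worth, the "weighted random" idea mentioned above could look roughly like the sketch below. This is only an illustration, not Tatoeba's actual code: the cluster labels are hypothetical and assume that near-duplicate sentences have already been grouped somehow, and the function names are made up for this example. Each sentence is weighted by one over the size of its group, so a group of variants is drawn about as often as a single stand-alone sentence.

```python
import random
from collections import Counter


def inverse_cluster_weights(cluster_ids):
    """Weight each sentence by 1 / (number of sentences in its cluster)."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_random_sentence(sentences, cluster_ids, rng=random):
    """Draw one sentence so that each cluster, not each sentence, is equally likely."""
    weights = inverse_cluster_weights(cluster_ids)
    return rng.choices(sentences, weights=weights, k=1)[0]


sentences = [
    "I think that I can trust you.",
    "I think I can trust you.",
    "Mary usually wears earrings.",
    "Try not to be so pessimistic.",
]
# Hypothetical cluster labels: the first two sentences are variants of each other.
cluster_ids = ["trust", "trust", "earrings", "pessimistic"]

weights = inverse_cluster_weights(cluster_ids)
print(weights)  # [0.5, 0.5, 1.0, 1.0]
print(weighted_random_sentence(sentences, cluster_ids))
```

The hard part, of course, is deciding which sentences belong to the same group; the sketch sidesteps that entirely.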
> Not really. Learners benefit more from diverse sentences.
Learners do benefit from diverse sentences, I don't say there should be no diversity. But they also benefit from seeing variants of the same sentence, especially the beginners, but it's mildly useful even for advanced learners.
There's no contradiction in expanding diversity and adding variants of the same sentence. Those are just different tasks, each useful in its own way.
Those arguments are very theoretical.
If the "learner" is a dumb machine, then perhaps they will benefit from having tens of thousands of similar sentences. If it's a decently intelligent human learner (even a beginner), they will simply get bored. 🙂
We are quite unlikely to ever beat the mountain of those mass-imported sentences (the majority of the 1.3 million English sentences) with our handiwork. It does make a difference what our most productive contributors choose to put their energy in.
> If the "learner" is a dumb machine, then perhaps they will benefit from having tens of thousands of similar sentences. If it's a decently intelligent human learner (even a beginner), they will simply get bored.
Well, your arguments seem to be no less theoretical than mine, aren't they? Personally, I've never been bored by similar sentences, moreover, I thoroughly enjoy being able to study similar variants. I'm probably not a decently intelligent human learner by your standard.
Also, it's not like we are really exposed to all those tens of thousands of sentences at all times. We usually search by a keyword or a phrase, get a dozen sentences, and we study that, when we're learning something. Being able to see similar variants within this set is very helpful.
"We like diversity. Unleash your creativity! Avoid using the same words, names, topics, or patterns over and over again."
I wish I could understand you, Pandaa :)
(EDIT: and also other people who chose to use their own language, which is a valid choice that I respect, of course)
> I'm probably not a decently intelligent human learner by your standard.
Of course that's not what I was trying to say. I'm sorry if it sounded like that, Denis. 🙁
> Also, it's not like we are really exposed to all those tens of thousands of sentences at all times. We usually search by a keyword or a phrase, get a dozen sentences, and we study that, when we're learning something.
I'm exposed to them all the time, directly or indirectly. Most of the time, when I'm trying to use Tatoeba to find phrases that I need, I either get no hits at all or several pages of the "Tom, Mary, French, Boston" type, through which I have to browse to find a useful entry. It's similar to what Alan said somewhere else.
We are not really arguing about the existence of variants, though. I agree that they are, in general, useful. What led to this discussion was the massive linking of same-language sentences (and generally doing things massively as a kind of automated process), cf. https://tatoeba.org/deu/sentences/show/8585408.
> I'm sorry if it sounded like that
Don't worry, it didn't sound like that at all. I was just being mildly sarcastic :)
> What led to this discussion was the massive linking of same-language sentences
I read that discussion too, and I don't see any crime there. Those sentences should absolutely be linked; this saves us (translators) a lot of effort. You translate "I know I'm crazy" once and then just link it to "I know that I'm crazy", and also to "I know that I am crazy" because they're already linked - so they're there, all together.
As opposed to translating "I know I'm crazy". Then stumbling upon "I know that I'm crazy" in a year and translating it from scratch. And then translating "I know that I am crazy" again from scratch in 6 months.
One might argue it even brings more diversity, because if they're not linked originally, I translate three sentences that are the same separately, but if they're linked, I translate them in one go, link my translation to all three, and then when I'm looking for other sentences to translate I translate some other sentences, not that one.
As for the automated process or semi-automated process, I don't have my opinion on that. It depends how it is automated, how many mistakes that introduces, what kind of mistakes, and how promptly they're fixed.
EDIT. An example to illustrate my point. I just stumbled upon those 4 sentences, which are really 4 variants of the same sentence:
I translated one of them, and linked my translation to all four. Now, when a learner of English stumbles upon my Ukrainian sentence, they would be able to see 4 ways to translate it:
At the same time, I didn't waste my time translating all four of them at some point of my translator's career, so the diversity is unaffected - I keep translating different sentences.
Linking them helps all of us a lot.
As a side note, could you tell me (either here or by PM) what is your usual process when you translate? By that I mean the criteria you use for your search or anything else that could be relevant.
Hopefully, that will be helpful to develop a way to deal with this situation.
When I binge translate, I use this link:
Translate everything from English (not necessarily English, but mostly), sentences that have audio (I turn that off though when I translate from other languages), and that have no direct translations into Ukrainian, sort order - random.
So, obviously, if we have a "cluster" of linked English sentences and at least one of them is translated into Ukrainian, I won't see any of them, which makes sense.
I also like browsing Ukrainian sentences and creating direct links from indirect, when it's appropriate.
Sometimes I just search by keywords or expressions when I'm looking for something in particular.
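To make the "cluster" behaviour described above concrete, here is a rough sketch. It is only an illustration under my own assumptions, not how Tatoeba's search is implemented: the sentence ids are made up, and `clusters` and `untranslated` are hypothetical helper names. Same-language links are treated as edges of a graph, each connected group of sentences is one cluster, and a cluster is skipped as soon as any one of its sentences already has a translation.

```python
from collections import defaultdict


def clusters(sentence_ids, links):
    """Group sentence ids into connected components of same-language links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sentence_ids:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        result.append(component)
    return result


def untranslated(sentence_ids, links, translated):
    """Return the ids whose whole cluster has no translation yet."""
    todo = []
    for component in clusters(sentence_ids, links):
        if not component & translated:
            todo.extend(component)
    return sorted(todo)


# Made-up ids: 1, 2 and 3 stand for three linked variants of one sentence,
# 4 is an unrelated sentence. Only sentence 1 has been translated.
ids = [1, 2, 3, 4]
links = [(1, 2), (2, 3)]
translated = {1}
print(untranslated(ids, links, translated))  # [4]
```

With the links in place, translating one variant removes the whole group from the to-do list; without them, sentences 2 and 3 would still show up later as separate items to translate.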
Plus, it is not very useful for translators either, not just for learners: if someone translates thousands upon thousands of identical sentences, there is no diversity, and it only dulls their skills.
More than one translator has reported that their knowledge of certain languages has only deteriorated since they started translating on Tatoeba.
Being able to exclude sentences owned by specific users from the search would go a long way towards ensuring diversity in one's personal search results. Actually, it would be entirely good enough for now if one could just exclude CK, CM, CF, etc.
Does it seem that most members understand the reasons for doing the following, and that it would be OK to continue doing so?
1. Contribute various sentences in the same language that have the same meanings.
2. Link sentences in the same language that are interchangeable.
1. It's OK, but preferably when you add them all as translations of the same sentence in another language. Varied sentences are also better. Sentences with the same meaning will appear on their own anyway as sentences are translated from one language to another.
2. If I remember correctly, you were specifically asked to avoid this, because it looks odd in the new sentence view. I don't have an opinion on the matter as such.