Thanuir, September 23, 2018 at 9:33:33 AM UTC

Is there a clever way of dealing with combinatorial explosion when adding translations?

To illustrate, consider the simple English sentence "You eat."

Finnish translations can use the singular or the plural "you", and it can also be omitted, because the verb contains that information. This gives four translations:

* Sinä syöt.
* Te syötte.
* Syöt.
* Syötte.
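
To make the count concrete, here is a minimal Python sketch that enumerates the four variants above as the product of two independent choices: singular or plural "you", and pronoun kept or omitted. The `FORMS` table and the `variants` function are hypothetical names made up for this illustration, not part of any Tatoeba tool.

```python
from itertools import product

# Hypothetical lookup table for the example sentence "You eat."
FORMS = {
    "singular": {"pronoun": "sinä", "verb": "syöt"},
    "plural": {"pronoun": "te", "verb": "syötte"},
}


def variants():
    """Return every combination of number and pronoun presence."""
    results = []
    for number, keep_pronoun in product(FORMS, (True, False)):
        form = FORMS[number]
        words = [form["pronoun"], form["verb"]] if keep_pronoun else [form["verb"]]
        sentence = " ".join(words)
        # Capitalize the first letter and close the sentence with a period.
        results.append(sentence[0].upper() + sentence[1:] + ".")
    return results


for sentence in variants():
    print(sentence)
```

Each further independent choice (word order, a synonym, an optional particle) multiplies the count again, which is where the explosion comes from.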

If we consider a slightly more complicated sentence: "You don't eat." This adds some more options for translating, since "Sinä et syö." and "Et sinä syö." are both valid orders for the words. (The other permutations are poetic or strange.)

With an even slightly more complex sentence, one with subclauses or words that have several possible translations, this easily leads to tens of valid translations, all of which are reasonable.

Does there exist a tool or method that speeds up the generation of these different variants, so as to add them to the corpus? Or should I only add a single translation, even if there are several good ones?

Shishir, September 23, 2018 at 9:54:50 AM UTC

You can add all possible valid translations, but there's no tool to speed up the process.

Guybrush88, September 23, 2018 at 11:02:38 AM UTC

@Thanuir multiple translations are allowed, since, as you said, a single sentence in a given language can be translated in different ways in another language. A quick hack I do when adding multiple translations is simply to copy and paste the first translation, and then you'll just have to modify a single word before submitting the new translation.

Aiji, September 23, 2018 at 11:12:26 AM UTC

Besides the other responses, I can hardly think of any computational "method that speeds up" the process. Automatic translation is far from being good enough out of context, automatic matching would not work for every case, etc. So copy and paste, mainly...

Thanuir, September 23, 2018 at 2:32:27 PM UTC

I would imagine the tool would not translate automatically, but rather, given a translation, try to change it in preordained ways. Like in an English translation from Finnish, it would suggest a new translation with "he" replaced by "she", which is almost always ok. In a Finnish translation, it might suggest another one where the order of the words "minä" and "en" is switched.

These would all have to be checked by a human, of course.
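
As a rough sketch of what I mean, assuming a small hand-written rule set: the Python below takes an existing translation and proposes variants for a human to accept or reject. All function names and the "eng"/"fin" language labels are hypothetical, made up for this illustration, and the rules cover only the two cases mentioned above.

```python
import re


def swap_he_she(sentence):
    """Suggest the sentence with the standalone pronoun 'he' replaced by 'she'."""
    def repl(match):
        return "She" if match.group(0)[0].isupper() else "she"
    return re.sub(r"\b[Hh]e\b", repl, sentence)


def swap_adjacent(sentence, first, second):
    """Suggest the sentence with two adjacent words in the opposite order."""
    body = sentence[:-1] if sentence.endswith(".") else sentence
    words = body.split()
    for i in range(len(words) - 1):
        if words[i].lower() == first and words[i + 1].lower() == second:
            a, b = words[i], words[i + 1]
            if i == 0:
                # Move the sentence-initial capital to the new first word.
                a, b = a.lower(), b.capitalize()
            words[i], words[i + 1] = b, a
            return " ".join(words) + sentence[len(body):]
    return None


def suggest(sentence, language):
    """Return candidate variants; every one still needs human review."""
    candidates = []
    if language == "eng":
        candidates.append(swap_he_she(sentence))
    elif language == "fin":
        candidates.append(swap_adjacent(sentence, "minä", "en"))
    # Drop rules that did not apply or that changed nothing.
    return [c for c in candidates if c and c != sentence]


print(suggest("He eats.", "eng"))
print(suggest("Minä en syö.", "fin"))
```

The sketch only proposes candidates; it does not judge whether a variant is natural.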

AlanF_US, September 23, 2018 at 12:34:00 PM UTC, edited September 23, 2018 at 12:46:41 PM UTC

Even if someone did come up with a tool for automatically generating all variants, I would be unhappy if it were widely used.

Tatoeba's strength is that it contains sentences written by humans. It is not a dictionary, nor is it a series of tables (pronouns, conjugations, declensions). Those are useful tools, but they can be found elsewhere. In theory, one can search through Tatoeba for sentences containing a word, and find a true variety of contexts in which it is used. I say "in theory" because often that is not the case. The number of sentences automatically generated from templates is already so high that a search for a word generally returns a permutation on a small set of patterns, especially if you're looking at the shorter sentences, which, by default, are the first ones you see. Search for "forgive", and you see "Forgive me", "Forgive us", "Forgive Tom", "Forgive them", etc. Variations on a pattern like this are also easy for humans to generate, but at least they're tedious, which has the advantage that humans can get sick of doing it. Scripts don't get bored.

I recognize that there is some value to providing "Forgive me", "Forgive us", "Forgive Tom", "Forgive them". The problem is that, for my purposes at least, the value is much less than that provided by a more diverse set of sentences, and the prevalence of "templated" sentences makes it harder to find the kind I'm looking for. To mitigate the problem somewhat, I can request my search results in random order of length. But if it becomes as easy to mass-produce longer sentences that are a variation on a minor theme, then even that attempt to find diversity will be frustrated.

I realize that the kind of permutation you're talking about does not involve replacing one pronoun with another. However, I think the same kind of effect would kick in if large numbers of sentences with effectively the same meaning were added.

I'm sure there will be people who disagree with me, but I believe that Tatoeba best serves the needs of people who already know the grammar of a language sufficiently well to be able to perform these kinds of automatic or semiautomatic substitutions. I also believe that trying to make it serve the needs of people who are not at that level interferes with the needs of the people in its core group of users.

Thanuir, September 23, 2018 at 12:54:56 PM UTC

What is your purpose?

AlanF_US, September 23, 2018 at 2:23:03 PM UTC

I use Tatoeba to get me beyond basic knowledge of a language. In particular, I use it to help expand my vocabulary and my knowledge of grammar in context. I do this by finding sentences that are useful to me (either by browsing through the list of all sentences in that language written by native-level speakers, or by searching for particular words), adding them to lists, downloading the lists, and importing them into Anki, a spaced repetition flashcard tool. I find that very simple sentences are less useful to me than ones that are a little longer because the longer ones give me something more substantial to retain. For instance, I can choose sentences where there is some kind of similarity, even accidental, between two of the words they contain. Then I can use one to help me remember the other.

Thanuir, September 23, 2018 at 2:28:52 PM UTC

Okay, thanks.

It seems that your problem is with simple sentences, rather than multiple variations of a more complicated sentence.

AlanF_US, September 23, 2018 at 3:44:37 PM UTC

Indeed, if people automatically generated complex sentences, I might be better able to find the kind of sentences I like to memorize than if they did the same with simple sentences. But I would still prefer "custom-crafted" sentences. Note that there are at least two aspects to the problem that I, and people who are using Tatoeba in a similar way, face:
- a scarcity of the kind of sentences that we are looking for
- a surplus of the kind of sentences that we are not looking for (making it harder to find the other kind)

It's easy to see how automatic generation of sentences, whether simple or complex, adds to the surplus part of the equation. It may not be immediately obvious how it adds to the scarcity part, but if people are turning their attention toward trying to automatically create exhaustive sets of sentences with little variation between them, then they're diverting their energy from the value that they can add by adding more varied sentences.

The fact that the "surplus" problem is most noticeable with simple sentences is partly a consequence of the fact that it's harder to write a script to perform these permutations with more complex sentences, so people tend not to have done it as much.

Thanuir, September 23, 2018 at 4:12:19 PM UTC

I don't know if anyone has added sentences automatically. I, for sure, have not, and I think that with current technology any such solution would require human oversight anyway.

AlanF_US, September 25, 2018 at 12:42:16 PM UTC

Yes, naturally there's human oversight in the process. People have to write and execute the scripts in the first place, and they have to add the resulting sentences, and undoubtedly they proofread them in between. The problem is that the automatic generation itself produces dull sentences that lack the context that makes them valuable for learning.

I have no problem with people adding both "he" and "she" translations of sentences that make no distinction between those pronouns in the source language. But as others have said, that can be done easily by copy/paste/modify, especially if you use the "copy" icon to the right of a sentence (the rectangle with parallel lines), which copies the content of the sentence to the clipboard.

Thanuir, September 23, 2018 at 12:34:09 PM UTC

For example, I kind of ran out of steam here: https://tatoeba.org/dan/sentences/show/30341

Orava, September 23, 2018 at 1:58:46 PM UTC

It's true that this sentence can be translated in many ways. All of these seem to be grammatically correct, but I'm not sure if _all_ of them are very natural; for example, using the -tta verb ending doesn't feel very natural to me in this context. I feel like the word ikinä has an emotional charge (or something) that doesn't sound natural in this context. In my opinion, ikinä could often be translated as 'never ever'. I think it's important to consider what kind of contributions are the most helpful for people who are using Tatoeba in their studies.

Thanuir, September 23, 2018 at 2:23:10 PM UTC

I think "ikinä" is part of usual speech and writing, whereas "never ever" sounds quite strange and maybe even childish. So I would say that "ikinä" is fine, here. This is only based on my feeling, not research or training.

I can certainly write or say "Yrittämättä et saa koskaan tietää." etc. without cringing or feeling like a poet, which has been my standard thus far for whether a sentence is sufficiently natural or not. I would not add, for example, "Yrittämättä onnistu et.".

If you have better guidelines, then please let me know and I'll consider them.

Orava, September 23, 2018 at 2:30:48 PM UTC

https://en.wiktionary.org/wiki/ikinä

Thanuir, September 23, 2018 at 2:34:34 PM UTC

That shows an example with "never ever" and examples without that extra emphasis. Could you be more explicit about what you mean here?