✨ Sentences with indirect translations and no direct translations ✨
I’ve made lists of sentences in language A having indirect translations but no direct translations in language B. I think several people might find these lists quite useful.
For ease of use, the names of all lists follow the same pattern:
Indirect translations ISO1 → ISO2
For example, if you want Russian sentences with indirect translations in Esperanto (but no direct translations), the name of the list is:
Indirect translations RUS → EPO
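For anyone who wants to build the list name in a script, the pattern above can be sketched in a few lines of Python. The function name `list_name` is made up for this illustration; only the naming pattern itself comes from this post.

```python
def list_name(source_iso: str, target_iso: str) -> str:
    """Build a list name following the pattern 'Indirect translations ISO1 → ISO2'.

    `list_name` is a hypothetical helper, not part of Tatoeba.
    ISO codes are upper-cased so that "rus" and "RUS" give the same name.
    """
    return f"Indirect translations {source_iso.upper()} → {target_iso.upper()}"


print(list_name("rus", "epo"))
```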
Currently, there are lists for the 10 languages with the most sentences on Tatoeba, each list containing 500 sentences (more info below).
※ How to use ※
Even if a sentence gets a direct translation in language B, it will stay in the list until the next update. Therefore, I advise using the Advanced search to ① fetch sentences from your preferred list, and ② exclude sentences that already have a directly linked translation in language B.
For example, for ENG → FRA, the following search can be used.
※ More details ※
As mentioned above, there currently exist lists for all combinations of the 10 languages with the most sentences. That is 90 lists. With 500 sentences each, that makes 45,000 sentences.
I hope to add more languages, but the number of lists, and therefore the logistical effort, grows quickly. For example, covering all combinations of the 11 languages with the most sentences would add 20 lists, the 12th language would add 22 more, etc. (I don’t subscribe to the idea that "because there is ISO1 → ISO2, we don’t need ISO2 → ISO1".)
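The arithmetic behind these numbers is simply the count of ordered pairs: n languages give n × (n − 1) lists, because ISO1 → ISO2 and ISO2 → ISO1 are separate lists. A small sketch (the function names are invented for this example):

```python
from itertools import permutations


def count_lists(n_languages: int) -> int:
    """Number of ordered language pairs, i.e. lists, for n languages."""
    return n_languages * (n_languages - 1)


def lists_added_by_nth_language(n: int) -> int:
    """Extra lists needed when going from n - 1 to n languages."""
    return count_lists(n) - count_lists(n - 1)


# Cross-check the formula by enumerating the ordered pairs explicitly.
assert count_lists(10) == len(list(permutations(range(10), 2)))

print(count_lists(10))                   # lists for 10 languages
print(count_lists(10) * 500)             # sentences at 500 per list
print(lists_added_by_nth_language(11))   # lists added by the 11th language
print(lists_added_by_nth_language(12))   # lists added by the 12th language
```

Each new language adds 2 × (n − 1) lists, so the step gets bigger with every language added.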
Therefore, I want to add USEFUL pairs first. Since I cannot know which pairs of languages would be useful, I want to ask you what would be useful FOR YOU. Please let me know in this thread what pairs you’d like to have, and I will prioritize your requests.
→ Number of sentences in each list
Each pair of languages has (tens or hundreds of) thousands of potential candidates for addition to these lists.
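To make "candidate" concrete, here is a minimal sketch of how such sentences could be found. It is not the script used to build the lists. It assumes a dictionary mapping sentence ID to language code and a collection of undirected links given as ID pairs, and it treats an indirect translation as a translation of a translation. All names (`indirect_only`, `sentences`, `links`) are hypothetical.

```python
from collections import defaultdict


def indirect_only(sentences, links, src, dst, limit=500):
    """Return IDs of `src` sentences that have an indirect translation in
    `dst` but no direct one, lowest IDs first, at most `limit` of them.

    sentences: dict mapping sentence ID -> language code (assumed layout)
    links:     iterable of (id_a, id_b) pairs, treated as undirected
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    result = []
    for sid in sorted(sentences):
        if sentences[sid] != src:
            continue
        direct = neighbours.get(sid, set())
        # Skip sentences that already have a direct translation in dst.
        if any(sentences.get(t) == dst for t in direct):
            continue
        # Look one step further: translations of the direct translations.
        has_indirect = any(
            sentences.get(u) == dst
            for t in direct
            for u in neighbours.get(t, set())
            if u != sid
        )
        if has_indirect:
            result.append(sid)
            if len(result) >= limit:
                break
    return result


# Toy data: 1 (eng) - 2 (jpn) - 3 (fra), and 4 (eng) - 5 (fra) directly.
toy_sentences = {1: "eng", 2: "jpn", 3: "fra", 4: "eng", 5: "fra"}
toy_links = [(1, 2), (2, 3), (4, 5)]
print(indirect_only(toy_sentences, toy_links, "eng", "fra"))
```

In the toy data, sentence 1 qualifies because it reaches French only through the Japanese sentence, while sentence 4 is skipped because it already has a direct French translation.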
I’m aware that for some pairs of languages, 500 sentences is a relatively small number that could be tackled in a few days, and that for other pairs it is a relatively big number. For logistical reasons (mostly time), no list will be updated more than once a week (and probably much less often than that). That’s why using the search is a good way to use these lists. I may add more or fewer sentences on each update, depending on the languages.
Lists for the following pairs of languages were added:
JPN ⇄ ENG, JPN ⇄ FRA, JPN ⇄ DEU,
SPA ⇄ ENG, SPA ⇄ FRA, SPA ⇄ DEU, SPA ⇄ EPO
This is excellent! Thanks for working on it. I started looking at the RUS→ENG list, and I noticed that the sentences in that list seem a lot more diverse in terms of sentence owners than a random search of the corpus usually is. I wonder if that was an intentional aspect of the design. It's cool in any case.
It is not intended, and it will depend on which slice of the database was put in the list. These lists are created with "Oldest sentences first". You may have noticed that the sentence IDs of many sentences in these lists are rather small (not all of them). With such a display order, several things are worth noting (or at least, that was my intention).
- For English and a few other languages, sentences from a time when several users each contributed only a few sentences will appear first. Maybe that's one reason the current list appears more diverse to you.
- For languages like French, which distinguish masculine and feminine forms, several variants are often added as translations of a single sentence. These translations are likely to appear in the same slice (since they were added one after another), so many links can be added at once. That, of course, negatively impacts diversity, but I think diversity is not very important when using these lists.
- The impact of heavy contributors should be slightly reduced compared to a random search. Not that the contributions of heavy contributors are not diverse, but there is a better chance of seeing a higher number of authors than in a random search (it will depend on the language pair, obviously).