Is there an easy way to find out how many of the 503,559 German sentences are originals and how many are translations?
I could only work out that there are 30,635 German sentences that have no translation at all, so they must be originals, but how many originals are there in total?
And what is the distribution of the languages that the non-originals were translated from?
I guess the majority are translations from English sentences (either originals or translations themselves).
How can we retrieve those metrics? Anyone?
On Tatoeba directly, it's impossible to do yet.
Offline, if you're proficient in some programming language, you can probably use the exported sentences file and links file to find out.
If neither of the above is possible, you may want to wait until I add this function to Tatoeba playground, the external exploration tool I develop for fun and for others to explore the corpus in ways that aren't possible on Tatoeba (self-promotion ^^) https://github.com/agrodet/Tatoeba-playground. I plan an update this weekend and may include this feature. I have been thinking about it for quite some time :)
PS: Note that even if a sentence has no translation, it may have been a translation that was unlinked later on, and hence not an original sentence.
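As a rough illustration of the offline approach, here is a minimal Python sketch that finds the sentences of one language that have no link at all. It assumes (this is not confirmed in the thread) that the exported sentences file has tab-separated rows of the form id, language, text and that the links file has rows of the form sentence id, translation id. The function name `unlinked_sentence_ids` is made up for this example. As the PS says, an unlinked sentence is only a candidate original, not a proven one.

```python
def unlinked_sentence_ids(sentence_rows, link_rows, lang):
    """Return the ids of sentences in `lang` that appear in no link.

    sentence_rows: iterable of [id, lang, text] rows (assumed export layout)
    link_rows:     iterable of [sentence_id, translation_id] rows (assumed layout)
    """
    linked = set()
    for row in link_rows:
        # A sentence counts as linked whichever side of the link it is on.
        linked.add(row[0])
        linked.add(row[1])
    return {
        row[0]
        for row in sentence_rows
        if row[1] == lang and row[0] not in linked
    }


# Tiny made-up sample instead of the real export files.
sample_sentences = [
    ["1", "deu", "Hallo."],
    ["2", "eng", "Hello."],
    ["3", "deu", "Tschüss."],
]
sample_links = [["1", "2"], ["2", "1"]]
print(unlinked_sentence_ids(sample_sentences, sample_links, "deu"))
```

With real files you could pass `csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)` for each open file instead of the sample lists, provided the layout assumption above holds.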
Hi Aiji,
which platform is this tool for - or is it web-based?
And what’s the news on the update?
> how many originals are there in total?
There are 80471 original sentences in German at the moment.
> How can we retrieve those metrics? Anyone?
This is how original sentences were calculated when we introduced this information in Tatoeba:
https://github.com/Tatoeba/tato...ationShell.php
You could technically install Tatoeba, import the sentences and contributions (using the file we export weekly) into the local database then run that shell. But let's say it's not the easiest way to do it :)
There's an issue in GitHub which would solve your problem here if it was implemented:
https://github.com/Tatoeba/tatoeba2/issues/2159
Once this is implemented, you will be able to get the number from the search.
In the meantime, if you just need some stats now and then to get an idea, we can query the production database.
Thanks Trang!
Actually I would really like to see the sources for the translations of the remaining 420,000 sentences.
I guess around 3/4 will be translations from English sources, but the rest...?
It’s not exactly what you are asking, but you might be interested in this post: https://tatoeba.org/wall/show_message/21926
The data is 5+ years old, but it looks like it’s possible to rebuild it from source: https://github.com/tguinard/tatoeba_visualization
Woa, woa, woa...
That looks like a wallpaper from the 1970s. - Psychedelic...
Actually I was just hoping for an informative little five liner ;-)
Thanks anyways!
I've created this issue in GitHub:
https://github.com/Tatoeba/tatoeba2/issues/2325
This might be a task for someone during Kodoeba...
Otherwise, if I got the query right, German sentences are mostly translated from English, Esperanto, French, Japanese and Russian (for the top 5). Details here:
https://gist.github.com/trang/0...2caedb4c678c46
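For anyone who wants to try a similar count offline, the sketch below tallies the source language of every translated sentence in a target language. It assumes you already have, for each sentence, its language and the id of the sentence it was translated from (`None` for an original). Where that data comes from and the name `source_language_counts` are assumptions for this example, and this is not the query behind the gist.

```python
from collections import Counter


def source_language_counts(sentences, target_lang):
    """Count, per source language, the translations written in `target_lang`.

    sentences: dict mapping sentence id -> (lang, based_on_id or None)
               (hypothetical in-memory layout for this example)
    """
    counts = Counter()
    for lang, based_on in sentences.values():
        if lang != target_lang or based_on is None:
            continue  # other language, or an original sentence
        source = sentences.get(based_on)
        if source is None:
            continue  # source sentence unknown or deleted
        counts[source[0]] += 1
    return counts


# Made-up data: two German sentences translated from English,
# one translated from another German sentence, one original.
sample = {
    1: ("eng", None),
    2: ("deu", 1),
    3: ("deu", 1),
    4: ("deu", 2),
    5: ("deu", None),
}
print(source_language_counts(sample, "deu").most_common())
```

Note that a count like this does not filter out same-language sources, so the target language itself can show up in the result.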
I see German in line 20 of the listing...
Sure you got the query right?
It's possible for people to translate from the same language. Here are some examples:
https://tatoeba.org/eng/sentences/show/331940
https://tatoeba.org/eng/sentences/show/331942
https://tatoeba.org/eng/sentences/show/340727
https://tatoeba.org/eng/sentences/show/341099
https://tatoeba.org/eng/sentences/show/347688
https://tatoeba.org/eng/sentences/show/349821
https://tatoeba.org/eng/sentences/show/349822
https://tatoeba.org/eng/sentences/show/350135
https://tatoeba.org/eng/sentences/show/387987
https://tatoeba.org/eng/sentences/show/430928
If you go to the sentence indicated in the logs in "This sentence was initially added as a translation of sentence ...", you'll see it leads to another German sentence.
But what’s the point here?
Translation from one language to the very same language? That’s not what I would consider a TRANS-lation.
Are you sure this is not just a linking error?
> But what’s the point here?
https://tatoeba.org/eng/wall/sh...#message_34400
> Are you sure this is not just a linking error?
It could be an error, yes. It does happen that people create translations from the wrong language as reported here:
https://github.com/Tatoeba/tatoeba2/issues/2132
In most cases, I think it's intentional. I can't say for sure though.
But this is pure chaos.
e.g.: https://tatoeba.org/eng/sentences/show/123970
We have a wide choice of books.
En esta tienda tenemos una gran variedad de libros.
These are two totally independent sentences, yet they appear as direct translations...
I'm not sure exactly what you think is chaos.
If you consider one of the sentences listed in "Translations" is not a correct translation of "当店にはいろいろな種類の本がございます。", then you can report it in the comment. Someone will unlink the sentence.
As a general rule: if B and C are correct translations of A, B and C don't have to be correct translations of each other.
>We have a wide choice of books.
> En esta tienda tenemos una gran variedad de libros.
>These are two totally independent sentences, yet they appear as direct translations...
"Totally independent" is not accurate. The translation of the Spanish is "In this store, we have a wide choice of books", so as you can see, the second part of the Spanish sentence matches the full English sentence. However, it would be better to have a closer match, so I added a comment asking for an improvement, as Trang suggested.
By ‘totally independent’ I meant that it is impossible to evaluate (direct link)
• En esta tienda tenemos una gran variedad de libros.
• Tenemos una gran variedad de libros.
as two sentences that can be equally treated - as if they were identical.
Either the source sentence has a local prepositional complement or not.
BTW: in the English translation neither of the two Spanish versions is directly linked -
and in the two Portuguese versions neither the Spanish nor the French versions are directly linked...
I find this pattern of lack of accuracy (regarding linking) all over the place!
When you look at a sentence page the most prominent property is not the single translation itself but rather the multiple lines of links - direct and indirect. So if I were new to Tatoeba I would definitely consider this an integral part of the service. However, after looking up a few examples and finding out pretty quickly that the whole linking thing is incomplete as hell, I would surely ask myself how reliable this service is and very likely move on in search of something more accurate.
EDIT:
The term ‘local prepositional complement’ did cause some misunderstandings in the following posts.
So to clarify beforehand: This was just meant as a hint. I am not referring per se to the grammatical concept of using prepositions for adding additional information about placement of things or situations. Be it a particle, a preposition or even a dedicated case to convey the notion of placement - the point is that it somehow has to be explicitly mentioned in the very sentence in order to be taken into consideration for any translation. Presuming implicit context will lead to inconsistent results.
Reducing a noun phrase to a pronoun (Our firm -> we) may be eligible for some translations from certain languages but the other way round would just be guesswork.
So I should rather have said:
Either the source sentence contains any explicitly mentioned notion of placement or not.
However, the point I am trying to make is the inconsistency in the usage of direct/indirect links that inevitably leads to wrong assumptions or misunderstanding.
For the practical consequences in the Tatoeba corpus see further down
https://tatoeba.org/eng/wall/sh...#message_35230
Go ahead and start making links, then. A volunteer-based database means that if you want something, you do it yourself.
I am currently translating Danish sentences into Finnish. While doing so, I link them to other sentences that I am sufficiently sure about. But I do not link my own translations at the same time, because I would then have to open each of them in its own tab. I did this at some point; it is very laborious.
The user interface would have to change significantly to show which of the displayed sentences are linked to each other and to make adding and removing those links easy.
> I would surely ask myself how reliable this service is and very likely move on in search of something more accurate.
Good luck with that.
> Either the source sentence has a local prepositional complement or not.
That's a very funny statement considering that the original sentence you were talking about is in Japanese, a language about which you apparently have no knowledge, and also considering that all the languages listed in your profile are Western European.
A thing that some people understand after staying a while in Tatoeba is that being so sure of oneself about how languages work is a very wrong approach. I think that's why we leave this part to linguists and other specialists.
I'll stop here and not comment on the "service" part, so that my good friends around here don't have to advise me against unnecessary aggressiveness.
> A thing that some people understand after staying a while in Tatoeba is that being so sure of oneself about how languages work is a very wrong approach. I think that's why we leave this part to linguists and other specialists.
Another thing that some people understand after staying a while in Tatoeba is that being so sure of oneself about how to interpret other people's profiles is a very wrong approach. I think that's why we leave this part to the profile owner - whether and why he/she adds some language to the list or not.
There may be some of them who won’t add languages to their profiles unless they have lived for a certain period in a part of the world where these languages are spoken natively - although they might know a thing or two about these languages.
You see, sometimes appearances are simply deceiving...
If you don’t agree I suggest you implement some policies that comments and contributions are only allowed by certified ‘linguists and other specialists’ and instead let them do the tedious conveyor belt work we ignorant dummies do ;-)
I have to quote you again - “Good luck with that!”
So when some ordinary sheep criticizes a feature that may have been implemented by you - don’t take it personally - we only see the feature and its effect on us but we don’t know YOU. I am sorry that I have rattled your cage...
And what a shame that you only consider those as friends who pat you on the shoulder but not the other ones who challenge you to sometimes have a more thorough look at things that you simply might have gotten used to over time.
Nevertheless, thanks for keeping on adding useful features to Tatoeba in the future. Much appreciated!
> You see, sometimes appearances are simply deceiving...
Did I say something about your knowledge in languages other than the one listed in your profile? No, because I have no way to judge that.
Did I imply that your comment was Western European-biased? Yes, because it was.
Did I say that you have no knowledge of Japanese? Yes, because if you knew "a thing or two" about Japanese, you wouldn't make the following statement.
> Either the source sentence has a local prepositional complement or not.
So, even if I agree with you that appearances are sometimes deceiving, I think that in this case, they're far from being deceiving.
Now, for the second part of your comment, which is only here to provoke and is near the zero-level of argumentation, it amused me, so I will answer.
- First of all, belittling yourself so that others will take your side only works in the schoolyard or in political debates. You're not a sheep. Nobody says you're a sheep. And nobody criticized your right to criticize.
- You're talking about things that have nothing to do with my comment, about the feature and how I was hurt, but again it doesn't work. If you have nothing to oppose to a comment, don't get lost in arguments about unrelated topics. If you're debating alone, it's an OK strategy, but if somebody answers you, it costs you credibility.
- I think one is free to decide who their friends are. You should read what I wrote again, because I don't think I said that people who don't agree are not my friends, but rather something like "my friends here", period. Again, you're arguing against a statement that you wrongly extrapolated from a simple one. It doesn't work.
If we were in speech class, I would conclude by saying that your argumentation is a very good approach if you're giving a speech to a mass audience, but a very risky one if you're arguing with someone, because it relies more on extrapolation than on solid ground. If you want to lead a discussion, you have to answer or counter-attack with solid, logical counter-argument(s) that destabilize the opposing party.
Nevertheless, thanks for keeping on giving feedback on your experience on Tatoeba. Much appreciated.
> I find this pattern of lack of accuracy (regarding linking) all over
> the place!
Could you point out some examples?
The Spanish sentences you mentioned were both translated from Japanese. And based on the comments on #8764247, they are both considered valid translations of the Japanese sentence. So there's no problem with the links there.
You are perhaps misunderstanding something about the structure of the Tatoeba corpus.
I don’t derive any conclusion whatsoever by taking into account the original Japanese sentences.
You wrote above:
> As a general rule: if B and C are correct translations of A, B and C don't have to be correct translations of each other.
That is understood, no problem with that.
But if the original sentence e.g. somehow includes some notion that is remotely related to a local complement expressed with a preposition in many western languages and therefore can be translated in two different ways
Sentence #123970: 当店にはいろいろな種類の本がございます。(A)
• Tenemos una gran variedad de libros. (B)
• En esta tienda tenemos una gran variedad de libros. (C)
then this ‘special property’ - this ‘duplicity’ as it were - is rooted in the source language (A) and has to be an equally valid argument for any translation into any other language.
So with that in mind - if I now compare Group (B)
• We have a wide choice of books.
• Tenemos una gran variedad de libros.
• Nous avons un large choix de livres.
• Nós temos uma ampla variedade de livros.
etc.
and respectively Group (C )
• In this store, we have a wide variety of books.
• En esta tienda tenemos una gran variedad de libros.
etc.
- you get the point - then I would subsequently assume that they ALL had to be directly linked to the Japanese source sentence.
But what I find instead is
• Tenemos una gran variedad de libros. (DIRECT)
• Nous avons un large choix de livres. (INDIRECT)
• We have a wide choice of books. (DIRECT)
• Nós temos uma ampla variedade de livros. (INDIRECT)
• En esta tienda tenemos una gran variedad de libros. (DIRECT)
• In this store, we have a wide variety of books. (INDIRECT)
There is consistency neither within a group of identical sentences nor with regard to the intricacies of the source language.
And it goes on in the English version of Group (B)
Sentence #280025: We have a wide choice of books.
• Nous avons un large choix de livres. (DIRECT)
• Tenemos una gran variedad de libros. (INDIRECT)
• Nós temos uma ampla variedade de livros. (DIRECT)
I can even see
• En esta tienda tenemos una gran variedad de libros. (INDIRECT)
however no trace of
• In this store, we have a wide variety of books. (INDIRECT)
whereas in the Spanish version of
Sentence #8764247: En esta tienda tenemos una gran variedad de libros.
• We have a wide choice of books. (INDIRECT)
• Tenemos una gran variedad de libros. (INDIRECT)
both indirect links are shown but in the identical English version
Sentence #8768755: In this store, we have a wide variety of books.
none of the indirect links show up!!!
———————————————————
So what should someone new to Tatoeba conclude?
Sentence #123970: 当店にはいろいろな種類の本がございます。
• En esta tienda tenemos una gran variedad de libros. (DIRECT)
This must be wrong because the INDIRECT English version ‘should’ be correct? Or vice versa? Or do just
• En esta tienda tenemos una gran variedad de libros. (DIRECT)
• In this store, we have a wide variety of books. (INDIRECT)
not correspond with each other because they are obviously handled differently with regard to their linking to the Japanese original?
If you are not wildly proficient in a multitude of languages and cross-cultural domains, it’s pretty much impossible to come to a conclusion on your own, except for the conclusion that maybe there is either something very complicated going on under the hood that doesn’t meet your understanding and expectation of how it “should” work - or the quality of the service is not sufficient! Neither conclusion is desirable!
And as I already stated above:
> When you look at a sentence page the most prominent property is not the single translation itself but rather the multiple lines of links - direct and indirect.
So the average person will evaluate the cross language facility as being the prominent feature of Tatoeba and hence expect a certain accuracy and quality of the service.
The current workflow for contributors however doesn’t allow for a decent management and maintenance of the linking system when providing a new contribution.
Essentially, with every new translation we gain quantity and even an automatic ‘free link’ but at the same time we introduce a multiple of inaccuracies for every already existing translation that is not correctly being linked to the new entry immediately, which of course reduces the overall quality of the whole corpus, provided that you don’t see the “single sentence approach” as the main service of Tatoeba but rather how the whole prominent linking system plays well together.
If I understand your argument correctly, you feel that Tatoeba's user interface leads one to assume that it contains all possible direct and indirect links between sentences.
That is not the case today, and hardly anyone who adds sentences to Tatoeba assumes it to be true. There are a few reasons:
1. As mentioned earlier, and as you mention yourself, the user interface makes this difficult.
2. To add a link between two sentences, a person has to know both languages well enough. If nobody using the service knows, say, both Estonian and Italian, nobody will end up directly linking sentences between those two languages. In addition, not all users are able to link sentences easily.
3. Users disagree about what kinds of sentences should be linked to each other. For example, some link sentences such as "Hei." and "Hei!" to each other, others do not.
There may be other obstacles as well.
Of these obstacles, 2 is probably unavoidable. We do not want people to link sentences if they do not understand both of them well enough.
Obstacle 3 can be lowered by discussing which kinds of sentences should be linked to each other and which should not.
Obstacle 1 could be lowered by changing the user interface.
I am fairly sure that concrete proposals and contributions toward removing these obstacles will be gladly received.
Okay, let's consider this set of sentences:
[JPN] 当店にはいろいろな種類の本がございます。(#123970)
[SPA] En esta tienda tenemos una gran variedad de libros. (#8764247)
[ENG] In this store, we have a wide variety of books. (#8768755)
If I understand correctly, your problem is that [ENG] is shown as an *indirect* translation of [JPN]. You think that if [SPA] is a direct translation of [JPN], then [ENG] should be a direct translation of [JPN] as well.
Assuming someone who speaks both English and Japanese agrees with that, then we have two ways to solve this inaccuracy:
1) An advanced contributor has to link [ENG] to [JPN].
2) A regular contributor has to add a translation to [JPN] that is the exact same text as [ENG].
In both cases, [ENG] will become a direct translation of [JPN].
The inaccuracy that you've seen everywhere is the result of another rule: if A is translated into B and B is translated into C, C is not necessarily a valid translation of A. A human has to confirm that A and C are equivalent.
In our set of sentences, what happened was:
- [JPN] was translated into [SPA].
- [SPA] was translated into [ENG].
By the rule I just mentioned, we cannot automatically assume that [ENG] is a translation of [JPN]. We have to wait until someone explicitly makes these two sentences translations of each other.
If you are very confident that [ENG] is a valid translation of [JPN], then go ahead and add it as a translation.
Don’t get me wrong, I have never said that this is a task that can be automated. It’s obvious that it takes some human interaction to link these cases. That is the reason why I was questioning the available workflow.
My idea is that if (A) gets translated into (B) and another group of sentences - regardless of their origin (originals or translations) - happen to be an exact translation of each other in the whole group (including B of course) then I could consider them being (B1, B2, B3, B4 etc.) and under the rule (A)==(B)==(B1)==(B2)== etc. all of them had to be directly linked - including with (A).
Or can you think of a situation where this rule would be ambiguous?
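To make the proposed rule concrete, here is a small Python sketch of it. This is only the rule as suggested in this post, not how Tatoeba works: given a source sentence, the existing direct links, and a group of sentences that a human has judged to be exact translations of each other, it lists the group members that would still need a direct link to the source. The function name `proposed_direct_links` and the labels are hypothetical.

```python
def proposed_direct_links(source, direct_links, equivalence_group):
    """List group members that the proposed rule would link directly to `source`.

    direct_links:      iterable of (sentence, translation) pairs
    equivalence_group: sentences a human judged to be exact mutual translations
    """
    linked = {frozenset(pair) for pair in direct_links}
    # The rule only applies if at least one group member is already
    # a direct translation of the source (the "B" in the post).
    anchored = any(frozenset((source, m)) in linked for m in equivalence_group)
    if not anchored:
        return []
    return sorted(
        m for m in equivalence_group if frozenset((source, m)) not in linked
    )


links = [("A", "B")]
print(proposed_direct_links("A", links, {"B", "B1", "B2"}))  # ['B1', 'B2']
```

The sketch only produces candidates; whether each candidate really is a valid translation of the source is exactly the question under discussion.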
BTW: Is the creation of the first link when translating a sentence the only automated creation of a link or are there other rules that could cause an automated creation of a link, be it direct or indirect?
Because if there is no auto-creation besides the first one I am wondering where all those INDIRECT inconsistencies like
• Tenemos una gran variedad de libros. (DIRECT)
• Nous avons un large choix de livres. (INDIRECT)
• We have a wide choice of books. (DIRECT)
• Nós temos uma ampla variedade de livros. (INDIRECT)
come from. We already have established the two ways for direct linking but someone must have created those INDIRECT links, too. Who or What is responsible for their existence?
I suspect that collections of exactly equivalent sentences are fairly rare. For example, personal pronouns and articles differ from one language to another, which already rules out many equivalences.
Indirect link:
If A is linked to B and B is linked to C, then A and C are indirectly linked.
There is an indirect link between two sentences if and only if they are two translations apart from each other, but not one.
So indirect links are not really made; rather, they arise as a consequence of direct links.
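The definition of an indirect link fits in a few lines of Python. This is a minimal sketch, not Tatoeba's actual code: direct links are treated as an undirected graph, and the indirect translations of a sentence are the sentences exactly two links away but not one. The function names and the labels JPN, SPA and ENG are placeholders for the three sentences discussed earlier in the thread.

```python
from collections import defaultdict


def build_graph(direct_links):
    """Turn (sentence, translation) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for a, b in direct_links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def indirect_translations(graph, sentence):
    """Sentences that are two direct links away from `sentence`, but not one."""
    direct = graph.get(sentence, set())
    two_steps = set()
    for neighbour in direct:
        two_steps |= graph.get(neighbour, set())
    return two_steps - direct - {sentence}


# JPN was translated into SPA, and SPA was translated into ENG.
links = [("JPN", "SPA"), ("SPA", "ENG")]
graph = build_graph(links)
print(indirect_translations(graph, "JPN"))  # {'ENG'}
```

Adding the pair ("JPN", "ENG") to `links` would make ENG a direct translation, and it would then drop out of the indirect set.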
Thanks :-)
Well, that explains a lot! So you are essentially inheriting one indirect link with every automated direct link on creation of a new translation as well as when linking two sentences manually.
1. What happens if somewhere along the chain somebody decides to unlink a sentence? Is there also some automated unlinking of indirect links going on?
2. Is there a way to find out whether
• a generated indirect link on automated creation (when adding a new translation) is in reality - as seen from a human perspective - more likely going to be useful as a direct link or an indirect one
• a generated indirect link on manual creation (when linking in post production) is in reality rather going to be useful as a direct link or an indirect one.
So all those incorrect indirect links I was referring to in my post above seem to be just wrong guesses of the automation which is essentially always applying an indirect connection. Either at creation or re-linking.
I am wondering whether the error rate would be bigger or smaller when applying a direct link as default. Or do you guys just play it safe by saying “better an incorrect indirect link than an incorrect direct link” - regardless of the hit ratio?
1. As far as I understand, indirect links are computed from the direct ones, so if a direct link is cut, some indirect ones probably disappear (but sometimes the cut link turns into an indirect one).
2. I am not aware of any automatic ways of assessing the validity of links. That would probably be difficult. Some AI researcher might be able to build one.
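Point 1 can be tried out on a toy graph. The sketch below assumes, as described, that indirect links are never stored but always derived from the current set of direct links; the function name `indirect_pairs` and the sentence labels are made up. Cutting a direct link can make an indirect link disappear, or turn the cut link itself into an indirect one.

```python
from itertools import combinations


def indirect_pairs(direct_links):
    """Derive all indirect links (as unordered pairs) from the direct links."""
    direct = {frozenset(pair) for pair in direct_links}
    neighbours = {}
    for a, b in direct_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    pairs = set()
    for around in neighbours.values():
        # Any two sentences sharing a direct neighbour are two steps apart.
        for x, y in combinations(sorted(around), 2):
            pair = frozenset((x, y))
            if pair not in direct:
                pairs.add(pair)
    return pairs


chain = [("A", "B"), ("B", "C")]
print(indirect_pairs(chain))       # A and C are indirectly linked
print(indirect_pairs(chain[:1]))   # B-C cut: the indirect link A-C is gone

triangle = [("A", "B"), ("B", "C"), ("A", "C")]
print(indirect_pairs(triangle))      # everything is direct, nothing indirect
print(indirect_pairs(triangle[:2]))  # A-C cut: it becomes an indirect link
```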
I think of indirect links in two roles:
a: They are tools for creating direct links. I can conveniently turn them into direct links.
b: If there are no direct links, they still give some idea of the sentence's meaning. So they act as less reliable substitutes for direct links.
So I do not consider an indirect link between two sentences that differ in meaning a problem, but a necessary consequence of the definition/nature of an indirect link.
...
If direct links were applied transitively, there would be a huge number of errors. The sentences would not even need to be complicated.
For example: He swims. <-> Hän ui. <-> She is swimming.
Or: Tu manges. <-> You are eating. <-> Vous mangez.
As a general tip: you need to visualize the sentences as a graph.
There is a short explanation in the wiki about the structure of the corpus:
https://en.wiki.tatoeba.org/art...-is-structured
Trang, would you mind giving a short statement about the rule I mentioned a little further above:
> My idea is that if (A) gets translated to (B) and another group of sentences - regardless of their origin (originals or translations) - happen to be an exact translation of each other in the whole group (including B of course) then I could consider them being (B1, B2, B3, B4 etc.) and under the rule (A)==(B)==(B1)==(B2)== etc. all of them had to be directly linked - including to (A).
Or can you think of any situation where this rule would be ambiguous?
The system contains no information about which sentences are exact translations of each other.
Your rule is conflicting with the rule I mentioned: if A is translated into B and B is translated into C, C is not necessarily a valid translation of A.
Just replace C with B1, then B2.
Here's an example with a graph visualization: https://imgur.com/a/YMwGyS7
- In case #1, your rule works, you can have (A) directly linked to (B1) and (B2).
- In case #2, your rule doesn't work. It would be wrong to directly link (A) to (B1) and (B2).
@Thanuir
@TRANG
Sorry, but I should have rather linked the citation instead of just copying it. Because you missed the context. So I give you two examples from above with their context.
Both are meant to be executed by a HUMAN and not expected to be solved by ML.
1. ——————————————————
> Don’t get me wrong, I have never said that this is a task that can be automated. It’s obvious that it takes some human interaction to link these cases. That is the reason why I was questioning the available workflow.
> My idea is that if (A) gets translated into (B) and another group of sentences - regardless of their origin (originals or translations) - happen to be an exact translation of each other in the whole group (including B of course) then I could consider them being (B1, B2, B3, B4 etc.) and under the rule (A)==(B)==(B1)==(B2)== etc. all of them had to be directly linked - including with (A).
2. ——————————————————
But if the original sentence e.g. somehow includes some notion that is remotely related to a local complement expressed with a preposition in many western languages and therefore can be translated in two different ways
Sentence #123970: 当店にはいろいろな種類の本がございます。(A)
• Tenemos una gran variedad de libros. (B)
• En esta tienda tenemos una gran variedad de libros. (C)
then this ‘special property’ - this ‘duplicity’ as it were - is rooted in the source language (A) and has to be an equally valid argument for any translation into any other language.
So with that in mind - if I now compare Group (B)
• We have a wide choice of books.
• Tenemos una gran variedad de libros.
• Nous avons un large choix de livres.
• Nós temos uma ampla variedade de livros.
etc.
and respectively Group (C )
• In this store, we have a wide variety of books.
• En esta tienda tenemos una gran variedad de libros.
etc.
- you get the point - then I would subsequently assume that they ALL had to be directly linked to the Japanese source sentence.
————————————————————
————————————————————
Example 2 is just an extended version of example 1 due to the fact that the Japanese source can be split into two translations
• Our store... -> Group (C )
• We... -> Group (B)
We know for a fact that (B) is the translation of (A).
We know for a fact that (C) is the translation of (A).
So if I, as a human, assess that
• translation (B) is an exact unambiguous translation of all the unambiguous sentences in Group (B) then all the Group (B) members should also be a direct translation of (A).
• translation (C) is an exact unambiguous translation of all the unambiguous sentences in Group (C ) then all the Group (C ) members should also be a direct translation of (A).
And by ‘unambiguous’ I mean that all ‘du/Sie/You/Usted’ ambiguities are taken into consideration.
Of course I have to blindly rely on the information (A)==(B) and (A)==(C ) respectively being correct.
And of course (B)!=(C ) [is not equal]
Do you see any problem with this approach? Or let me phrase it differently:
Can I safely draw this LOGICAL conclusion about (B) respectively (C ) even only having little or no LINGUISTIC knowledge about the SOURCE language (A)?
I will first try to restate the question in my own words, so that you can tell whether I understood it correctly or not.
Your proposal is that if you understand sentences B and C and in your view their meaning is exactly the same, and if A is linked to B and that link is reliable, then A and C could also be linked directly.
I would be very careful with this. If there were two sentences, B and B', that are both accurate translations of sentence C but differ in nuance, you cannot know which of them A corresponds to, or perhaps both.
For example:
C = "Hän on siellä aina." <-> B = "He is always there."
The sentence C' = "Hän on aina siellä." is not linked to those, but it would be a fully valid and accurate translation of B.
You have some sentence A that is linked to sentence B. Is it a good translation of C? Maybe. Perhaps that language also allows a freer word order than English, or has some other way of making subtle distinctions.
Still, I would say that there is no difference in meaning between sentences B and C. Sentence C' would not necessarily come to mind when I thought about it.
So I doubt whether it is humanly possible to be sure that two sentences mean exactly the same thing, or whether the concept is even meaningful.
In practice, there is an endless amount of work on Tatoeba for anyone, so a user can perfectly well concentrate on linking sentences in the languages they know. That work rarely runs out.
I never correlate B to C in any way. I just threw in C because it was a concrete real-life example of this Japanese sentence that could be split into two different threads A - B and A - C respectively...
Consider this example just being about A and B.
• If all sentences of Group B are unambiguously identical in several languages (B1, B2, B3)
• and sentence B is unambiguously identical with all the sentences in Group B
given that B is a DIRECT translation of A
• Can all sentences of Group B also be safely DIRECTLY linked to A?
————————————————————
In a concrete practical example
Sentence #123970: 当店にはいろいろな種類の本がございます。(A)
• Tenemos una gran variedad de libros. (B)
Group (B)
• We have a wide choice of books. (B1)
• Nous avons un large choix de livres. (B2)
• Nós temos uma ampla variedade de livros. (B3)
etc.
————————————————————
You can ask the same question about A -> C without ever correlating B and C.
Sentence #123970: 当店にはいろいろな種類の本がございます。(A)
• En esta tienda tenemos una gran variedad de libros. (C)
Group (C )
• In this store, we have a wide variety of books.
• En esta tienda tenemos una gran variedad de libros.
etc.
———————————————————
I am LOGICALLY inferring a correlation between Group B and sentence A simply based
• on my knowledge of language B
• on my knowledge of languages of Group B (B1, B2, B3)
• on the fact that B is a direct translation of A
Is this vulnerable?
I wouldn't dare do that, because even if the B sentences seemed equivalent to me, perhaps A makes a distinction or contains a nuance that I know nothing about.
The B sentences are not necessarily equivalent anymore once this nuance is taken into account.
So you are essentially saying that a B1, B2, B3 speaker has to inevitably know [LINGUISTICALLY] language A in order to link to A and should not draw any [LOGICAL] conclusion at all.
Because this nuance in (A) that you are mentioning is intrinsic to (A) and must already have been taken into consideration by the A-to-B translator, otherwise B would already be incorrect.
(Of course, if the A-to-B translator changes his mind later on, this would jeopardize the whole chain, but this is a general problem...)
Could you come up with an example of such a nuance? I personally can’t find any vulnerability in the logical approach yet and still consider it an option until proven wrong. (I know I am a persistent, thorough little sucker :-)
As you can surely tell, my main background is in Romance and Germanic languages.
So if e.g. the source A were English and I had to deal with e.g.
• You are
and I see B (German translation)
• Du bist
I can safely assume that these translations are correct
• Você é
• (Tu) eres
• Tu es
• (Tu) sei
However if I saw B
• Você é
• Ustedes son
I would know that these forms are ambiguous and wouldn’t draw any logical conclusions.
However if I saw B (Spanish translation)
• Nosotras
I would know that source (A) either is referring to women only or it doesn’t make any difference and the translation to Spanish just took the liberty to use its feminine form only - which is totally valid.
So without knowing anything of the source language (A) I can only consider linking languages that make that distinction too and do also use a feminine form, although the source might allow for a masculine form too.
Knowing the source language (A) of course I could see its real intention and in case of a bi-neutral source form even add a new Spanish sentence with its masculine counterpart.
If instead of ‘nosotras’ I saw
• Nosotros
that would tell me that the source A is either explicitly masculine or bi-neutral which would only allow for translations to languages that have a dedicated masculine form because I can’t evaluate source A.
However, if I see two translations, either of the same language or even across two different languages, where one uses a masculine form and the second a feminine form, I know that the source language allows for both, otherwise one of the two translations must be wrong.
So you see where I am going with that!
I wouldn't link sentences that I don't understand.
No easy example comes to mind right now. I suspect this is a matter of balance:
If you interpret the equivalence of sentences strictly enough, misses won't necessarily happen, but you won't find many situations where the rule can be applied.
If you interpret the equivalence of sentences loosely, errors will occur.
> If A is linked to B and B is linked to C, then A and C are indirectly linked.
> There is an indirect link between sentences if and only if they are two translations away from each other, but not one.
> So indirect links are not so much made; rather, they arise as a consequence of direct links.
————————————————————
Does that automatically imply that if there are (in addition to A) more sentences (even from different languages) directly linked to (B) - let’s simply call them A1, A2, A3 etc. - then they would all be automatically indirectly linked to (C) after (B) gets directly linked to (C)?
Or in other words, the creation of the
• direct link B - C
autocreates (or calculates)
• indirect link A - C
• indirect link A1 - C
• indirect link A2 - C
• indirect link A3 - C
given the fact that every member of Group A is directly linked to (B)?
————————————————————
@TRANG
So when my simplistic point of view above (autocreation!!!) has to be translated into the world of graphs (that you mentioned in another post) I guess there is no creation of any stored INDIRECT links but rather a ‘calculation on display’ based on the object graph.
Could you share some short thoughts on how this works internally? You mentioned nodes (sentences) and their connections (links) as the only objects in this graph. So is this some kind of one/many-to-one/many clusters working together?
Or - if this is asking too much - you could simply provide a complete list of the circumstances under which links are created/calculated/generated.
DIRECT LINK
• creating a new translation (system is auto-creating a link)
• manually linking by authorized user
•
•
INDIRECT LINK
•
•
•
And in case you are wondering why I am interested in this information...
I am working on some suggestions for a better review workflow/UI and in order to be able to reasonably argue about improvements I’d prefer to have a pretty complete understanding of the underlying mechanics...
> Does that automatically imply that if there are (in addition to A) more sentences (even from different languages) directly linked to (B) - let’s simply call them A1, A2, A3 etc. - then they would all be automatically indirectly linked to (C) after (B) gets directly linked to (C)?
As far as I know, yes.
I don't know how Tatoeba finds the indirect links, but in principle they could all be worked out this way: compute the second power of the adjacency matrix, change all positive numbers to ones, subtract the adjacency matrix from that, and finally zero the diagonal. This is a mathematician's solution, not a programmer's, so it is hardly practical.
See for example https://en.wikipedia.org/wiki/A...#Matrix_powers
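To make the matrix idea above concrete, here is a minimal Python sketch on a made-up four-sentence graph. It is only an illustration, not how Tatoeba computes anything. One caveat about the recipe: subtracting the adjacency matrix literally would give -1 for a directly linked pair that has no common neighbour, so the sketch simply skips direct links instead of subtracting them. The function name and the toy data are hypothetical.

```python
def indirect_link_matrix(adj):
    """Return a 0/1 matrix marking pairs that are two hops apart but not one."""
    n = len(adj)
    result = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or adj[i][j]:
                continue  # skip the diagonal and pairs that are directly linked
            # Entry (i, j) of the squared matrix counts the paths i -> k -> j.
            paths = sum(adj[i][k] * adj[k][j] for k in range(n))
            result[i][j] = 1 if paths > 0 else 0
    return result


# Hypothetical toy graph: A==B, B==C, A==D (indices 0..3 stand for A..D).
adj = [
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
indirect = indirect_link_matrix(adj)
```

In this toy graph the only indirect pairs are A and C (via B) and B and D (via A). For a corpus with hundreds of thousands of sentences a dense matrix like this would be far too large, which matches the remark that this is hardly a practical solution.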
> Could you share some short thoughts of how this works
> internally.
Imagine two tables, "sentences" and "links".
"sentences" has the following columns:
- id
- lang
- text
"links" has the following columns:
- sentence_id
- translation_id
Whenever you add a new sentence (A), a new line is added in "sentences".
- id=1, lang=eng, text=A
Whenever you add a translation (B) to the sentence (A), a new line is added in "sentences", then two lines are added in "links".
- id=2, lang=fra, text=B
- sentence_id=1, translation_id=2
- sentence_id=2, translation_id=1
The tables I have described are part of the files that we distribute under "Sentences" and "Links" on our Downloads page.
https://tatoeba.org/eng/downloads
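For readers who prefer to see the two tables in action, the following is a small sketch using Python's built-in sqlite3 module. It mirrors the columns described above; the helper names add_sentence and add_translation are made up for illustration and are not Tatoeba's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")
conn.execute("CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER)")


def add_sentence(conn, lang, text):
    """Add one row to "sentences" and return its new id."""
    cur = conn.execute(
        "INSERT INTO sentences (lang, text) VALUES (?, ?)", (lang, text)
    )
    return cur.lastrowid


def add_translation(conn, sentence_id, lang, text):
    """Add the translation as a sentence, then two rows to "links"."""
    translation_id = add_sentence(conn, lang, text)
    conn.executemany(
        "INSERT INTO links (sentence_id, translation_id) VALUES (?, ?)",
        [(sentence_id, translation_id), (translation_id, sentence_id)],
    )
    return translation_id


a = add_sentence(conn, "eng", "A")
b = add_translation(conn, a, "fra", "B")
rows = conn.execute(
    "SELECT sentence_id, translation_id FROM links ORDER BY sentence_id"
).fetchall()
```

After these two calls, "sentences" holds two rows and "links" holds the pair (1, 2) and its mirror image (2, 1).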
> I am working on some suggestions for a better review
> workflow/UI
I can already tell you what a better workflow and UI could look like in the grand scheme.
1) We should have a page that allows contributors to check sentences alone, without their translations. This ensures that the items in the table "sentences" are correct.
2) We should have another page that shows only a pair of sentences and lets people confirm whether or not the two sentences are translations of each other. This ensures that the items in the table "links" are correct.
3) We should provide the possibility to attach meta-data to links, not just sentences.
OK - so let me get a little bit ‘unstructured’ and ask you some loose questions to help me get on track. I will be calling this reciprocal pair of links a ‘connection’ from now on.
1. Initially a connection between two sentences always points in two directions with the help of two links.
• A is a translation of B
• B is a translation of A.
2. Breaking a connection between two sentences is achieved by removing both links.
3. Removing both links of a connection
• sentence_id=1, translation_id=2
• sentence_id=2, translation_id=1
does not affect any other link to/from either of the two participants (id=1 and id=2)
4. Re-linking a sentence is achieved by breaking the old connection (removing 2 links) and establishing a new connection (adding 2 links)
5. Is the database considered as being inconsistent if for some reason one link of this pair survives and this ‘half connection’ only points in one direction? Something like
• A is a translation of B
• B is not a translation of A
6. Sentences that are considered as being ‘indirectly linked’ are simply sentences that are two hops away from each other.
• A==B
• B==C
• A—C
7. Finding all direct links of sentence A requires to query the sentence_ID field of ALL links in the database against the ID of sentence A.
8. Finding all indirect links of sentence A requires to query the sentence_ID field of ALL links in the database against the ID of every single result of the query conducted in (7.)
9. Turning an indirect link into a direct link is achieved by establishing a new connection (adding 2 links)
• A==B (existing connection)
• A==C (existing connection)
• B==D (existing connection)
• A—D indirect link because of two hops (A==B, B==D)
• A==D (newly created connection)
But creating A==D doesn’t change anything for the already existing 2 hop relationship between A—D (A==B, B==D) - so I am essentially left with a direct and an indirect link at the same time?!?!?
There is obviously something I got wrong at an earlier stage...
10. Is there a way to distinguish between
• a connection (2 links) that is automatically being supplied/created by the system when a user contributes a translation
• a connection (2 links) that is manually created by a user either by turning an indirect link into a direct link or by simply establishing a new connection
11. How can you determine/trace the owner of a sentence and all his/her metrics (sentence count etc.)
1, 2, 3, 4: Yes.
5: Yes. If the database says that A is a translation of B, but B is not a translation of A, then it is inconsistent.
6: Yes.
7:
> Finding all direct links of sentence A requires to query the sentence_ID field of ALL links in the database against the ID of sentence A.
Well, yes and no. The database is organized in such a way that you don't need to page through all links in the database in order to find the ones connected to the ID of sentence A. But aside from efficiency considerations, the net result is the same: you are looking for all links associated with the ID of sentence A.
8.
> Finding all indirect links of sentence A requires to query the sentence_ID field of ALL links in the database against the ID of every single result of the query conducted in (7.)
Again, the algorithm is much more efficient than that, but the net result is the same.
9.
> But creating A==D doesn’t change anything for the already existing 2 hop relationship between A—D (A==B, B==D) - so I am essentially left with a direct and an indirect link at the same time?!?!?
Yes. In fact, between any two sentences that are directly linked (A==D), there can also be any number of indirect links (A==E, E==D; A==F, F==D; and so on).
10.
> Is there a way to distinguish between
• a connection (2 links) that is automatically being supplied/created by the system when a user contributes a translation
• a connection (2 links) that is manually created by a user either by turning an indirect link into a direct link or by simply establishing a new connection
Yes. There are various ways that one could keep track of such a distinction.
11.
> How can you determine/trace the owner of a sentence and all his/her metrics (sentence count etc.)
A database is designed to let you execute such queries, and do it efficiently, as long as you have recorded the relationships. In the same way that you can have one table that keeps track of sentences and their IDs, and another that keeps track of the links between pairs of sentences, you can have a table that associates sentences with owners, or vice versa.
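The answers to 7, 8 and 9 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the link rows are given as (sentence_id, translation_id) pairs, the IDs are made up, and the function names are hypothetical. It says nothing about how Tatoeba's server does it. The index plays the role of the database organization mentioned in answer 7: it is built in one pass, after which each lookup touches only the rows for the sentence in question.

```python
from collections import defaultdict


def build_index(link_rows):
    """Map each sentence ID to the set of IDs it is directly linked to."""
    index = defaultdict(set)
    for sentence_id, translation_id in link_rows:
        index[sentence_id].add(translation_id)
    return index


def direct_links(index, sentence_id):
    """Return the sentences one hop away."""
    return set(index.get(sentence_id, ()))


def two_hop_paths(index, sentence_id):
    """Map each sentence reachable in two hops to the set of intermediates."""
    paths = defaultdict(set)
    for middle in index.get(sentence_id, ()):
        for target in index.get(middle, ()):
            if target != sentence_id:
                paths[target].add(middle)
    return dict(paths)


# Hypothetical IDs: A=1, B=2, C=3, D=4, with A==B, A==C, B==D and A==D.
pairs = [(1, 2), (1, 3), (2, 4), (1, 4)]
rows = pairs + [(b, a) for a, b in pairs]  # each connection is stored both ways
index = build_index(rows)
```

Here direct_links(index, 1) gives B, C and D, while two_hop_paths(index, 1) still reports D as reachable via B (and B via D). That is the situation from question 9: a direct link and a two-hop path between the same two sentences can coexist.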
That was a fine and helpful response, Alan.
5. Does this happen sometimes with the Tatoeba database - and if yes, is this always due to a bug or are there other possible reasons.
Can such an inconsistency make its way until the final LINK output file?
7./8. Well I am sure that the server is index optimized for that application but if somebody wants to work with the two downloadable offline output files of the database would the methods I described above be the starting point for a query and I would have to write any optimization myself?
9. I am not sure whether you really understood where I was going here.
I really meant that for my direct link A==D I could also have an indirect link A—D in the same list due to the ‘2 hop rule derivation’ from (A==B, B==D).
A—D is always present and isn’t simply invalidated by the fact that I created an additional direct link A==D. Theoretically I could even have several identical indirect links A—D (derived from several different 2 hops) for one direct link A==D.
If that is true, then for a sentence UI presentation like the Tatoeba sentence page I would have to diff all indirect links A—D against a potentially already existing direct link A==D, so that a translation does not show up as both a direct link and an indirect link in one listing of translations?
10./11. I was a little confused here because Trang made it look like the whole Tatoeba database is just comprised of these two files - SENTENCES and LINKS - and everything could be interpolated and calculated from them :-)
But it seems there is more information stored about relationships of records.
So the most important questions for me right now are
12. Which events do create a direct link?
DIRECT LINK CREATOR LIST
• creating a new translation -> system is auto-creating a connection (2 links)
• manually linking by authorized user
• manually de-linking and re-linking to another sentence by authorized user
•
•
13. Are indirect links solely derived from ‘two hop derivations’ in the graph or are there other methods/events for creating (respectively other fields for storing) indirect links somewhere in the database?
INDIRECT LINK CREATOR LIST
• Derivation from the object graph of the LINK file (2 hop rule) by the system
•
•
14. Can a programmer at Tatoeba retrieve the following information from the database.
• What - in the DIRECT LINK CREATOR LIST - has initiated the creation of every individual connection in the database?
• What - in the INDIRECT LINK CREATOR LIST - was responsible for the existence of every individual indirect connection in the database?
9. We don't display a sentence as indirect translation if it is also direct translation.
The sentences are stored using a graph structure, but they are displayed using a table structure. When we turn the graph into a table, we have to make some choices.
Yes, technically speaking, if A = B and B = D and A = D, we could display D both in the list of "Translations" and the list of "Translations of translations". But we choose not to because it's not really useful.
10/11. The core of Tatoeba is those two files: sentences and links. If you look at the rest of the files, you'll see there's more.
The exact structure of Tatoeba's database is described here:
https://tatoeba.org/eng/wall/sh...#message_35234
12. Links are created when someone adds a translation or when an advanced contributor clicks the link button.
https://en.wiki.tatoeba.org/art.../intro-linking
13. Indirect links are solely derived from two hop derivations.
14. Your question is unclear. We can retrieve who has created the link. We can (but not always) retrieve if it was created by clicking on the "translate" button or by clicking on the "link" button.
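As an illustration of answers 9 and 13 above, here is one way the two display lists could be derived from a "links" table with a self-join, sketched with Python's sqlite3 module. The query is an assumption made for the sake of the example, not Tatoeba's actual query, and the IDs and function names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER)")
# Hypothetical IDs: A=1, B=2, C=3, D=4, with A==B, B==D, A==D and B==C.
pairs = [(1, 2), (2, 4), (1, 4), (2, 3)]
conn.executemany(
    "INSERT INTO links VALUES (?, ?)", pairs + [(b, a) for a, b in pairs]
)


def translations(conn, sentence_id):
    """Direct links: one hop away."""
    rows = conn.execute(
        "SELECT translation_id FROM links WHERE sentence_id = ? "
        "ORDER BY translation_id",
        (sentence_id,),
    )
    return [r[0] for r in rows]


def translations_of_translations(conn, sentence_id):
    """Two hops away, leaving out the sentence itself and its direct links."""
    rows = conn.execute(
        """
        SELECT DISTINCT second.translation_id
        FROM links AS first
        JOIN links AS second ON second.sentence_id = first.translation_id
        WHERE first.sentence_id = :id
          AND second.translation_id != :id
          AND second.translation_id NOT IN (
              SELECT translation_id FROM links WHERE sentence_id = :id
          )
        ORDER BY second.translation_id
        """,
        {"id": sentence_id},
    )
    return [r[0] for r in rows]
```

For sentence A, translations(conn, 1) lists B and D, and translations_of_translations(conn, 1) lists only C. D is reachable in two hops via B, but it is filtered out because it is already a direct translation.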
13. So if I create a new sentence and there is no translation available yet, but I do indeed see some similar sentences that offer the opportunity for being useful as indirect links, there is no way of doing this explicitly because of the two-hop-rule?
So what is the procedure to achieve this, how do I place my new sentence two hops away from all the potential candidates for getting an indirect link?
10./11. Gonna have a look at those links, thanks!
5. (after Alan’s answer) ???
7./8. (after Alan’s answer) ???
————
One possibility is to leave a comment like I left on this sentence.
[#2645879] Tom has stopped crying. (CK) *audio*
Related:
[#6355107] Tom isn't crying anymore. (CK) *audio*
Eventually, perhaps these are related closely enough that some language can use the same sentence as a translation for both and then they will become indirectly-linked to each other.
I understand, but I would rather like to know a way to achieve this right away instead of waiting for better times.
The more links there are in the database, the more indirect links arise.
So in general you can translate as many sentences as you can, in as many ways as you can.
If you want a translation for a particular sentence, you can ask someone who translates from the language in question to translate it.
I will have to leave it to the rest of the community to answer questions that are still pending.
Note that we have a dev website where you can experiment as much as you need and are free to pollute the database with test sentences and test translations.
https://dev.tatoeba.org/
You can register a new account. If you need to be granted advanced contributor status over there so that you can use the linking feature, let me know what your account is.
> how do I place my new sentence two hops away from all the potential candidates for getting an indirect link?
You create a link between your new sentence and a sentence which has a direct link (i.e. it is "one hop away") to the "potential candidate".
But I don't understand why you want to have a sentence "two hops away" from another one. Can you give a concrete example?
> 5. Does this happen sometimes with the Tatoeba database - and if yes, is this always due to a bug or are there other possible reasons. Can such an inconsistency make its way until the final LINK output file?
I've found the following pairs in the current links.csv where only one part of the link is recorded in the database:
#247164 #5078553
#1423834 #3214227
#1918219 #1918220
#1918235 #1918236
#1943238 #3752771
#1943243 #3075927
#1943259 #3942190
#1943259 #3942318
#1943259 #3942320
#1943259 #3942329
#1943259 #3942351
#3778082 #5094901
#5755767 #7207453
#5755769 #7207455
#5850721 #5868889
Without further investigation I guess these inconsistencies are results of a bug (already fixed or still in the code).
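A check like the one above can be sketched in a few lines of Python. This is not the script that produced the list; it is a minimal example that assumes the links file is tab-separated with the sentence ID in the first column and the translation ID in the second. The function name and the sample data are made up.

```python
import csv
import io


def find_one_way_links(lines):
    """Return sorted (low_id, high_id) pairs that are stored in one direction only."""
    links = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) >= 2:
            links.add((int(row[0]), int(row[1])))
    # A link is one-way if its mirror image is missing from the set.
    missing = {(min(a, b), max(a, b)) for a, b in links if (b, a) not in links}
    return sorted(missing)


# Made-up sample data; with the real export, pass an open file handle instead.
sample = io.StringIO("1\t2\n2\t1\n3\t4\n5\t6\n6\t5\n")
one_way = find_one_way_links(sample)
```

In the sample, only the pair 3 and 4 lacks its reverse row, so it is the only one reported. The whole file is held in memory as a set of pairs, which keeps each reverse lookup cheap.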
> 7./8. Well I am sure that the server is index optimized for that application but if somebody wants to work with the two downloadable offline output files of the database would the methods I described above be the starting point for a query and I would have to write any optimization myself?
You don't need the optimization if you don't care that the query takes a little bit longer.
You could also import the data into a local database.
Thanks a lot for chiming in.
13. If I create a new sentence from scratch in my own language then there is no translation (direct link) available yet. OK?
I do however see some similar sentences that I want to show up as indirect links in the translation list beneath my naked new sentence, to indicate that they are ‘only similar’ - that’s what indirect links do.
However, according to the two-hop rule there is no indirect link without a direct link in between - translation of a translation!
So what hack of linking and unlinking do I have to perform to have a similar sentence show up as indirect link in the list of translations without having a directly linked translation yet?
Or is there a (legal) way of doing this explicitly?
5. I was wondering if there is a special reason behind duplicating/splitting every edge into two links if they only get created and destroyed in pairs anyway?
Instead of having two links
• A==B
• B==A
I could easily get along with
• A==B
and just read it out twice, the second time just in reverse (back-to-front).
In this case the links-file would only be half the size and inconsistency would be impossible.
7./8. I just wanted to make sure whether the two downloadable files give me any bells and whistles or whether I am all alone with the pure basic data and my approach of querying the whole list is the right starting point before any optimization should kick in.
Querying the whole set several times just seems to be so ridiculously expensive.
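The idea in point 5, storing each connection once and reading it in both directions, could look roughly like this in Python. The class and method names are hypothetical, and this is only a sketch of the proposal, not of how Tatoeba stores links.

```python
class UndirectedLinks:
    """Store each connection once, as a (smaller_id, larger_id) pair."""

    def __init__(self):
        self._edges = set()

    def connect(self, a, b):
        # Normalizing the order makes A==B and B==A the same stored edge.
        self._edges.add((min(a, b), max(a, b)))

    def disconnect(self, a, b):
        self._edges.discard((min(a, b), max(a, b)))

    def translations_of(self, sentence_id):
        """Read every edge in both directions to collect the direct links."""
        found = set()
        for low, high in self._edges:
            if low == sentence_id:
                found.add(high)
            elif high == sentence_id:
                found.add(low)
        return found


links = UndirectedLinks()
links.connect(2, 1)
links.connect(1, 3)
```

With this layout a half connection cannot exist, since there is only one record per pair. The price is that every lookup has to check both positions of each edge, whereas with two rows per connection a lookup on sentence_id alone is enough.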
> I do however see some similar sentences that I want to show up as indirect links in the translation list beneath my naked new sentence, to indicate that they are ‘only similar’ - that’s what indirect links do.
Do they? I may be wrong (I'm rather new to Tatoeba) but I don't think indirect links are supposed to indicate similarity between sentences.
As I understand it they are only shown because they are helpful in finding indirect translations which could be turned into direct translations.
I guess your use case is described in https://github.com/Tatoeba/tatoeba2/issues/1902 but you want to have similarity between different languages.
(I would still be interested in a concrete example from Tatoeba's corpus.)
> So what hack of linking and unlinking do I have to perform to have a similar sentence show up as indirect link in the list of translations without having a directly linked translation yet?
Since indirect links don't exist in the database and are only calculated there is no way to do that without the direct link.
> 5. I was wondering if there is a special reason behind duplicating/splitting every edge into two links if they only get created and destroyed in pairs anyway?
Good question :-) I don't think that's necessary but that part of the code was written in 2010 so I don't know the reason: https://github.com/Tatoeba/tato...52525f1R45-R47
> Querying the whole set several times just seems to be so ridiculously expensive.
That's why I would import the data into a local database.
Did you write some program/script for querying the data? I guess the most time consuming part is reading in all the data.
And yes, the two files only contain the pure data, no bells and whistles included.
@mramosh
If someone translates the orphan German sentences into English or French, I could follow in Kabyle, and many others could do the same in their language, provided that they understand it well. That's my opinion, even though it could be seen as "unfair" and as an undue advantage for the most widely used languages in the world. At the same time, I suggest that you follow our orphan Kabyle sentences where they are made visible by other translations. This could be mutually helpful for German and Kabyle sentences, and in turn for all the others.
> I suggest that you follow our orphan Kabyle sentences where they are made visible by other translations.
I am told we shouldn’t translate from languages that we don’t speak just by deducing their meaning from already existing translations to languages that we know.
@mramosh
First, please notice that I had suggested reciprocity: German / Kabyle already translated and Kabyle / German already translated.
If you follow the reasoning suggested to you, most of the world's little-used languages will remain orphans. That goes against the spirit of a multilingual platform. Note that I don't mind too much, not at all; I'm used to racking my brain to understand after multiple cross translations and painstaking research. If you follow my gaze, I think you will end up agreeing with me.
Only idiomatic expressions can be problematic, this difficulty being circumvented by an epistolary exchange between two or more people who really want to work for better mutual understanding. This was the case, especially with @AlanF_US, who worked wonders to understand Kabyle idioms that the great masses do not master, except the initiates.
So give me an example!
e.g. I find a Kabyle sentence that is directly linked to a German one.
How do you suggest to proceed from here and for what goal?
> e.g. I find a Kabyle sentence that is directly linked to a German one.
How do you suggest to proceed from here and for what goal?
If you read carefully, the answer is given in my above comment.
I couldn‘t figure it out, that’s why I was asking again.
The above seemed to me like you were asking us to add a kab-ger translation by looking at some already existing kab-eng or kab-French translations...
So again:
e.g. I find a Kabyle sentence that is directly linked to a German one.
How do you suggest to proceed from here and for what goal?
And vice-versa.
Ger-kab / Kab-ger
That's what I wrote above.
How? >> with help of both French-English..
For what goal? >>> for intercomprehension.
I am still not sure whether I understand you correctly, but if you want to have the German sentence (of a ger-kab translation pair) translated to English/French in order to increase the visibility of Kabyle by creating more translations and links between Kabyle and English/French, then you must ask an English/French native speaker to translate these German sentences, not a German speaker. We only translate from English/French to German because that’s what we know well.
If I guessed incorrectly, could you describe step by step how I, as a native German speaker, can be of use in your endeavor?
@mramosh
> We only translate from English/French to German because that’s what we know well.
It's exactly what I said!
But don't worry about the Kabyle language.
We are tenacious!