Some minor stats just because I was curious.
The TOP-25 contributor chart:
The chart above relies on the information in CK's profile that the following accounts are his: CK, CM, CN, CF, CH, CT, CC, Source_VOA and RM. I merged the contributions from these accounts into one, CK, for the sake of these statistics.
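For anyone who wants to reproduce the merge, here is a minimal sketch of how the alias folding could be done. It is not the script behind the chart: the names `ALIASES` and `top_contributors` and the sample owner list are made up for illustration, and it assumes you already have one owner name per sentence.

```python
from collections import Counter

# Accounts listed on CK's profile, folded into the single name "CK".
ALIASES = {name: "CK" for name in
           ["CM", "CN", "CF", "CH", "CT", "CC", "Source_VOA", "RM"]}


def top_contributors(owners, n=25, aliases=ALIASES):
    """Count sentences per owner after folding aliases; return the top n."""
    counts = Counter(aliases.get(owner, owner) for owner in owners)
    return counts.most_common(n)


# Made-up sample: one entry per owned sentence.
sample_owners = ["CK", "CM", "bob", "RM", "bob", "alice"]
top_two = top_contributors(sample_owners, n=2)
```

With the sample above, the three sentences owned by CK, CM and RM are all counted under CK.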
The role of one top contributor in TOP-25 languages (the last column is the percentage of the corpus in that language that belongs to that contributor):
What would that TOP-25 look like without the top contributor?
Thanks, this is interesting information.
No worries. I'm trying to figure out how to represent how dependent each language corpus is on a very limited number of contributors, so this was the first step.
I'm not sure whether it's some kind of bug or data integrity problem, or whether it's by design.
Most of the links in links.csv have a "mirroring" pair: if sentence X is linked to Y, then Y is linked to X (and that's a separate record). However, some pairs don't have this mirroring record. Can someone please explain what the deal is with those?
#247164 <-> #5078553
#1943259 <-> #3942318
#1943259 <-> #3942190
#1943259 <-> #3942320
#1918220 <-> #1918219
#1918236 <-> #1918235
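A check like the one that produced the list above can be sketched in a few lines. This is only an illustration, not the script actually used: it assumes links.csv is tab-separated with two integer columns (sentence ID, translation ID), and the function names and sample pairs are made up.

```python
import csv


def load_links(path):
    """Read (sentence_id, translation_id) pairs from a tab-separated file.

    Assumption: two integer columns per row, as in the links.csv export.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return [(int(row[0]), int(row[1]))
                for row in csv.reader(handle, delimiter="\t") if row]


def find_unmirrored(links):
    """Return every (a, b) record whose reverse record (b, a) is missing."""
    seen = set(links)
    return sorted(pair for pair in seen if (pair[1], pair[0]) not in seen)


# Made-up sample: 1 <-> 2 is mirrored, 3 -> 4 is not.
sample_links = [(1, 2), (2, 1), (3, 4)]
unmirrored = find_unmirrored(sample_links)
```

On the real file you would call `find_unmirrored(load_links("links.csv"))`; holding all pairs in a set costs some memory but makes the reverse lookup cheap.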
First of all, thank you very much for the tables. That's very interesting information. May I ask if "perc" and "total_sentences_minus_top_contributor" take into account that some users have sentences in different languages? So are they actually the numbers with the TOP-25 contributors' sentences removed (and not just a subtraction)?
For your question about links, I think you will find (some) answers here: https://github.com/Tatoeba/tatoeba2/issues/2269
> May I ask if "perc" and "total_sentences_minus_top_contributor" take into account that some users have sentences in different languages?
They only take into account the sentences in that language.
So, the very first chart (the TOP-25 contributor chart) doesn't take into account the language in which the contributions were made. For example, maaster owns 77,059 sentences in different languages. It's a very simple chart; a more complicated version of it is posted by CK from time to time - sentence count in native language.
The second chart shows the top contributor in each language, the number of sentences that contributor owns in that language only, and the top contributor's share of the total corpus in that language. So, in the case of maaster and Hungarian, it shows that 71,309 Hungarian sentences belong to maaster, and the percentage is calculated from this number.
The third chart shows what would happen to each corpus if the top contributor in that language hadn't contributed in that language only. So maaster's sentences are not counted in the Hungarian corpus numbers, but they are counted in the English corpus numbers.
Also, I've been using "contributions" and "sentences owned" interchangeably. I'm aware those terms are not really the same: some people unown their sentences in non-native languages, so while technically they contributed them, they don't own them. The statistics only take into account sentences owned.
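To make the per-language calculation concrete, here is a rough sketch of how the second and third charts could be computed from (language, owner) records. The function name, the dictionary keys other than "perc", and the sample data are hypothetical; this is not the script behind the charts.

```python
from collections import Counter, defaultdict


def top_share_by_language(records):
    """For each language, find the top owner and their share of that corpus.

    `records` is an iterable of (language, owner) pairs, one per owned sentence.
    """
    per_language = defaultdict(Counter)
    for language, owner in records:
        per_language[language][owner] += 1

    rows = {}
    for language, counts in per_language.items():
        owner, owned = counts.most_common(1)[0]
        total = sum(counts.values())
        rows[language] = {
            "top": owner,
            "owned": owned,                    # sentences in this language only
            "total": total,
            "perc": 100 * owned / total,       # share of this language's corpus
            "total_minus_top": total - owned,  # corpus size without the top owner
        }
    return rows


# Made-up sample: maaster owns 3 of 4 Hungarian and 1 of 3 English sentences.
sample_records = [("hun", "maaster")] * 3 + [("hun", "x"), ("eng", "maaster"),
                                             ("eng", "y"), ("eng", "y")]
shares = top_share_by_language(sample_records)
```

In the sample, maaster's sentences are subtracted from the Hungarian total only; they still count towards the English total, where "y" is the top owner.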
EDIT Please proceed to this post about the clouds:
The post below is based on a script that had a mistake in it, but I'm leaving it here for the sake of discussion.
One more thing I've always been interested in is groups of sentences linked through direct and indirect translations (and by indirect I mean not only the first level, but any level: translations of translations of translations, then translations of translations of translations of translations, etc.).
I call those "clouds".
Each cloud, for the sake of these stats, is formed around the sentence with the smallest ID in the group.
There are 1,942,648 "clouds". 838,821 "clouds" consist of just one sentence, meaning there are 838,821 sentences without a single translation. This means there are 1,103,827 non-trivial clouds.
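In graph terms, a cloud as defined here is a connected component of the translation graph. One way to compute that is a union-find structure that always keeps the smallest ID as the representative. The sketch below is an illustration under that reading, not the script behind the numbers in this post; the function name and sample IDs are made up.

```python
from collections import Counter


def build_clouds(sentence_ids, links):
    """Map each sentence ID to the smallest ID in its cloud."""
    parent = {sentence: sentence for sentence in sentence_ids}

    def find(x):
        # Walk up to the root, then point every node on the way at it.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for a, b in links:
        if a not in parent or b not in parent:
            continue  # link points at a sentence missing from the export
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so the root of
            # every cloud is always its smallest sentence ID.
            low, high = min(root_a, root_b), max(root_a, root_b)
            parent[high] = low

    return {sentence: find(sentence) for sentence in parent}


# Made-up sample: 1-2-3 form one cloud, 4-5 another, 6 has no translations.
clouds = build_clouds(range(1, 7), [(1, 2), (2, 3), (4, 5)])
sizes = Counter(clouds.values())
singletons = sum(1 for size in sizes.values() if size == 1)
```

Because links are followed at any depth, two sentences end up in the same cloud as soon as any chain of translations connects them.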
The biggest cloud is the one formed around the "I love you" sentence; it contains 822 sentences. Here it is: https://tatoeba.org/eng/sentences/show/1434
The second largest cloud is formed around "你在干什麼啊？" - there are 815 sentences in the cloud. This one: https://tatoeba.org/eng/sentences/show/3
Top 100 clouds:
I'll be using reddit for my stats in the future. It has better formatting options compared to Tatoeba.
Thanks, I did wonder about this at one point. Does the distribution of sentence cloud sizes follow a power law or some other simple distribution? More of the sentences in a larger sentence cloud can be translated, so some kind of Matthew effect is probably at work.
I've translated your message using Google Translate, so I'm not entirely sure I've got it right, but Google Translate thinks you're talking about "word clouds", which is probably something different from what I meant by "sentence clouds".
Sorry, a slip on my part. I've edited it: word -> sentence.
Cheers. I'm not sure about the distribution laws though... I believe the more common phrases will be translated more often, so I would expect to see some really common expressions in the largest clouds.
What else interests me is the "Chinese whispers" effect... Namely, sentences in the same language linked through translations into different languages - how far away can they be from each other?
Here are, for example, all English sentences in the biggest "I love you" cloud in no particular order:
("I love roads" stands out, and I also like how "I love you more than anything in this world" and "I used to have a little crush on you" are in the same cloud, so they are indirect translations of each other, maybe through a couple of languages)
And these are from the fourth cloud, "Unmöglich!":
"I love roads." comes via Finnish: in the sentence "Minä rakastan teitä", the last word, "teitä", is an inflected form of both the word "tie" and the word "te".
And because translate might have some difficulty with sentences in several languages and so on:
"Minä rakastan teitä." means "I love y'all/roads.", since "teitä" is a form of both the words "tie" (road) and "te" (plural you).
More interesting, however, might be cases where the meaning of a word shifts gradually because different languages use overlapping but different emotion concepts (for instance). One would think that some of the "Tom is [adjective]." sentences would lead to these. Perhaps "Tom is angry." or "Tom is sad." would lead to interesting chains of meaning caused by the Chinese whispers effect.
> "I love roads." comes via Finnish: in the sentence "Minä rakastan teitä", the last word, "teitä", is an inflected form of both the word "tie" and the word "te".
Ha-ha, thanks for the explanation!
Yeah, I do have a general understanding of how the principle of Chinese whispers works in translations and back-translations, but it fascinates me nevertheless!
I think you made a mistake computing the clouds, if you really want them to contain indirect translations at any level. Here's a path from #1434 to #1:
#1434 (eng): I love you.
#1661532 (jpn): 大好き。
#2823 (spa): Me gusta mucho.
#5634 (rus): Мне это очень нравится.
#1121355 (toki): ni li pona tawa mi.
#520388 (epo): Mi konsentas.
#5070 (jpn): 賛成です。
#726657 (pol): Zgoda.
#268108 (eng): All right.
#4726230 (spa): ¡Órale!
#241077 (eng): Let's go!
#1368157 (por): Vamos lá!
#1 (cmn): 我們試試看！
So #1434 and #1 should be part of the same cloud. According to my count, there are 402696 sentences in that cloud.
Thanks! Good point. I'll review my algorithm.
I've revised the script and it seems to be working properly now. I've also re-run it using the latest downloadables.
The first cloud is indeed a humongous one.
It consists of 405,231 sentences. The runner-up has "only" 2,987 sentences.
Here is the top-50:
The first cloud deserves special attention.
I wonder whether it exists solely because of the "Chinese whispers" effect (you translate a sentence into different languages, the translations get translated, etc., and because of all kinds of linguistic ambiguities the meaning can change significantly in a few iterations), or whether some sentences are linked incorrectly. Most probably both factors play some role, but hopefully the "Chinese whispers" effect is by far the most influential one.
It contains 29,434 English sentences, and every time I tried to select some random sentences from this cloud, they seemed so unrelated that it blows my mind that each one of them is an indirect translation of any other sentence from the list. Here is an example:
If you are curious how, for example, "Tom is beautiful" became "You know what I'm saying?", here is a path (there could be several though):
#7260496 Lang:eng Tom is beautiful. -> #1998311 Lang:deu Tom ist hübsch. -> #2609133 Lang:rus Том симпатичный. -> #2609145 Lang:fra Tom est sympathique. -> #1495932 Lang:deu Tom ist freundlich. -> #5261794 Lang:epo Tomo estas afabla. -> #3187087 Lang:ita Tom è premuroso. -> #6089104 Lang:epo Tomo estas zorgema. -> #2524634 Lang:ita Tom è diligente. -> #2236811 Lang:eng Tom is persevering. -> #4334330 Lang:mkd Том е упорен. -> #2203178 Lang:eng Tom is persistent. -> #5948874 Lang:ukr Том наполегливий. -> #5854379 Lang:eng Tom is pushy. -> #4439850 Lang:por Tom é agressivo. -> #4439655 Lang:swe Tom är aggressiv. -> #2202554 Lang:eng Tom is aggressive. -> #5421164 Lang:heb תום תוקפני. -> #3238507 Lang:eng He is aggressive. -> #3238503 Lang:ita È intraprendente. -> #313128 Lang:eng She is aggressive. -> #90584 Lang:jpn 彼女は気が強い。 -> #1300166 Lang:eng She's strong-willed. -> #865246 Lang:spa Ella es obstinada. -> #1327985 Lang:ita Lei è ostinata. -> #2203130 Lang:eng You're obstinate. -> #2305634 Lang:deu Sie sind starrsinnig. -> #4220362 Lang:fra Vous êtes obstinée. -> #5226243 Lang:epo Vi estas obstina. -> #7316102 Lang:deu Du bist zäh. -> #40603 Lang:eng You're tough. -> #5407208 Lang:heb אתם קשוחים. -> #2202937 Lang:eng You're harsh. -> #2401356 Lang:ita È dura. -> #869099 Lang:fra C'est difficile. -> #2267970 Lang:rus Это сложно. -> #2298741 Lang:toki ni li ike. -> #2111431 Lang:eng That's immoral. -> #8452386 Lang:yid דאָס איז אומגערעכט. -> #2111384 Lang:eng That's wrong. -> #8452389 Lang:yid דאָס איז נישט אמת. -> #796706 Lang:fra C'est pas vrai. -> #240209 Lang:eng You cannot be serious. -> #3443732 Lang:rus Ты это серьёзно? -> #373215 Lang:eng Are you serious? -> #81591 Lang:jpn 本気？ -> #1326 Lang:eng Are you sure? -> #1356753 Lang:eng Are you certain? -> #7312084 Lang:toki sina sona ala sona? -> #455452 Lang:deu Verstanden? -> #461080 Lang:pol Rozumiesz? -> #1360379 Lang:ukr Розумієш? -> #250819 Lang:eng You see what I mean? 
-> #1261509 Lang:fra Tu vois ce que je veux dire ? -> #3655940 Lang:kor 무슨 말인지 알겠니? -> #3655938 Lang:kor 유남생? -> #1853696 Lang:eng You know what I'm saying?
I don't know many of the languages on the list, so I have no idea whether it's a legitimate "Chinese whispers" path or whether there's some human error in there, but at least the transition from a statement to a question is legitimate: #240209 Lang:eng You cannot be serious. -> #3443732 Lang:rus Ты это серьёзно?
I can't tell whether the name Tom was lost legitimately; here is that bit: #2202554 Lang:eng Tom is aggressive. -> #5421164 Lang:heb תום תוקפני. -> #3238507 Lang:eng He is aggressive.
All the sentences from Cloud 1 are here:
This can be viewed online or downloaded as xls. It has a separate tab for sentences in English only.
If you're curious about how any two sentences from the list are linked with each other, please send me a PM or let me know here; I've got a program for that.
TOP 10 contributors for each of the top 100 languages
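For those who would rather try it themselves, a breadth-first search over the links is one straightforward way to find a shortest chain between two sentences. This is only a sketch of that general technique, not the program mentioned above; the function name and the sample IDs are made up.

```python
from collections import defaultdict, deque


def find_path(links, start, goal):
    """Return a shortest chain of sentence IDs from start to goal, or None."""
    graph = defaultdict(set)
    for a, b in links:
        # Treat every link as two-way, even if the mirror record is missing.
        graph[a].add(b)
        graph[b].add(a)

    previous = {start: None}  # also serves as the "visited" set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in graph[current]:
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None  # the two sentences are in different clouds


# Made-up sample: 1-2-3-4 is a chain, 5 hangs off 1.
sample_links = [(1, 2), (2, 3), (3, 4), (1, 5)]
chain = find_path(sample_links, 1, 4)
```

Breadth-first search returns one of the shortest chains; as noted above, there can be several chains of the same length between two sentences.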
Some names are mentioned more than once. The ones mentioned 3 times or more:
carlosalberto - 7 times;
Balamax - 6 times;
al_ex_an_der - 4 times;
Amastan - 4 times;
nonong - 4 times;
shekitten - 3 times;
FeuDRenais - 3 times;
Hans07 - 3 times;
Esperantostern - 3 times;
cueyayotl - 3 times;
Pfirsichbaeumchen - 3 times;
alexmarcelo - 3 times;
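The tally above is a simple count of how many per-language top-10 lists each name appears in. A minimal sketch, assuming the lists are available as a dictionary from language code to names; the function name and the sample data are invented for illustration:

```python
from collections import Counter


def repeated_names(top_lists, min_times=3):
    """Return (name, count) for names found in at least min_times lists."""
    counts = Counter(name
                     for names in top_lists.values()
                     for name in set(names))  # count each list at most once
    frequent = [(name, count) for name, count in counts.items()
                if count >= min_times]
    # Most frequent first, ties broken alphabetically.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Made-up sample with three tiny "top lists".
sample_lists = {
    "aaa": ["ann", "bob"],
    "bbb": ["ann", "cy"],
    "ccc": ["ann", "bob"],
}
frequent_names = repeated_names(sample_lists, min_times=2)
```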