papabear, April 30, 2011 at 6:55:21 AM UTC

I notice that when I translate Spanish sentences into English, my English translations often get a couple of translations of their own right away. I have a feeling that English might be the main conduit through which parallel sentence sets are made, but what if we could measure that empirically? Is there a way to determine which languages have the most direct translations?

sacredceltic, April 30, 2011 at 9:03:41 AM UTC

Of course, it will be English, because it has the most sentences, thanks to the import from the Tanaka corpus.
So this statistic should be relevant only as a proportion of the number of sentences.
And in this case, Esperanto will win, hands down...
Many languages would not be linked at all on Tatoeba if Esperanto wasn't there.

Zifre, April 30, 2011 at 4:24:33 PM UTC

I've done a couple of the calculations by hand (later I'll release full stats), and so far, Latin wins hands down! Of course, Esperanto has done much more to link the corpus together, because Latin doesn't even have 1000 sentences.

Swift, April 30, 2011 at 12:13:00 PM UTC

Well, Trang and sysko would be better suited to make database queries to check this out, but if you just want an overview of recent activity, you can go to a recent page in the language browser and just count up the number of languages with direct translations. Here are a few from pages 5-7:

English: 36
Japanese: 35
Esperanto: 33
French: 36
German: 37

So, these seem pretty similar, at about 1.1-1.2 linked languages per sentence. English had the greatest variation, with a number of sentences without any translations while others were heavily linked.

The religious may thus pick whichever measure supports their faith.

Swift, May 1, 2011 at 2:50:36 AM UTC

I had a little script scan a small portion of the database (five hundred lines near the 800 000th line of sentences.csv). It counted the number of linked languages which I then averaged to get:

heb(16): 1
pol(115): 1.01
ina(66): 1.02
epo(77): 1.12
uig(12): 1.17
fra(41): 1.2
TOTAL(501): 1.23
ita(20): 1.3
deu(29): 1.31
rus(46): 1.48
por(16): 1.69
eng(18): 1.72
lat(15): 2.07

The number in parentheses is the number of sentences in that language in that range (I've omitted a few of the less common languages).

As in Zifre's results, Latin sits at the top with an average of over two linked languages per sentence, well above the overall average of 1.23 linked languages per sentence.

I'm now running the script on a part near the 845 000th line for comparison.
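
For readers who want to reproduce this kind of count, here is a minimal Python sketch. It is not Swift's script (that was a shell script and was not posted in the thread). It assumes the dumps are tab-separated, with `sentences.csv` holding id, language, text and `links.csv` holding sentence id, translation id; the function name `avg_linked_languages` and the sample rows are invented for illustration. Whether the original script counted links to the same language is not stated, so this sketch counts every linked language as-is.

```python
import csv
import io
from collections import defaultdict

def avg_linked_languages(sentences_tsv, links_tsv, first_id=None, last_id=None):
    """Per language: (sentences in window, average distinct linked languages per sentence).

    Only sentences with first_id <= id <= last_id are averaged (the "window"),
    but their links may point anywhere in the corpus.
    """
    lang_of = {}
    for row in csv.reader(io.StringIO(sentences_tsv), delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:
            lang_of[int(row[0])] = row[1]

    linked = defaultdict(set)  # sentence id -> set of languages it is directly linked to
    for row in csv.reader(io.StringIO(links_tsv), delimiter="\t", quoting=csv.QUOTE_NONE):
        if row:
            a, b = int(row[0]), int(row[1])
            if a in lang_of and b in lang_of:
                linked[a].add(lang_of[b])

    totals = defaultdict(lambda: [0, 0])  # lang -> [sentence count, sum of linked languages]
    for sid, lang in lang_of.items():
        if first_id is not None and sid < first_id:
            continue
        if last_id is not None and sid > last_id:
            continue
        totals[lang][0] += 1
        totals[lang][1] += len(linked.get(sid, ()))
    return {lang: (n, round(total / n, 2)) for lang, (n, total) in totals.items()}

# Made-up sample data; links are stored in both directions.
sentences = "\n".join(["1\teng\tHello.", "2\tfra\tBonjour.", "3\tdeu\tHallo.",
                       "4\teng\tHi.", "5\tfra\tSalut."])
links = "\n".join(["1\t2", "2\t1", "1\t3", "3\t1", "4\t2", "2\t4", "1\t5", "5\t1"])
stats = avg_linked_languages(sentences, links)
for lang, (n, avg) in sorted(stats.items(), key=lambda item: item[1][1]):
    print(f"{lang}({n}): {avg}")
```

Passing `first_id` and `last_id` restricts the average to a block of ids, which is roughly what scanning a few hundred lines near one position of the file amounts to.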

Swift, May 1, 2011 at 8:28:48 PM UTC

After checking a little over 2500 newer sentences:

wuu(36): 0.83
eus(48): 0.88
tur(168): 0.99
ara(28): 1
heb(18): 1
pol(228): 1
spa(902): 1
isl(79): 1.01
ina(66): 1.02
nds(345): 1.04
ita(107): 1.07
fra(173): 1.09
epo(147): 1.12
nld(59): 1.15
deu(81): 1.17
uig(12): 1.17
TOTAL(3078): 1.22
ces(18): 1.22
rus(79): 1.28
eng(80): 1.34
glg(45): 1.36
cat(54): 1.81
por(19): 1.89
lat(15): 2.07
cmn(250): 2.87

The different parts of the corpus that I've been checking seem to be quite different depending on which mass-contributor was active at that time.

Chinese has jumped high on the list, but most other languages have fallen, as the newer entries don't seem to have as many links. That isn't necessarily what one would expect: the older sentences have been around for longer, but the younger ones were exposed to more contributors when they were added.

So, essentially, sentences are pretty poorly interlinked. It will, however, probably be worth waiting until the new database is on-line for systematically linking related sentences.

ludoviko, May 1, 2011 at 11:37:55 PM UTC

I would love to contribute by linking sentences. But I probably won't add many links or many sentences before the long-awaited new version of Tatoeba. I am waiting for the possibility to see all linked sentences, directly or via other languages. This will help a lot to avoid duplicate translations, which are a serious waste of time.

Swift, May 1, 2011 at 11:53:53 PM UTC

Yes, the new database is going to change a lot of things. Just imagine: 13 000 duplicates since the script was last run. That's a good full day's work, even for a fast typist and a responsive webserver! On the other hand, it's not only time lost. Since the duplicate script merges link information, it will often contribute something (unless the duplicate added wasn't translated into any other language than the original was).

Translating more complex sentences should still be pretty safe.

ludoviko, May 2, 2011 at 10:57:22 AM UTC

A gigantic waste of time

If it takes one minute to translate one sentence, then 13 000 duplicates thrown away means 13 000 minutes thrown away. That is about 216 working hours, more than one month of work. If it takes two minutes to translate a sentence, it's two months' work by the Tatoeba contributors. That is too much time just not to think about.
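
The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the eight-hour working day and the helper name `wasted_time` are assumptions that do not appear in the thread.

```python
def wasted_time(duplicates, minutes_each, hours_per_day=8):
    """Return (hours, working days) spent on translations that turned out to be duplicates."""
    hours = duplicates * minutes_each / 60
    return hours, hours / hours_per_day

for minutes_each in (1, 2):
    hours, days = wasted_time(13_000, minutes_each)
    print(f"{minutes_each} min/sentence: {hours:.0f} hours, about {days:.0f} working days")
```

At one minute per sentence this gives a little under 217 hours, or about 27 eight-hour days, which is consistent with "more than one month of work" if a month has roughly 21 to 22 working days.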

I can't see that the time of the contributors is respected by the owners of the site.

MUIRIEL, May 2, 2011 at 11:08:20 AM UTC

If you translate a sentence with a duplicate that later gets deleted, the effort wasn't completely wasted, because you created a new link! So yes, linking 2 existing sentences takes less time than writing one of them, but the work is not "thrown away".

sacredceltic, May 2, 2011 at 12:34:13 PM UTC

>If you translate a sentence with a duplicate that later gets deleted, the effort wasn't completely wasted, because you created a new link! So yes, linking 2 existing sentences takes less time than writing one of them, but the work is not "thrown away".

Exactly! And the author remains the "owner" of the link, so the merit is not lost, either.

ludoviko, May 3, 2011 at 8:13:05 AM UTC

Wonderful. Only half the time wasted, maybe only half a month ;-|

Martha, May 3, 2011 at 8:47:39 PM UTC

I can't agree with you more.

Hans07, May 2, 2011 at 11:21:15 AM UTC

I contributed to that waste of time.
- Because I didn't notice that an (identical) translation already existed.
- Because I cannot remove my duplicate once I finally notice it.
I hope a solution will come soon.

sacredceltic, May 2, 2011 at 1:45:17 AM UTC

Given the enormous differences in results between the extracts you made (Chinese is not even present in the 1st result and is on top in the 2nd...), they just prove that small extracts of the database are simply irrelevant. I suggest you're just a lousy statistician...

Swift, May 2, 2011 at 2:09:39 AM UTC

Whereas you would make a great economist! ;-)

sacredceltic, May 2, 2011 at 2:14:38 AM UTC

>Whereas you would make a great economist! ;-)

Well, at least I learnt what a relevant sample is.

Swift, May 2, 2011 at 2:20:26 AM UTC

But have you stopped beating your wife?

sacredceltic, May 2, 2011 at 2:25:56 AM UTC

>But have you stopped beating your wife?

@Admin: I find this personal accusation on Tatoeba's wall absolutely despicable. It is absolutely unacceptable from a moderator.
I demand an apology and the removal of Swift's status.

Swift, May 2, 2011 at 2:36:41 AM UTC

It's not an accusation. It's a question with an implication. Similar to the implication above that I didn't know what a relevant sample is.

I do apologise for not realising that you were so ignorant of this quintessential example of this rhetorical device. I did expect more from someone like you.

I furthermore admit that I should have known better than to assume you'd get the reference. For that I apologise to the community, not you.

sacredceltic, May 2, 2011 at 2:45:36 AM UTC

>It's not an accusation. It's a question with an implication. Similar to the implication above that I didn't know what a relevant sample is.
There is an enormous difference!

1) My accusation has to do with your SKILLS as applied to Tatoeba's stats,
whereas your accusation is PERSONAL and has nothing to do with Tatoeba AT ALL.

2) It is a fact that your samples are meaningless and that proves you know nothing of statistics.
That I "beat my wife" is just a simple disgusting accusation based on no data or evidence whatsoever. It is just plain despicable.
Next time, you'll probably call me a pedophile. It's just sickening! You are the shame of Tatoeba to throw such accusations at contributors. Shame on you!

FeuDRenais, May 2, 2011 at 2:54:17 AM UTC

I'm just curious:

What would a "relevant sample" be, technically, in this case? Isn't a purely random sample relevant?

sacredceltic, May 2, 2011 at 3:00:42 AM UTC

>What would a "relevant sample" be, technically, in this case? Isn't a purely random sample relevant?

The proof it was irrelevant was in the pudding...Look at the numbers that were published: They're completely contradictory.
Probably, a relevant sample would imply taking extracts from different days and hours, from the beginning of Tatoeba up to now, instead of taking them all from the same day or days, since there will always be calendar and time biases, as different countries are in different timezones and have different holidays...
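
To make the sampling options in this exchange concrete, here is a small Python sketch of three ways to pick sentence ids: one contiguous block (what scanning a few hundred lines near one position of the file does), a simple random sample, and a systematic sample spread over the whole id range. The function names are invented for illustration, and the sketch does not settle which sample is "relevant"; it only shows what each one draws from.

```python
import random

def contiguous_window(ids, start, size):
    """A block of consecutive ids, e.g. the ids found near one position of the dump."""
    ordered = sorted(ids)
    return ordered[start:start + size]

def random_sample(ids, size, seed=0):
    """A simple random sample drawn from the whole id range (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(ids), size))

def spread_sample(ids, size):
    """A systematic sample: every k-th id, so every period of the corpus is represented."""
    ordered = sorted(ids)
    step = max(1, len(ordered) // size)
    return ordered[::step][:size]

ids = range(1, 1001)
print(contiguous_window(ids, 800, 5))  # five neighbouring ids from one period
print(spread_sample(ids, 5))           # five ids spread from the oldest to the newest
print(random_sample(ids, 5))           # five ids drawn at random
```

Because sentence ids grow over time, the last two approaches mix sentences from many days and hours, while the first reflects whoever was contributing at that moment.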

FeuDRenais, May 2, 2011 at 3:04:13 AM UTC

> The proof it was irrelevant was in the pudding...Look at the numbers that were published: They're completely contradictory.

Q.E.D. (?)

Tellin' ya, the future is written in Latin...

FeuDRenais, May 2, 2011 at 3:08:00 AM UTC

I found a flaw in your proof:

Average links per language is robust (yes, "robust") against all these biases you talk of. The only biases come from the users themselves, I think. A biased sample would probably have one user who could link and knew several languages. But there's no way to really control that.

We should have Swift scan the whole database.

sacredceltic, May 2, 2011 at 3:11:57 AM UTC

>We should have Swift scan the whole database.

Zifre did that already (except he didn't divide the numbers, and he didn't explain what he counted...), and Twifs is busy frantically smearing contributors with accusations for various crimes...

Zifre, May 2, 2011 at 3:27:12 AM UTC

> except he didn't divide the numbers

Because I don't know how to do it with SQL, and I'm too lazy to do it by hand for all 88 languages.

If you have the skill or the patience, please do it yourself.

> and he didn't explain what he counted

You're right; I should have explained it. It's simply the number of direct links for sentences in each language.

FeuDRenais, May 2, 2011 at 3:32:13 AM UTC

I'll just go and do it.

And I'll make sure English doesn't end up #1.

FeuDRenais, May 2, 2011 at 3:44:04 AM UTC

Actually, this is a pain... I forgot that the languages are not in the same order as by sentence quantity.

I'll just give the simplified version below:

Esperanto: 1.72
English: 2.49
Faroese: 11.25

Faroese >> all

Case closed.

sacredceltic, May 2, 2011 at 9:03:05 AM UTC

>Case closed.

Certainly not. What has been counted are links, not translations. Links within the same language should be removed before counting...

Zifre, May 2, 2011 at 1:27:41 PM UTC

True, but do you seriously think that anything is going to beat Faroese after taking that into account?

sacredceltic, May 2, 2011 at 1:31:51 PM UTC

>True, but do you seriously think that anything is going to beat Faroese after taking that into account?

So far, we've seen proclamations by the best statisticians on Earth nominating English, then Latin, then Chinese, and now Faroese... I'm ready for anything next!

FeuDRenais, May 2, 2011 at 5:50:12 PM UTC

Actually, even if we assume that all 4 (!) Faroese sentences are linked to one another, and that all the other languages have no self-links, it's still going to win by a landslide.
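
The worst-case argument above can be written out as a short calculation. This Python sketch assumes that the averages quoted earlier (11.25 for Faroese, 2.49 for English, 1.72 for Esperanto) count links per direction, as a query over a table of (a, b) rows would; the helper name `worst_case_average` is made up for illustration.

```python
def worst_case_average(avg_links, n_sentences):
    """Cross-language links per sentence if every possible same-language link existed."""
    total_links = avg_links * n_sentences
    # Each of the n sentences can link to the other n - 1 sentences in its own language.
    max_self_links = n_sentences * (n_sentences - 1)
    return (total_links - max_self_links) / n_sentences

faroese = worst_case_average(11.25, 4)   # 45 links in total, at most 12 of them Faroese-to-Faroese
print(faroese)
print(faroese > 2.49, faroese > 1.72)    # compared with English and Esperanto with no self-links removed
```

With 4 sentences and 45 links, removing the 12 possible same-language links still leaves 33 links, or 8.25 per sentence. A sample of 4 sentences is of course far too small to say anything about the language as a whole.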

Go, go, Faroese!

sacredceltic, May 2, 2011 at 3:35:17 AM UTC

>Because I don't know how to do it with SQL, and I'm too lazy to do it by hand for all 88 languages.

Just select the 1st column divided by the other if they are both numerical...

select a, b, a/b from table...depending maybe on your engine...

> It's simply the number of direct links for sentences in each language.

Excluding the same language? Because many sentences are linked to sentences in the same language as variants. These should be excluded from the count if you analyse the translation pivot function of languages...

Zifre, May 2, 2011 at 1:23:18 PM UTC

> Just select the 1st column divided by the other if they are both numerical...

It's a lot more complicated than that. Here's the SQL I used:

-- Count direct links per language (links to sentences in the same language are included).
select sentences.lang, count(links.a)
from sentences inner join links
on sentences.id = links.a
group by sentences.lang
order by count(links.a) desc;

where sentences has columns "id" and "lang", and links has "a" and "b".

If you know how to do this type of thing, please share. I really know nothing about relational databases.

> Excluding the same language?

No. I hadn't thought about that. But I doubt it will make a huge difference. Sometime later this week I might try to do that.
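
One possible way to get the division done inside the database is sketched below in Python with the standard `sqlite3` module. It uses the table layout described in this post (`sentences` with "id" and "lang", `links` with "a" and "b"); the sample rows are made up, and the switch from an inner join to a LEFT JOIN is a choice made here so that sentences without any link still count in the denominator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT);
    CREATE TABLE links (a INTEGER, b INTEGER);
    -- Made-up sample data; each link is stored in both directions.
    INSERT INTO sentences VALUES (1,'eng'), (2,'fra'), (3,'deu'), (4,'eng'), (5,'eng');
    INSERT INTO links VALUES (1,2), (2,1), (1,3), (3,1), (4,2), (2,4);
""")

AVG_LINKS_SQL = """
    SELECT s.lang,
           COUNT(DISTINCT s.id) AS n_sentences,
           COUNT(l.a)           AS n_links,
           ROUND(1.0 * COUNT(l.a) / COUNT(DISTINCT s.id), 2) AS links_per_sentence
    FROM sentences AS s
    LEFT JOIN links AS l ON s.id = l.a   -- keeps sentences that have no links at all
    GROUP BY s.lang
    ORDER BY links_per_sentence DESC, s.lang
"""
rows = conn.execute(AVG_LINKS_SQL).fetchall()
for lang, n_sentences, n_links, ratio in rows:
    print(f"{lang}({n_sentences}): {n_links} links, {ratio} per sentence")
```

Multiplying by 1.0 forces a floating-point division; without it, SQLite divides two integers and truncates the result.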

sacredceltic, May 2, 2011 at 1:27:15 PM UTC

you must use aliases for your selections:

select this as a, that as b, a/b as c from...group by ...

The complexity of the from clause doesn't change this ability...

>No. I hadn't thought about that. But I doubt it will make a huge difference. Sometime later this week I might try to do that.

You would be surprised. Many people link variants in the same language. I sometimes do myself, but I know contributors who do it systematically.
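
Both points in this post (dividing through aliases and leaving out same-language links) can be combined in one query. The sketch below uses Python's `sqlite3` module with made-up rows. Several engines, PostgreSQL and MySQL among them, reject a reference to an alias from within the same select list, so here the aliases are defined in an inner query and divided in the outer one; this is one portable arrangement, not necessarily what either poster had in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT);
    CREATE TABLE links (a INTEGER, b INTEGER);
    -- Made-up data: sentence 2 is an English variant of sentence 1.
    INSERT INTO sentences VALUES (1,'eng'), (2,'eng'), (3,'fra');
    INSERT INTO links VALUES (1,2), (2,1), (1,3), (3,1);
""")

CROSS_LANGUAGE_SQL = """
    SELECT lang, n_links, n_sentences,
           ROUND(1.0 * n_links / n_sentences, 2) AS ratio
    FROM (
        SELECT s.lang               AS lang,
               COUNT(DISTINCT s.id) AS n_sentences,
               COUNT(t.id)          AS n_links   -- only targets in another language match
        FROM sentences AS s
        LEFT JOIN links AS l ON l.a = s.id
        LEFT JOIN sentences AS t ON t.id = l.b AND t.lang <> s.lang
        GROUP BY s.lang
    ) AS per_lang
    ORDER BY ratio DESC, lang
"""
rows = conn.execute(CROSS_LANGUAGE_SQL).fetchall()
print(rows)
```

In this toy data the English-to-English variant link is ignored, so English drops to one cross-language link for two sentences.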

sacredceltic, May 2, 2011 at 3:21:47 AM UTC

>Average links per language is robust (yes, "robust") against all these biases you talk of.

I think you're wrong, for the following reason: translations of a sentence tend to come in bursts, owing to the "latest contributions" page from which many contributors operate, translating sentences on the fly. So if you take a sample created at night CET, chances are you'll find more sentences from Asian languages translated, and if you take one from daytime CET, you'll find more European languages translated. The same goes for holidays...

FeuDRenais, May 2, 2011 at 3:25:05 AM UTC

Yes, but how does all that relate to what languages people choose to translate to and from?

If Chinese contributors only translate from English, it isn't going to change the ratio whether there's 5 of them or 500...

Anyway, we need a neutral party to look at this. Zifre's clearly an Imperialist, so any list he provides will inevitably have English on top.

sacredceltic, May 2, 2011 at 3:39:15 AM UTC

>Yes, but how does all that relate to what languages people choose to translate to and from?

Because, as I explained earlier, since many contributors tend to translate on the fly from the log, Asiatics tend to translate Asiatics more, Americans, Americans, and so on... That introduces a time bias.

>Anyway, we need a neutral party to look at this. Zifre's clearly an Imperialist, so any list he provides will inevitably have English on top.

We should have somebody who understands the problem, doing it...

FeuDRenais, May 2, 2011 at 3:46:40 AM UTC

> since many contributors tend to translate on the fly from the log, Asiatics tend to translate Asiatics more, Americans, Americans, and so on...

I don't know what statistical evidence you're basing this on, but that's not how I translate. And I don't think that's how you translate either. But maybe you're right.

sacredceltic, May 2, 2011 at 8:59:08 AM UTC

>I don't know what statistical evidence you're basing this on, but that's not how I translate. And I don't think that's how you translate either. But maybe you're right.

That's experience. Every time I create a sentence, I see translations coming immediately. Further translations are far slower to come... Look at the logs.

Swift, May 4, 2011 at 8:46:51 PM UTC

Latest numbers:

nob(218): 1
pol(491): 1
heb(95): 1.01
hun(166): 1.01
isl(171): 1.01
pes(562): 1.02
fin(73): 1.03
wuu(50): 1.04
ita(373): 1.06
spa(1147): 1.07
ina(239): 1.08
nds(532): 1.13
rus(508): 1.14
uig(78): 1.19
tur(268): 1.21
TOTAL(10053): 1.27
por(93): 1.28
epo(590): 1.29
fra(898): 1.42
nld(434): 1.44
deu(701): 1.48
cmn(1522): 1.49
cat(54): 1.81
eng(361): 1.82

Further information on the data and methodology omitted on purpose but readily available to anyone who asks along with the datafiles.

sacredceltic, May 4, 2011 at 9:07:50 PM UTC

You enjoy ridiculing yourself, don't you? Are you a masochist?

ludoviko, May 7, 2011 at 6:42:57 PM UTC

Trying to find a good measure for linkedness

Is it really a good idea not to consider how big the tree is which contains a certain sentence? I mean, if there are just two sentences linked, they both have only one link - but this is 100% of what those sentences can have while there is no other translation in a third language.

But if you have a simple chain of three sentences linked by two (bidirectional) links, you have two sentences with one link each and one sentence with two links, a sum of four links. But there could be 3 x 2 = 6 links, since every sentence could have two links. So for this chain of three sentences it's only 67% of the maximum.

If, for instance (I don't know about reality), Latin were mainly in trees having a lot of sentences while Polish were mainly in trees having only two or three sentences, the percentage of the achievable maximum might be in favour of Polish.

Probably there are some approximate relations (if everything else is equal):

- The better known a language is, the more links it has (as a percentage of the achievable maximum).

- The "older" a sentence is, the more links it has.

By the way: Do we know how many trees (graphs) there are? Or even: How many "lonely" sentences and how many trees of two/three/four sentences etc.?
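
The measure proposed in this post, and the question about how many groups of each size exist, can be sketched in Python as follows. The code treats links as undirected when grouping sentences and counts them per direction when comparing against the n x (n - 1) maximum, which matches the three-sentence example above; the function names and the sample data are invented for illustration.

```python
from collections import Counter, defaultdict

def link_components(sentence_ids, links):
    """Group sentences into connected components of the link graph."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in sentence_ids:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:  # depth-first walk over everything reachable from `start`
            node = stack.pop()
            members.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(members))
    return components

def link_density(component, links):
    """Links present (counted in both directions) divided by the n*(n-1) possible."""
    n = len(component)
    if n < 2:
        return None  # a lonely sentence has no achievable links
    members = set(component)
    present = {(a, b) for a, b in links if a in members and b in members and a != b}
    present |= {(b, a) for a, b in present}
    return len(present) / (n * (n - 1))

ids = [1, 2, 3, 4, 5, 6]
links = [(1, 2), (2, 3), (4, 5)]          # a chain of three, a pair, one lonely sentence
components = link_components(ids, links)
size_histogram = Counter(len(c) for c in components)
for c in components:
    print(c, link_density(c, links))
```

The chain of three comes out at 4/6, the linked pair at 1.0, and the histogram reports one group each of sizes 3, 2 and 1.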

Swift, May 7, 2011 at 8:34:09 PM UTC

“Is it really a good idea not to consider how big the tree is which contains a certain sentence?”

Well, in my case it was, and for three reasons:

1) It's more computationally intensive. It'd be interesting to get some information about graph sizes, but it'd be more difficult to interpret because...

2) Two sentences from a graph may not be accurate translations of each other, so the size of the graph can represent both the number of same or similar sentences, as well as ambiguities.

3) The graph size can say something about the potential links, but that's not what I was going for.

“By the way: Do we know how many trees (graphs) there are? Or even: How many "lonely" sentences and how many trees of two/three/four sentences etc.?”

No. Well, I don't. I've only looked at the number of languages that a sentence is directly linked to.

I might have a look at it later, but as I've mentioned, I've so far just been playing around and most of the stuff I've released has been preliminary stuff for general interest.

Zifre, April 30, 2011 at 3:17:03 PM UTC

I've imported the Tatoeba data dumps into an SQLite database, and here are my stats:

http://dl.dropbox.com/u/2124822...kData/data.txt

I'll leave the interpretation up to others.

Later, I might do a bit more analysis with specific language pairs. I might also do it proportionally to the number of sentences in each language, as suggested by sacredceltic.

sacredceltic, May 2, 2011 at 1:41:18 AM UTC

These stats are meaningless unless you explain what you count and scale the numbers by the number of sentences in each language; otherwise, it is just an umpteenth way of positioning English on top of a list... Well done!

FeuDRenais, May 2, 2011 at 1:51:35 AM UTC

But he DIDN'T position English on top. I think it's been concluded that the language of the future is Latin.

sacredceltic, May 2, 2011 at 2:02:46 AM UTC

>But he DIDN'T position English on top.

YES he DID: "eng" stands for English...
I'm responding to Zifre, so I'm referring to the list he provided as a link in the message just above mine... Unless the link varies according to the reader, "eng" is at the top...

FeuDRenais, May 2, 2011 at 2:16:37 AM UTC

Oh, this one. Well, you can just divide each one by the number of sentences in that language, can't you? Does Esperanto win if you do that?

Swift, May 2, 2011 at 2:15:45 AM UTC

That's right, Zifre! If you're going to contribute some of your personal free time to shed some light on a question raised, then you better damn well do it to everyone's perfect satisfaction (especially without having to think about it or ask politely)!

Oh, and if by any chance English lands at the top of a list -- whatever its nature -- it's certainly not a coincidence or a natural consequence of it having by far the most sentences in the corpus. There is something sinister behind this and ... http://tatoeba.org/eng/sentences/show/1819

sacredceltic, May 2, 2011 at 2:17:43 AM UTC

>a natural consequence of it having by far the most sentences in the corpus.

Which I pointed out in my first response in this thread...

sysko, May 2, 2011 at 6:09:43 PM UTC

Looking at all the answers and numbers given, maybe it would be better first to define some units together (I don't think one is enough), along with the formulas used to obtain them, before giving numbers, and to propose numbers only once most people agree on these formulas.

I don't have time to propose formulas, but I think we do need the following (all of these, not only one):

1 - the proportion of sentences in that language compared to the whole database
2 - the average number of links on the sentences of a given language
3 - the average number of links on these sentences, excluding links to the same language (Chinese to Chinese)
4 - the same, but counting only one link per language (for example, 10 links to 10 Chinese translations count only once)

5 - maybe more complicated: count only links that were made X minutes/hours/days after the sentence was added

After that, propose graphs of how these numbers evolve over time (for example, taking only the second half of the sentences, or a graph grouped by sets of 5000 sentences).
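
As a starting point for agreeing on formulas, units 1 to 4 from the list above could be computed along these lines. This Python sketch is only one reading of the proposal: it assumes links are given as directed (a, b) pairs, and the function name `language_metrics`, the dictionary keys and the sample data are all invented. Unit 5 is left out because it needs link timestamps, which this sketch does not assume are available.

```python
from collections import defaultdict

def language_metrics(lang_of, links):
    """lang_of: {sentence_id: lang}; links: iterable of directed (a, b) id pairs."""
    targets = defaultdict(list)  # sentence id -> languages of the sentences it links to
    for a, b in links:
        if a in lang_of and b in lang_of:
            targets[a].append(lang_of[b])

    by_lang = defaultdict(list)
    for sid, lang in lang_of.items():
        by_lang[lang].append(sid)

    total = len(lang_of)
    metrics = {}
    for lang, sids in by_lang.items():
        n = len(sids)
        all_links = sum(len(targets.get(s, [])) for s in sids)
        cross = sum(1 for s in sids for t in targets.get(s, []) if t != lang)
        distinct = sum(len({t for t in targets.get(s, []) if t != lang}) for s in sids)
        metrics[lang] = {
            "share": n / total,                  # unit 1
            "avg_links": all_links / n,          # unit 2
            "avg_cross_links": cross / n,        # unit 3
            "avg_linked_langs": distinct / n,    # unit 4
        }
    return metrics

# Made-up data: sentence 1 (eng) has an English variant, two Mandarin and one French translation.
lang_of = {1: "eng", 2: "eng", 3: "cmn", 4: "cmn", 5: "fra"}
links = [(1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1), (1, 5), (5, 1)]
metrics = language_metrics(lang_of, links)
print(metrics["eng"])
```

In the sample, English averages 2.5 links per sentence, 1.5 once same-language links are dropped, and 1.0 once the two Mandarin translations count as a single linked language, which shows how much the choice of unit matters.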

Otherwise the debates are not organized enough, and soon you will run out of arguments, guys, and we will need to find something else. Think long term.

sysko, May 2, 2011 at 6:12:16 PM UTC

More seriously, the goal is to have a common way to get these numbers; this way everyone will be able to redo the calculation on his own computer if he doesn't trust the result of someone else.
It will also be easier for me and Trang if we want to implement these for official stats.

Of course, afterwards the fun part will be interpreting these numbers.

Swift, May 2, 2011 at 6:24:32 PM UTC

> if he doesn't trust the result of someone else

:-D Oh, lord!

sysko, May 2, 2011 at 6:33:18 PM UTC

And my eternal gratitude to those who will use these numbers to answer "looking at these numbers, how can we improve the system so that people will link more?"

ludoviko, May 22, 2011 at 10:07:57 AM UTC

If you want people to link more, then put the number of links someone has added in his STATS, next to "Sentences owned". People like to compete a little bit :-)

MUIRIEL, May 2, 2011 at 6:20:44 PM UTC

+1 for point 4. I wondered if anyone would ever think about that^^.

sysko, May 2, 2011 at 6:26:54 PM UTC

if you have other wonders, you know how to find me :P

MUIRIEL, May 2, 2011 at 6:35:28 PM UTC

no, this was the last wonder in my life.

sacredceltic, May 2, 2011 at 8:22:46 PM UTC

>no, this was the last wonder in my life.

The eighth wonder...

PS: I agree that it would be an outrageous luxury to have this option!

Swift, May 2, 2011 at 6:37:54 PM UTC

Yes, someone did.[1] I didn't think of weeding out links to sentences in the same language, though.

If anyone is actually interested (preferably not pathologically so) in my little scriptlet, I'd be more than happy to send it or the data I've compiled to them. But beware that it's just a shell script and utterly unoptimised at that.

[1] “It counted the number of linked languages”; http://tatoeba.org/eng/wall/sho...5#message_5945

MUIRIEL, May 2, 2011 at 6:44:08 PM UTC

Ok, I skipped that.

sacredceltic, May 22, 2011 at 11:40:27 AM UTC

For would-be statisticians interested in the translation rates of sentences in the different languages on Tatoeba, I built the following sheet, which simply presents the raw data for the 16 main languages on Tatoeba, with the percentages of translation from each into the others.
This makes it easy to spot the efforts that have been made and those that remain to be made.
For the moment, only the first 5 languages (the first 5 rows) have been completely analysed, but the rest is under way in real time, so you can check the sheet again later to see the additions. Numbers that are not yet calculated are in italics.

https://spreadsheets.google.com...JNl92Wnc&hl=fr
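
How the percentages in the sheet were computed is not described in the thread. One plausible definition, the share of sentences in a source language that have at least one direct translation in a target language, is sketched below in Python. The function name `translation_rates` and the sample data are invented, and the real sheet may have been built differently.

```python
from collections import defaultdict

def translation_rates(lang_of, links):
    """rates[src][dst] = % of src-language sentences with at least one direct dst translation."""
    covered = defaultdict(set)  # (src, dst) -> ids of src sentences that have a dst translation
    for a, b in links:
        if a in lang_of and b in lang_of and lang_of[a] != lang_of[b]:
            covered[(lang_of[a], lang_of[b])].add(a)

    counts = defaultdict(int)   # number of sentences per language
    for lang in lang_of.values():
        counts[lang] += 1

    rates = defaultdict(dict)
    for (src, dst), ids in covered.items():
        rates[src][dst] = round(100 * len(ids) / counts[src], 1)
    return rates

# Made-up data: three Mandarin sentences, two with English and one with French translations.
lang_of = {1: "cmn", 2: "cmn", 3: "cmn", 4: "eng", 5: "eng", 6: "fra"}
links = [(1, 4), (4, 1), (2, 5), (5, 2), (1, 6), (6, 1)]
rates = translation_rates(lang_of, links)
print(rates["cmn"])
```

Note that the table is not symmetric: in the sample, two thirds of the Mandarin sentences have an English translation, while every English sentence has a Mandarin one.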

sacredceltic, May 23, 2011 at 3:20:41 PM UTC

I have completed the sheet.

sysko, May 23, 2011 at 4:00:28 PM UTC

Unfortunately, Google-affiliated sites are blocked in China. I'll try to reconfigure my VPN access as soon as possible.

sacredceltic, May 23, 2011 at 4:10:33 PM UTC

Well, you'll discover some bad news: 66% of the Mandarin sentences on Tatoeba are translated into English, but only 38% into French...
Conversely, only 34% of the French sentences are translated into Mandarin...
You need to recruit Mandarin-to-French translators asap!

sysko, May 23, 2011 at 4:20:19 PM UTC

Argh, that's Martha's and nickyeow's fault. For the Mandarin-French pair, I'll try to get back into the habit of at least 10 translations per day.

sacredceltic, May 23, 2011 at 3:59:43 PM UTC

(Translated from Esperanto) For statisticians interested in the translation percentages of sentences on Tatoeba, I made the following table, which simply presents the raw data for the first 16 languages on Tatoeba, with the translation percentages from each into the others.
This makes it easy to see the efforts that are being made and those that remain to be made.

https://spreadsheets.google.com...JNl92Wnc&hl=fr