Tatoeba
TRANG, May 3, 2015 at 10:06:54 PM UTC (edited May 4, 2015 at 8:08:14 AM UTC)

I've implemented more detailed stats and I'd like some people to test these changes on the dev website (http://dev.tatoeba.org).

1) The stats for sentences per language.
https://dev.tatoeba.org/eng/sta...es_by_language (edit: corrected link)
You can access them from the homepage by clicking on "show all language" below the list of the top 5 languages.

2) The stats for the languages of the members, which is a new page.
http://dev.tatoeba.org/eng/stats/users_languages
On the dev site, this page now shows up when you click "Members" > "Languages of members" in the menu.

First, I'd like to know how you understand these stats. For instance, what do you think these numbers mean?
- http://prntscr.com/717tto
- http://prntscr.com/717ucp

Second, I'd like you to add and remove languages in your profile, check whether the stats update as expected, and report any problems to me.

Thank you.

gillux, May 4, 2015 at 3:50:47 AM UTC

Nice improvement!

> 1) The stats for sentences per language.
> http://dev.tatoeba.org/eng/stats/users_languages

You mean https://dev.tatoeba.org/eng/sta...s_by_language, right?
I personally like the colored bust to represent “admin”, “corpus maintainers” etc. in the column header, but I guess it’s a no-go for colorblind users.
It would make more sense to move the link to show_all_in/<lang> to the sentence number cell, so that you’re brought to the list of sentences when clicking on the sentences number rather than the language name.
Sortable columns would be super useful.

> 2) The stats for the languages of the members, which is a new page.
The column titles “5, 4, 3, 2, 1, ?” are *very* cryptic; I think we should totally change them. The meaning of a number itself differs among languages (for instance, Japanese uses 1 for the highest level, while others use 1 for the lowest level). It’s also confusing because the table mixes cardinal numbers for sentences and ordinal numbers for levels, which are two totally different things yet represented by the same symbols.

Again, sortable columns would be super useful.

tommy_san, May 4, 2015 at 4:14:22 AM UTC

> Japanese uses 1 for the highest level
Oh, is that so?

gillux, May 4, 2015 at 9:56:35 AM UTC

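Huh? Since 1-kyū (一級) and 1-dan (一段) are the highest, I thought Japanese people saw it that way. It's the same for the language proficiency test, isn't it?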

tommy_san, May 4, 2015 at 12:34:43 PM UTC

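It's true that for kyū (級), 1-kyū is the highest, but isn't dan (段) different?
http://en.wikipedia.org/wiki/Rank_in_Judo
Besides, given "Level 1" and "Level 2", I think most people would feel that "Level 2" is the higher one.
I can't speak for other cultures, though.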

At least, it's clear to me what these numbers mean. I find the background colors also nice and intuitive.
Perhaps you could add a heading above like "Self-assessed levels" and mouseover texts "Native", "Fluent", etc. That would be enough in my opinion.

CK, May 4, 2015 at 1:24:27 PM UTC (edited October 30, 2019 at 7:52:19 AM UTC)

[not needed anymore- removed by CK]

Ooneykcall, May 5, 2015 at 11:22:20 PM UTC

> However, I would also assume that many people's self-assigned "Level 4" might be higher than other people's self-assigned "Level 3" or even "Level 2."
Guess you meant lower?
Anyway, it's true there is a huge discrepancy, as some people are remarkably balder (bolder?) about their language proficiency than others. Perhaps we should have a rough guideline concerning what each level is supposed to be like.

pullnosemans, May 4, 2015 at 9:47:39 AM UTC

I'm kind of wondering what relevance the number of admins, corpus maintainers, etc. has to the number of sentences in the respective languages. Wouldn't that statistic fit better with the new "languages of members" page instead of the "number of sentences" page?

AlanF_US, May 4, 2015 at 12:51:53 PM UTC

I absolutely agree. It took me about 30 seconds to figure out what was going on. My first guess was that the figures in those columns listed the number of sentences added by admins, corpus maintainers, etc. since you started collecting statistics on the dev site.

PaulP, May 4, 2015 at 10:15:53 AM UTC

> First, I'd like to know how you understand these stats. For instance, what do you think these numbers mean?
> - http://prntscr.com/717tto

I have no idea. For my language Dutch it shows 1 0 0 0. That is double Dutch to me :P

User55521, May 4, 2015 at 1:02:48 PM UTC (edited May 4, 2015 at 1:03:17 PM UTC)

The 'progress bars' are completely useless for most languages in their current form (only a quarter of the languages have a progress bar that actually shows something). Maybe you could use the cube root of the number of sentences for the progress bar instead of the exact number, or otherwise tweak it, to make it more useful?

Lepotdeterre, May 4, 2015 at 4:29:24 PM UTC

I also agree, the progress bars are absolutely impractical. However, using cube root progress bars would be even worse, because each subsequent sentence added for a particular language would contribute less. So, for the first 64 sentences, one would have 4 points on the progress bar, and after the next 64, only about 5.04? That doesn't make any sense. Everything would be distorted and comparisons would be meaningless. The only thing I can suggest is making the bars much wider, and there is certainly enough space for that (at least in my browser).
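
To make the trade-off in this exchange concrete, here is a minimal Python sketch comparing linear and cube-root bar lengths. The function name `bar_lengths`, the `max_width` of 100 and the sentence counts are all made up for illustration; this is not how the Tatoeba page is actually implemented.

```python
def bar_lengths(counts, max_width=100, exponent=1.0):
    """Scale counts to bar lengths relative to the largest language.

    exponent=1.0 gives linear bars; exponent=1/3 gives cube-root bars.
    """
    scaled = {lang: n ** exponent for lang, n in counts.items()}
    top = max(scaled.values())
    return {lang: max_width * value / top for lang, value in scaled.items()}


# Hypothetical counts, not real Tatoeba figures.
counts = {"aaa": 500_000, "bbb": 50_000, "ccc": 500}

linear = bar_lengths(counts)
cube_root = bar_lengths(counts, exponent=1 / 3)

for lang in counts:
    print(f"{lang}: linear={linear[lang]:6.2f}  cube_root={cube_root[lang]:6.2f}")

# Diminishing returns under the cube root: doubling 64 sentences to 128
# moves the scaled value from about 4 to only about 5.04.
print(round(64 ** (1 / 3), 2), round(128 ** (1 / 3), 2))
```

With linear scaling, a language that has a thousandth of the top language's sentences gets a thousandth of the bar; with the cube root it gets a tenth, which makes small languages visible but no longer shows their true weight in the corpus.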

tommy_san, May 4, 2015 at 1:10:04 PM UTC

Why are there -10 members who speak an unknown language? Who are they? ☺

TRANG, May 4, 2015 at 9:41:11 PM UTC (edited May 4, 2015 at 10:05:00 PM UTC)

Thank you for your feedback, everyone.

I have updated the dev website to address the main issues that I felt were necessary to fix before the weekend.

I moved the stats about admins, corpus maintainers, etc. out to a dedicated page.
http://dev.tatoeba.org/eng/stats/native_speakers ("Members" > "Native speakers")
I hope things are clearer now.

I won't have time to implement the other suggestions, but I'm keeping them in mind. I don't want to spend too much time on these stats.

Just a note about the "progress bars" on the sentences stats page. They are not progress bars; they are a visual representation of the distribution of the languages in Tatoeba. You should look at them as a graph rather than as progress bars. Languages with no bar are languages that have barely any content compared to other languages.

tommy_san, May 4, 2015 at 11:36:24 PM UTC

I thought https://dev.tatoeba.org/stats/s...es_by_language was almost OK the way it was. It's the page linked to the top page, right? I liked it because it would give visitors a clear idea about the corpus and the community. The only problem was that the page was entitled "Numbers of Sentences".

So I'd suggest using the previous version, changing the title to "Languages on Tatoeba" and adding the captions "(Number of) Sentences" and "(Number of) Native Speakers".

The list of languages sorted by the number of native speakers is of some interest, to be sure, but I don't think it's worth dedicating a separate page to it. It would be enough if the table were sortable.

TRANG, May 5, 2015 at 2:56:29 PM UTC

> I thought https://dev.tatoeba.org/stats/s...es_by_language was almost OK the way it was. It's the page linked to the top page, right?

Yes, it's that page.

> I liked it because it would give visitors a clear idea about the corpus and the community.

That was also my idea.

Initially I wanted to simply add a column that displays the number of corpus maintainers that we have for each language, because I wanted to clearly see how well we can take care of each language: which languages have enough corpus maintainers and which languages don't have any. And perhaps having such stats would encourage more people to try and become corpus maintainers, or search for new members who could become corpus maintainers.

Then I figured, why not display the number of contributors we have for each language, based on their status (admin, corpus maintainer, advanced contributor, contributor)? At first I included all language levels, not just the native speakers. But I realised it would make more sense to only include native speakers, since it's not really relevant to see (for instance) that there is one corpus maintainer in Japanese if the level of the corpus maintainer is "beginner".

Then I wanted to add some warning icon for languages where we lack corpus maintainers. Perhaps display each language with a different level of warning: a big warning when we lack corpus maintainers, a smaller warning if we have advanced contributors but no corpus maintainers. Perhaps also a green "check" if we have more than 2 corpus maintainers.

Then I had plenty of other ideas in mind, but this is usually the point where I tell myself I'm getting carried away and I stop.
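
The warning levels described above could be sketched roughly as follows. This is only one possible reading of the idea: the function name `maintainer_status`, the returned labels, and the choice to show no icon for languages with one or two corpus maintainers are assumptions, not something that exists in Tatoeba.

```python
def maintainer_status(corpus_maintainers, advanced_contributors):
    """Return a hypothetical icon label for one language.

    Both arguments count native speakers only, as suggested in the post.
    """
    if corpus_maintainers > 2:
        return "check"          # more than 2 corpus maintainers
    if corpus_maintainers > 0:
        return "none"           # assumption: 1 or 2 maintainers, no icon
    if advanced_contributors > 0:
        return "small warning"  # advanced contributors but no maintainers
    return "big warning"        # neither maintainers nor advanced contributors


# Hypothetical languages and figures, for illustration only.
for lang, cm, ac in [("aaa", 3, 5), ("bbb", 1, 0), ("ccc", 0, 2), ("ddd", 0, 0)]:
    print(lang, maintainer_status(cm, ac))
```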


> So I'd suggest using the previous version, changing the title to "Languages on Tatoeba"
> and adding the captions "(Number of) Sentences" and "(Number of) Native Speakers".

I also felt that changing the title might be a solution to prevent the confusion, but then it led me to ask myself some questions.
Perhaps dividing the native speakers by status is too detailed, and perhaps it would be enough to have either the total number of native speakers, or maybe just the number of admins + corpus maintainers.
Perhaps there is a way to add some information about the level of expertise we have in each language, which can be interesting information as well (some languages have many members, but none of them are native, while other languages have only a few members but most of them are native speakers).

In the end I didn't want to spend time deciding on these things, since they are not part of my initial goal, which is why I decided to create a separate page for the native speakers stats rather than tweaking the current sentences_by_language page. I prefer to give myself more time to think of a new page that would synthesize all the information we have about each language, in a way that is satisfying enough for veteran users, and not too cryptic for new visitors.

But I completely share your vision that it would be much better if the "all languages" link showed more than the number of sentences in each language.

User55521, May 5, 2015 at 7:19:59 AM UTC (edited May 5, 2015 at 7:21:49 AM UTC)

> that have barely any content compared to other languages

They have barely any content compared to English and Esperanto, but not compared to most other languages.

Since very few languages are in the upper part of the list and most are in the lower part, it makes sense to make a graph that is useful for the latter. If you don't like square roots, you could just use a 'broken line' for the languages in the upper part of the list (as is often done on charts, http://www.andypope.info/charts/brokencol.gif ) to show that English and Esperanto have way more sentences than the others. In fact, the place where the bar is broken could be used to show how many sentences these first languages have.
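
As a rough illustration of the broken-bar idea, here is a small text-only Python sketch. The function `broken_bar`, the cap of 100,000 and the counts are hypothetical; a real chart would draw the break graphically rather than with `//`.

```python
def broken_bar(count, cap, width=40):
    """Render a text bar scaled linearly up to `cap`.

    Counts above `cap` get a full-width bar with a '//' break mark,
    and the exact count is always printed after the bar.
    """
    if count > cap:
        half = width // 2
        return "#" * half + "//" + "#" * (width - half) + f" {count}"
    filled = round(width * count / cap)
    return "#" * filled + f" {count}"


# Hypothetical counts, not real Tatoeba figures.
counts = {"aaa": 500_000, "bbb": 60_000, "ccc": 15_000}
for lang, n in counts.items():
    print(f"{lang} {broken_bar(n, cap=100_000)}")
```

Bars below the cap stay proportional to each other, so the smaller languages can still be compared, while the number next to a broken bar carries the information that the bar itself no longer shows.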

TRANG, May 5, 2015 at 12:41:39 PM UTC

You say that the bars are not useful for languages in the lower part of the graph, but what information do you expect from these bars?

To me, each bar was never meant to be useful on its own. The important information is carried by the whole graph, which is there to give a global view of how the languages are distributed within the corpus. If we took the square root or logarithm or whatever, it would give the wrong idea about the weight of each language in the corpus.

If the information that you are looking for is to know "how many sentences do we need to add in a certain language so that it exceeds the language above", then obviously it is not best represented by the current graph. But this kind of information should be implemented as an extra feature, not by replacing the current graph, which in my opinion is not really useless but simply has another purpose.

In any case, for other visualisations of Tatoeba's data, I encourage anyone who has a bit of programming knowledge to use the files that we export[1] and make your own visualisation, just like someone else did for the translations some time ago: https://tatoeba.org/eng/wall/sh...#message_21926
I personally loooove data visualisation but I can't afford to spend time implementing such things in Tatoeba, so I'm counting on other people to do it and share their work with us on the Wall :)

[1] http://downloads.tatoeba.org/exports/
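
For anyone who wants to try this, a possible starting point in Python is sketched below. It assumes the sentences export is a tab-separated file with one sentence per line in the order id, language code, text, and that it has been saved locally as `sentences.csv`; please check the downloads page for the actual file names and formats. The sample rows are made up.

```python
import csv
import io
from collections import Counter


def count_by_language(lines):
    """Count sentences per language code from tab-separated export lines.

    Assumed column order: sentence id, language code, sentence text.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[1] for row in reader if len(row) >= 3)


# Made-up sample rows standing in for the real export file.
sample = io.StringIO(
    "1\teng\tHello.\n"
    "2\tfra\tBonjour.\n"
    "3\teng\tIt's late.\n"
)
counts = count_by_language(sample)
for lang, n in counts.most_common():
    print(lang, n)

# With a downloaded file (the file name is an assumption):
# with open("sentences.csv", encoding="utf-8", newline="") as f:
#     counts = count_by_language(f)
```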

Silja, May 5, 2015 at 4:48:25 PM UTC (edited May 5, 2015 at 5:32:03 PM UTC)

How about adding the percentage of the whole database that each language covers (percentage of total sentences)? That would also give a holistic view to those who prefer numbers over graphs.

Edit: It would also be interesting if there were some comparisons between the languages present in Tatoeba and in the world. I mean, what is the most common language in Tatoeba (most speakers) vs. the most spoken languages in the world. Etc. http://www.washingtonpost.com/b...rts/?tid=sm_fb

Lepotdeterre, May 6, 2015 at 9:20:34 AM UTC

I think that the percent idea is good, as long as the numerical percentage is written out. If it's just another progress bar, each bar will be the same length as it is now, due to proportionality, and the less-represented languages will still have miniature lines for bars.
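
The proportionality point can be checked with a few lines of Python. The helper `percentages` and the counts are made up for illustration.

```python
def percentages(counts):
    """Return each language's share of the total, in percent."""
    total = sum(counts.values())
    return {lang: 100 * n / total for lang, n in counts.items()}


# Hypothetical counts, not real Tatoeba figures.
counts = {"aaa": 500_000, "bbb": 50_000, "ccc": 500}
shares = percentages(counts)

for lang, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{lang}: {pct:.2f}%")

# The ratio between two shares equals the ratio between the raw counts,
# so bars drawn from percentages would look exactly like the current ones.
print(shares["aaa"] / shares["ccc"], counts["aaa"] / counts["ccc"])
```

Written as numbers, the percentages stay readable even for very small languages, which is where they add something the bars cannot.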