Tatoeba
sharptoothed, 2018-09-24 09:37:05 UTC

** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

To admins and other members whom it may concern:
The format of the 'users.csv' file available in the Tatoeba exports directory (https://downloads.tatoeba.org/exports/) has changed unexpectedly. I noticed it too late, so the 'Users Total' counter data in the 'User Activity' and 'Contributors' charts for 2018-09-08 and 2018-09-15 has been lost. Sorry guys. :-(
The 'users.csv' file no longer contains the 'user registration date' field. As a result, the 'User Colours' feature is broken now. Too bad. :-(

Guybrush88, 2018-09-24 09:42:55 UTC

as always, thanks for your stats :)

about the 'users.csv' issue you reported, I opened a ticket on the bug tracker: https://github.com/Tatoeba/tatoeba2/issues/1674

sharptoothed, 2018-09-24 09:45:53 UTC

Thanks a lot! :-)

Guybrush88, 2018-09-24 09:59:43 UTC

you're welcome. Also, if you have some details about this part "I noticed it too late so 'Users Total' counters' data in 'User Activity' and 'Contributors' charts for 2018-09-08 and 2018-09-15 has been lost. Sorry guys. :-(", I'll add them to the bug report

sharptoothed, 2018-09-24 10:33:10 UTC

I have no details to add, actually. I had no time to check the data on those dates so I just processed the dumps and updated my databases blindly.
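For anyone processing the dumps unattended, a cheap guard against this kind of silent breakage is to check the column count before touching the database. The sketch below is only an illustration: the expected column count, the tab delimiter and the `check_dump` / `FormatChanged` names are assumptions, not the actual layout of 'users.csv' or anything from my scripts.

```python
import csv
import io

# Assumption: the number of columns the consumer script was written for.
EXPECTED_COLUMNS = 4


class FormatChanged(Exception):
    """Raised when a dump row does not have the expected number of columns."""


def check_dump(text, expected=EXPECTED_COLUMNS, delimiter="\t"):
    """Return the number of rows if every row has `expected` columns.

    Raises FormatChanged on the first row with a different column count,
    so the caller can stop before updating any database.
    """
    count = 0
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != expected:
            raise FormatChanged(
                f"line {line_no}: expected {expected} columns, got {len(fields)}"
            )
        count += 1
    return count
```

Running a check like this first turns a silently wrong import into a loud failure on the day the format changes.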

TRANG, 2018-09-26 10:41:30 UTC

Sorry I broke your tool!

For the first problem, which is that you're now missing data, I have to say that at this stage I am not confident it's okay for us to export the registration date. It's kind of personal and we're doing this without explicit consent from the users. To be honest, the registration date is not the only thing. Anything that is user related is, generally speaking, questionable to export without the user's explicit consent.

Gillux expressed the issue quite well in this comment:
https://github.com/Tatoeba/tato...ment-180814467

We're not doing anything outrageous in terms of privacy violation, but we're far from doing the best we can. Back when we started exporting more and more data, we really didn't think much about privacy, and we still haven't taken the time to seriously think about or discuss this topic.

Perhaps now is an occasion...

I personally feel uncomfortable exporting user related data, knowing that we have not asked permission. We just figured it's useful/interesting and assumed everyone was fine with us exporting this data. That's maybe not the case, maybe some people are not fine with that.

If you (or anyone) think differently about this issue, feel free to share your opinion.


As for the other problem, in the GitHub issue, gillux asked:

> How to keep the format of exported data consistent over time,
> so that people can use it without fearing unexpected changes?

In this particular case, we could have kept the same number of columns in the CSV file and just put NULL values in the columns we wanted to remove. This would at least have prevented your scripts from breaking. If that's still relevant, we can push a fix for that.
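To make the idea concrete, here is a rough sketch of the placeholder approach. It is not the actual export code: the column names, the tab delimiter and the `\N` NULL marker are assumptions made for the example.

```python
import csv
import io

# Hypothetical column layout; the real users.csv layout may differ.
COLUMNS = ["id", "username", "registered", "level"]
WITHHELD = {"registered"}  # columns whose values are no longer exported
NULL = r"\N"               # assumed NULL marker in the export


def export_rows(rows):
    """Write rows as tab-separated text, keeping every column position.

    Withheld columns are still emitted, but with a NULL placeholder.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(
            [NULL if col in WITHHELD else row[col] for col in COLUMNS]
        )
    return buf.getvalue()


def read_levels(text):
    """A consumer that reads by position, as an older script might."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {fields[1]: fields[3] for fields in reader}


sample = [
    {"id": 1, "username": "alice", "registered": "2010-01-01", "level": "contributor"}
]
exported = export_rows(sample)
levels = read_levels(exported)
```

Because the `level` value stays in the fourth position, a positional reader like `read_levels` keeps working even though the registration date is gone.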

But I think keeping the format consistent is only half of the problem. In some cases it may be impossible or not reasonable for us to keep the format consistent indefinitely.

The other half of the problem is how to communicate and synchronize the changes with the consumers of our data. Maybe it's time we revive our Google Group. Or maybe there is another, more appropriate platform. But this incident shows we need to put more effort into being connected with those who reuse our data.

sharptoothed, 2018-09-26 12:07:05 UTC, edited 2018-09-27 11:47:22 UTC

Thanks for the clarification, Trang! :-)
Indeed, personal data is quite a delicate question, since even seemingly unimportant data may suddenly turn out to be important. In this sense, personal info and other user-related data shouldn't be exported, nor should it be visible on the website, unless the user has explicitly permitted it. Parsing the contents of a web page is just a little bit harder than parsing the contents of a data dump, you know. :-)

As for the file format change, I've adapted my scripts and it's not an issue any more. Maybe it's worth switching from the CSV data format to XML or, say, JSON to avoid such problems in the future.
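A rough sketch of why a keyed format like JSON would help: fields are looked up by name, so dropping one does not shift the others. The record layout and the `load_users` helper below are made up for the illustration; Tatoeba does not publish the dump in this form.

```python
import json

# Hypothetical dumps before and after the registration date was dropped.
old_dump = '[{"id": 1, "username": "alice", "registered": "2010-01-01", "level": "contributor"}]'
new_dump = '[{"id": 1, "username": "alice", "level": "contributor"}]'


def load_users(text):
    """Read users by field name; a missing optional field becomes None."""
    users = []
    for record in json.loads(text):
        users.append(
            {
                "username": record["username"],
                "level": record["level"],
                # .get() tolerates the field being removed from the export
                "registered": record.get("registered"),
            }
        )
    return users
```

The same function reads both dumps. With the new one, `registered` simply comes back as `None` instead of some other column's value landing in its place.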

Aiji, 2018-09-27 10:11:02 UTC, edited 2018-09-28 08:01:45 UTC

I think the same about user personal data. It's fine to use the data to perform analysis intra-Tatoeba, but exporting the content of a profile is not really fine actually.
Besides, thinking further than "it's cool to have data exported", I cannot really think of any data relevant to extra-Tatoeba tools, except the language proficiency levels. Would there be any other?

About the change of format, I don't think that keeping the same format forever to ensure tool compatibility is the best solution. Once in a while, as you said, the format HAS to change. That's why every piece of software has so many versions^^
I'm not very kind, so for me it's just fine to make your changes and somehow spread the message that there is a new format (just need to find how to spread the message^^). If the tool is currently used / monitored, people will eventually see the problem. (Lost data due to incompatibility is another issue, but it's not really your fault).

Another option is to duplicate the export... I mean you keep the last version in the old format in its place and put the new-format version somewhere else. But then other problems occur: people don't realize their files are no longer being updated, etc. That's an industry-wide issue so... :)