sharptoothed, September 24, 2018 at 9:37:05 AM UTC

** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

To admins and other members whom it may concern:
The format of the 'users.csv' file in the Tatoeba exports directory (https://downloads.tatoeba.org/exports/) was changed unexpectedly. I noticed too late, so the 'Users Total' counter data in the 'User Activity' and 'Contributors' charts for 2018-09-08 and 2018-09-15 has been lost. Sorry guys. :-(
The 'users.csv' file no longer contains the 'user registration date' field. As a result, the 'User Colours' feature is now broken. Too bad. :-(

Guybrush88, September 24, 2018 at 9:42:55 AM UTC

as always, thanks for your stats :)

Regarding the issue you reported with the 'users.csv' file, I opened a ticket on the bug tracker for it: https://github.com/Tatoeba/tatoeba2/issues/1674

sharptoothed, September 24, 2018 at 9:45:53 AM UTC

Thanks a lot! :-)

Guybrush88, September 24, 2018 at 9:59:43 AM UTC

You're welcome. Also, if you have any details about this part: "I noticed it too late so 'Users Total' counters' data in 'User Activity' and 'Contributors' charts for 2018-09-08 and 2018-09-15 has been lost. Sorry guys. :-(", I'll add them to the bug report.

sharptoothed, September 24, 2018 at 10:33:10 AM UTC

I have no details to add, actually. I had no time to check the data on those dates so I just processed the dumps and updated my databases blindly.

TRANG, September 26, 2018 at 10:41:30 AM UTC

Sorry I broke your tool!

For the first problem, which is that you're now missing data, I have to say that at this stage I am not confident it's okay for us to export the registration date. It's kind of personal and we're doing this without explicit consent from the users. To be honest, the registration date is not the only thing. Anything that is user related is, generally speaking, questionable to export without the user's explicit consent.

Gillux expressed the issue quite well in this comment:
https://github.com/Tatoeba/tato...ment-180814467

We're not doing anything outrageous in terms of privacy violation, but we're far from doing the best we can. Back when we started exporting more and more data, we really didn't think much about privacy, and we still haven't taken the time to seriously think about or discuss this topic.

Perhaps now is an occasion...

I personally feel uncomfortable exporting user-related data, knowing that we have not asked permission. We just figured it's useful/interesting and assumed everyone was fine with us exporting this data. That may not be the case; some people may not be fine with that.

If you (or anyone) think differently about this issue, feel free to share your opinion.


As for the other problem, in the GitHub issue, gillux asked:

> How to keep the format of exported data consistent over time,
> so that people can use it without fearing unexpected changes?

In this particular case, we could have kept the same number of columns in the CSV file and just put NULL values in the columns we wanted to remove. This would at least have kept your scripts from breaking. If that's still relevant, we can push a fix for that.
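
To illustrate the idea, here is a minimal sketch in Python, not our actual export code. The column names, the tab delimiter and the `\N` NULL marker are all assumptions made for the example; the real 'users.csv' layout may differ.

```python
import csv
import io

# Hypothetical column layout, for illustration only.
OLD_COLUMNS = ["id", "username", "registered", "level"]
# Columns we no longer want to publish.
REMOVED = {"registered"}
# Assumed NULL marker for the export.
NULL = r"\N"


def export_rows(users):
    """Write one line per user, keeping the old column count.

    Removed columns are still emitted, but always as NULL.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for user in users:
        row = [
            NULL if col in REMOVED else user.get(col, NULL)
            for col in OLD_COLUMNS
        ]
        writer.writerow(row)
    return buf.getvalue()


def parse_rows(text):
    """Consumer side: positional parsing keeps working, NULL becomes None."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [
        {col: (None if val == NULL else val)
         for col, val in zip(OLD_COLUMNS, row)}
        for row in reader
    ]
```

A script that reads columns by position would keep running against such a file; it would simply see no value where the registration date used to be.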

But I think keeping the format consistent is only half of the problem. In some cases it may be impossible or not reasonable for us to keep the format consistent indefinitely.

The other half of the problem is how to communicate and synchronize the changes with the consumers of our data. Maybe it's time we revive our Google Group. Or maybe there is another, more appropriate platform. But this incident shows we need to put more effort into staying connected with those who reuse our data.

sharptoothed, September 26, 2018 at 12:07:05 PM UTC, edited September 27, 2018 at 11:47:22 AM UTC

Thanks for the clarification, Trang! :-)
Indeed, personal data is quite a delicate question, since even seemingly unimportant data may suddenly turn out to be important. In this sense, any personal info and other user-related data shouldn't be exported, nor should it be visible on the website unless the user has permitted it explicitly. Parsing the contents of a web page is just a little bit harder than parsing the contents of a data dump, you know. :-)

As for the file format change, I've adapted my scripts and it's not an issue any more. Maybe it's worth switching from the CSV data format to XML or, say, JSON to avoid such problems in the future.
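
Just to show what I mean, here is a rough Python sketch. It assumes a hypothetical export with one JSON object per line and made-up field names ('username', 'registered'); nothing like this exists in the exports directory as far as I know. With keyed records, a removed field is simply absent and the reader can fall back to a default instead of misreading shifted columns.

```python
import json


def read_users(json_lines):
    """Read users from a hypothetical JSON Lines export.

    Fields are looked up by name, so dropping an optional field
    from the export does not shift or break the remaining ones.
    """
    users = []
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        users.append({
            "username": record["username"],            # required field
            "registered": record.get("registered"),    # optional, None if absent
        })
    return users
```

The trade-off is that the files get bigger, since every record repeats the field names.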

Aiji, September 27, 2018 at 10:11:02 AM UTC, edited September 28, 2018 at 8:01:45 AM UTC

I think the same about users' personal data. It's fine to use the data to perform analysis intra-Tatoeba, but exporting the content of a profile is not really fine, actually.
Besides, thinking further than "it's cool to have data exported", I cannot really think of any relevant data for extra-Tatoeba tools, except the proficiency of the languages. Would there be any others?

About the change of format, I don't think that keeping the same format forever to ensure tool compatibility is the best solution. Once in a while, as you said, the format HAS to change. That's why every piece of software has so many versions^^
I'm not very kind, so for me it's just fine to make your changes and somehow spread the message that there is a new format (just need to find how to spread the message^^). If the tool is currently used / monitored, people will eventually see the problem. (Lost data due to incompatibility is another issue, but it's not really your fault.)

Another option is to duplicate: I mean you keep the last version of the old format in its place and put the new format somewhere else. But other problems arise: people don't realize their files are no longer being updated, etc. That's an industry-wide issue so... :)
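
On the tool side, a cheap safeguard is to check the shape of a dump before loading it, so a format change stops the import instead of silently corrupting data. Here is a small Python sketch of that idea; the expected column count of 4 and the tab delimiter are assumptions for the example, not the real 'users.csv' layout.

```python
import csv
import io

# Hypothetical: the number of columns the tool was written against.
EXPECTED_COLUMNS = 4


class FormatChanged(Exception):
    """Raised when a dump does not have the expected shape."""


def check_dump(text, expected=EXPECTED_COLUMNS, delimiter="\t"):
    """Verify that every row has the expected number of columns.

    Returns the number of rows checked, or raises FormatChanged
    on the first row that does not match.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    count = 0
    for line_no, row in enumerate(reader, start=1):
        if len(row) != expected:
            raise FormatChanged(
                f"line {line_no}: expected {expected} columns, got {len(row)}"
            )
        count += 1
    return count
```

Running such a check before the database update means the worst case is a skipped week that can be re-imported later, rather than lost counters.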