Very nice work :-)
I checked our webserver logs to see which websites hotlink audio files from Tatoeba. To my surprise, I found 17 unique websites over the past two weeks. Mostly online language learning platforms. I could see that many of them do not give proper attribution, but some also violate the non-commercial clause of some audio, as well as the "no offsite use" clause.
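A check like the one described above could be sketched roughly as follows. This is only an illustration, not the command that was actually run: it assumes the access log uses the common "combined" format, that audio files end in `.mp3`, and that `tatoeba.org` is the only host considered "ours". The function name `hotlinking_hosts` is made up for this example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed "combined" log format: ... "METHOD path PROTO" status size "referer" "agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)

def hotlinking_hosts(lines, own_hosts=("tatoeba.org",), audio_suffix=".mp3"):
    """Count audio requests per external referring host (hypothetical helper)."""
    hosts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # line does not look like a combined-format entry
        if not match.group("path").endswith(audio_suffix):
            continue  # not an audio file
        host = urlparse(match.group("referer")).hostname
        if host is None:
            continue  # empty or "-" referer
        if any(host == own or host.endswith("." + own) for own in own_hosts):
            continue  # request coming from our own pages
        hosts[host] += 1
    return hosts

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /audio/1.mp3 HTTP/1.1" 200 512 "https://learning.example/lesson" "UA"',
    '1.2.3.5 - - [01/Jan/2024:00:00:01 +0000] "GET /audio/2.mp3 HTTP/1.1" 200 512 "https://tatoeba.org/en/sentences/show/2" "UA"',
    '1.2.3.6 - - [01/Jan/2024:00:00:02 +0000] "GET /audio/3.mp3 HTTP/1.1" 200 512 "-" "UA"',
]
print(hotlinking_hosts(sample))
```

The Referer header is optional and easy to spoof, so counts obtained this way are only a lower bound on actual offsite use.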
On the one hand, I could contact each of them to ask them to give proper attribution or stop commercial use. As a developer and sysadmin of Tatoeba, I could even prevent hotlinking and effectively break their platform.
On the other hand, the general goal of Tatoeba is to spread knowledge and I don’t want to mess around just for the sake of some licence restrictions. After all, only the member who created the audio can legally complain.
So should Tatoeba do something about that?
On a side note, I’d like to point out that many of our audio contributors use a Creative Commons licence with a non-commercial restriction. While this makes sense at first glance (I don’t want somebody making a profit out of something I put a lot of effort into), in practice, the line between commercial and non-commercial is quite blurry, and there are lots of use cases involving money that are pretty fair. For example, I would not be able to include such audio in a presentation given at a convention or conference (like Polyglot Conference Global), just because there is an entrance fee. More about that: https://kefletcher.blogspot.com...ommercial.html
How sad… It’s always the best ones who leave first. :'-(
I’m interested too. :-)
You are right. We chose to default the license to "no offsite use" at the time we were gathering information about past audio contributions. In the early days of the project, audio used to be contributed without thinking about licenses and without even properly attributing contributors. Around 2016, we worked on adding audio metadata and proper attribution. We contacted past audio contributors and searched through the history of the project. For some audio recordings, though, we weren’t able to find attribution (or the contributor just didn’t choose a license), which is why the legally safe default "no offsite use" was set. However, later on, on the 20th of January 2019, we introduced new Terms of Use (see the link in the footer) that explicitly default audio contributions to CC-BY. Consequently, we should also update the default audio license for all users who accepted these latest terms. @TRANG Can you confirm?
I confirm it is just the audio you have searched for.
It could be a user choice, or it could also be a lack of choice.
Audio recordings contributed by users who did not select a license default to "No license for offsite use".
These were erroneous lines. I fixed the problem. Thanks!
Note that for technical reasons each query is reported twice, so you want to divide numbers by two.
Yes, it would be better to automatically schedule this task. Please open an issue on GitHub so that we don’t forget to do it.
I lost the command line I used at that time. I did it again today and I uploaded the command to the repository. This is just a quick and dirty sed one-liner. We could possibly have Manticore produce the logs directly in a format usable by Tatominer, which would make things easier.
By the way, this new command does not produce records with empty queries, unlike before.
I updated the queries.csv file today. You can download the same file again to get the updated version.
By the way, if you'd like I can also include the number of results each query returned at the time it was run.
Well, some work has been done towards introducing that feature. When the issue was opened, in 2014, there was absolutely no information regarding audio recordings on Tatoeba except "there is an audio for that sentence" (a boolean flag for each sentence). Later on, in 2016, I worked on indicating contributors of audio: https://github.com/Tatoeba/tatoeba2/pull/1378 (audio metadata for each sentence). Technically speaking, this paved the way for multiple recordings per sentence, although nobody worked on that specifically afterwards. I understand, however, that from a user’s perspective it feels like very little changed.
Note that anyone is welcome to contribute code. If you have programming skills, or if you know somebody who does and can convince this person to help, more features could be added.
I am a bit late to the party but I’d like to add something.
I can think of two ways near-duplicate sentences get in my way when I use the search.
1. I am looking for various uses and contexts for a given keyword, but results are cluttered with near-duplicates, so I need to scroll and click "next page" until I find a sentence that is actually different. Note that the default sort order "Relevance" makes it all the more likely that near-duplicates are going to follow one another. As an experienced user, I know that I can use the "Random" sort order to get around that problem, but I really don’t think it’s obvious for newcomers.
2. To make matters worse, sentences showing in the results will likely have near-duplicates appearing as indirect translations too, which duplicates information everywhere. For example, given near-duplicate sentences A, B and C, all translations of X, showing in the results, we see:
- A
Translations
-- X
Translations of translations
-- B
-- C
- B
Translations
-- X
Translations of translations
-- A
-- C
- C
Translations
-- X
Translations of translations
-- A
-- B
The problem is that we are seeing A, B and C three times each. The information is duplicated everywhere, which feels messy and overloaded.
I don’t think near-duplicates are a problem per se, but the way we display search results sometimes makes me feel that they are getting in my way. But I think it’s just a matter of information display. If we had a way to automatically cluster near-duplicates (maybe with some AI), maybe we could somehow display them as groups, so that you can either browse into them or filter them out depending on your use case.
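To make the grouping idea a bit more concrete, here is a minimal sketch. It does not use any AI: it swaps in plain string similarity from Python's standard library (`difflib.SequenceMatcher`) with a greedy grouping, which is only one possible approach. The function names and the 0.85 threshold are arbitrary choices for this illustration and are not part of Tatoeba's code.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_near_duplicates(sentences, threshold=0.85):
    """Greedily group sentences that are similar to a cluster's first member.

    The threshold is an arbitrary assumption; a real system would need tuning
    per language and probably a smarter comparison.
    """
    clusters = []
    for sentence in sentences:
        for cluster in clusters:
            if similarity(sentence, cluster[0]) >= threshold:
                cluster.append(sentence)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([sentence])
    return clusters

results = [
    "I like apples.",
    "I like apples!",
    "I like apples",
    "The weather is nice today.",
]
for group in cluster_near_duplicates(results):
    print(group)
```

With groups like these, a results page could show one representative per group and let the user expand the rest, or hide groups entirely when looking for varied contexts.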
This problem (and especially wuu/shanghainese) has been discussed and partially solved in that thread https://github.com/Tatoeba/tatoeba2/issues/1670 but there is still some work to do. As much as we’d like to fix all the language names, each renaming needs to be studied with insight from native speakers and contributors. I suggest contacting sabretou if you want to help or inquire about it.
Thank you for helping us with this! I updated tatoeba.org today so your fixes are now online.
What a great idea! I am sure you will do great. You can inspire the audience with your love and dedication to the project.
This was also a temporary error. Thank you for letting us know!
I took note of your need.
The video on that page, though dated, can also give you some answers: https://tatoeba.org/fra/about
I looked into this problem. It happens because of a mistranslation of the Spanish UI. The date is translated here: https://www.transifex.com/tatoe...urce/186665868 (this link requires a Transifex account). The total number of sentences is here: https://www.transifex.com/tatoe...urce/163036559
You are welcome to improve the Spanish translation if you’d like. See the instructions here: https://en.wiki.tatoeba.org/art...ce-translation