Tatoeba
gillux's messages on the Wall (total 595)

gillux, April 30, 2022 at 8:34:53 PM UTC

Very nice work :-)

gillux, April 24, 2022 at 11:35:20 PM UTC, edited April 30, 2022 at 8:07:06 PM UTC

I checked our webserver logs to see which websites hotlink audio files from Tatoeba. To my surprise, I found 17 unique websites over the past two weeks, mostly online language learning platforms. I could see that many of them do not give proper attribution, and some also violate the non-commercial clause or the "no offsite use" clause attached to some audio.
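As a rough illustration of that kind of check, here is a minimal sketch that collects the external referrer hosts requesting audio files. It assumes the access log uses the common "combined" log format; the `/audio/` path prefix, the `hotlinking_hosts` helper and the sample lines are hypothetical and are not taken from Tatoeba's actual setup.

```python
import re
from urllib.parse import urlparse

# Matches the request path and the Referer field of a "combined" format log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)

def hotlinking_hosts(log_lines, own_host="tatoeba.org", audio_prefix="/audio/"):
    """Return the set of external hosts whose pages requested audio files."""
    hosts = set()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match or not match.group("path").startswith(audio_prefix):
            continue  # not an audio request (or not a parsable line)
        host = urlparse(match.group("referer")).hostname
        # Skip empty referrers ("-") and requests coming from our own site.
        if host and host != own_host and not host.endswith("." + own_host):
            hosts.add(host)
    return hosts

# Hypothetical sample lines, for illustration only.
sample = [
    '203.0.113.5 - - [24/Apr/2022:10:00:00 +0000] "GET /audio/1.mp3 HTTP/1.1" 200 1234 "https://learn.example.com/lesson" "Mozilla/5.0"',
    '203.0.113.6 - - [24/Apr/2022:10:00:01 +0000] "GET /audio/2.mp3 HTTP/1.1" 200 1234 "https://tatoeba.org/en/sentences/show/2" "Mozilla/5.0"',
    '203.0.113.7 - - [24/Apr/2022:10:00:02 +0000] "GET /audio/3.mp3 HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [24/Apr/2022:10:00:03 +0000] "GET /index.html HTTP/1.1" 200 512 "https://other.example.org/" "Mozilla/5.0"',
]
external = hotlinking_hosts(sample)
```

Note that the Referer header is optional and easy to spoof, so a list built this way is only an approximation.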

On the one hand, I could contact each of them and ask them to give proper attribution or stop commercial use. As a developer and sysadmin of Tatoeba, I could even prevent hotlinking and effectively break their platforms.

On the other hand, the general goal of Tatoeba is to spread knowledge and I don’t want to mess around just for the sake of some licence restrictions. After all, only the member who created the audio can legally complain.

So should Tatoeba do something about that?

On a side note, I’d like to point out that many of our audio contributors use a Creative Commons licence with a non-commercial restriction. While this immediately makes sense (don’t want to have somebody making profit out of something I put a lot of effort into), in practice, the line between commercial and non-commercial is quite blurry, and there are lots of use cases involving money exchange that are pretty fair. For example, I would not be able to include such audio in a presentation made at a convention or conference (like Polyglot Conference Global), just because there is an entrance fee. More about that: https://kefletcher.blogspot.com...ommercial.html

gillux, February 13, 2022 at 10:43:41 PM UTC

How sad… It is always the best ones who leave first. :'-(

gillux, October 15, 2021 at 1:37:24 PM UTC

I’m interested too. :-)

gillux, September 14, 2021 at 11:29:17 PM UTC

You are right. We chose to default the license to "no offsite use" at the time we were gathering information about past audio contributions. In the early days of the project, audio used to be contributed without thinking about licenses and without even properly attributing contributors. Around 2016, we worked on adding audio metadata and proper attribution. We contacted past audio contributors and searched through the history of the project. For some audio recordings, though, we weren’t able to find attribution (or the contributor just didn’t choose a license), which is why the legally safe default "no offsite use" was set. However, later on, on the 20th of January 2019, we introduced new Terms of Use (see the link in the footer) that explicitly default audio contributions to CC-BY. Consequently, we should also update the default audio license for all users who accepted these latest terms. @TRANG Can you confirm?

gillux, September 14, 2021 at 1:21:45 AM UTC

I confirm it is just the audio you have searched for.
It could be a user choice, or it could also be a lack of choice.
Audio recordings contributed by users who did not select a license default to "No license for offsite use".

gillux, September 14, 2021 at 1:18:57 AM UTC

These were erroneous lines. I fixed the problem. Thanks!

gillux, September 13, 2021 at 11:30:23 PM UTC

Note that for technical reasons each query is reported twice, so you need to divide the numbers by two.
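A minimal sketch of that correction, assuming the reported queries are available as a plain list of strings. The `corrected_counts` helper and the sample data are hypothetical and are not part of Tatoeba or Tatominer.

```python
from collections import Counter

def corrected_counts(queries):
    """Count each query, then halve the counts because every query is reported twice."""
    raw = Counter(queries)
    return {query: count // 2 for query, count in raw.items()}

# Hypothetical sample: "cat" was really searched once, "dog" twice.
reported = ["cat", "cat", "dog", "dog", "dog", "dog"]
counts = corrected_counts(reported)
```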

gillux, September 13, 2021 at 11:11:50 PM UTC

Yes, it would be better to automatically schedule this task. Please open an issue on GitHub so that we don’t forget to do it.

gillux, September 11, 2021 at 2:02:44 PM UTC

I lost the command line I used at that time. I did it again today and uploaded the command to the repository. This is just a quick and dirty sed one-liner. We could possibly have Manticore produce the logs directly in a format usable by Tatominer, which would make things easier.

By the way, this new command does not produce records with empty queries, unlike before.

gillux, September 11, 2021 at 1:55:19 PM UTC

I updated the queries.csv file today. You can download the same file again to get the updated version.

By the way, if you'd like I can also include the number of results each query returned at the time it was run.

gillux, September 11, 2021 at 1:15:14 AM UTC

Well, some work has been done towards introducing that feature. When the issue was opened, in 2014, there was absolutely no information regarding audio recordings on Tatoeba except "there is audio for that sentence" (a boolean flag for each sentence). Later on, in 2016, I worked on indicating the contributors of audio: https://github.com/Tatoeba/tatoeba2/pull/1378 (audio metadata for each sentence). Technically speaking, this paved the way for multiple recordings per sentence, although nobody worked on that specifically afterwards. I understand, however, that from a user’s perspective it feels like very little changed.
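To illustrate the difference between a per-sentence boolean and per-recording metadata, here is a hypothetical sketch. The class and field names are assumptions made for illustration only and do not reflect Tatoeba's actual database schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AudioRecording:
    # Metadata attached to one recording (hypothetical fields).
    contributor: str
    license: str

@dataclass
class Sentence:
    text: str
    # A list of recordings instead of a single "has audio" boolean,
    # which naturally allows several recordings per sentence.
    recordings: List[AudioRecording] = field(default_factory=list)

    @property
    def has_audio(self) -> bool:
        # The old boolean can still be derived from the richer model.
        return bool(self.recordings)

sentence = Sentence("Hello.")
sentence.recordings.append(AudioRecording("alice", "CC BY 4.0"))
sentence.recordings.append(AudioRecording("bob", "CC BY-NC 4.0"))
```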

Note that anyone is welcome to contribute code. If you have programming skills, or if you know somebody who does and can convince this person to help, more features could be added.

gillux, August 14, 2021 at 1:15:31 PM UTC

I am a bit late to the party but I’d like to add something.

I can think of two ways near-duplicate sentences get in my way when I use the search.

1. I am looking for various uses and contexts for a given keyword, but results are cluttered with near-duplicates, so I need to scroll and click "next page" until I find a sufficiently different sentence. Note that the default sort order "Relevance" makes it all the more likely that near-duplicates are going to follow one another. As an experienced user, I know that I can use the "Random" sort order to get around that problem, but I really don’t think it’s obvious for newcomers.

2. To make matters worse, sentences showing in the results will likely have near-duplicates appearing as indirect translations too, which duplicates information all over. For example, given near-duplicate sentences A, B and C, all translations of X and all showing in the results, we see:

- A
Translations
-- X
Translations of translations
-- B
-- C

- B
Translations
-- X
Translations of translations
-- A
-- C

- C
Translations
-- X
Translations of translations
-- A
-- B

The problem is that we are seeing A, B and C three times each. The information is duplicated all around, and it feels messy and overloaded.
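The count can be checked with a small sketch that reproduces the listing above. The `links` mapping and the helper functions are hypothetical and only model the example with A, B, C and X; they are not how the search results are actually built.

```python
from collections import Counter

# Direct translation links from the example: A, B and C all translate X.
links = {
    "A": {"X"},
    "B": {"X"},
    "C": {"X"},
    "X": {"A", "B", "C"},
}

def indirect_translations(sentence, links):
    """Translations of translations, excluding the sentence and its direct translations."""
    direct = links[sentence]
    two_hops = set()
    for translation in direct:
        two_hops |= links[translation]
    return two_hops - direct - {sentence}

def count_appearances(results, links):
    """Count how often each sentence is displayed across all result blocks."""
    counts = Counter()
    for sentence in results:
        counts[sentence] += 1                                   # the result itself
        counts.update(links[sentence])                          # its translations
        counts.update(indirect_translations(sentence, links))   # translations of translations
    return counts

counts = count_appearances(["A", "B", "C"], links)
```

With three near-duplicates, each of A, B and C is displayed three times, and X is displayed three times as well.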

I don’t think near-duplicates are a problem per se, but the way we display search results sometimes makes me feel that they are getting in my way. I think it’s just a matter of information display. If we had a way to automatically cluster near-duplicates (maybe with some AI), we could display them as groups, so that you can either browse into them or filter them out depending on your use case.
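One possible way to approach such clustering, sketched here without any AI: compare sentences pairwise with a string similarity ratio and merge those above a threshold into groups. This is only an assumption about how it could be done. The threshold value, the function names and the sample sentences are hypothetical, and the pairwise comparison would not scale to a large corpus as written.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_near_duplicates(sentences, threshold=0.85):
    """Group sentences whose pairwise similarity reaches the threshold (union-find)."""
    parent = list(range(len(sentences)))

    def find(i):
        # Follow parent pointers to the group root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if similarity(sentences[i], sentences[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, sentence in enumerate(sentences):
        groups.setdefault(find(i), []).append(sentence)
    return list(groups.values())

clusters = cluster_near_duplicates([
    "Tom is eating an apple.",
    "Tom is eating an apple!",
    "Tom was eating an apple.",
    "The weather is nice today.",
])
```

Each resulting group could then be shown as one collapsed entry in the search results, or filtered down to a single representative sentence.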

gillux, May 19, 2021 at 9:50:08 PM UTC

This problem (and especially Wu/Shanghainese) has been discussed and partially solved in this thread: https://github.com/Tatoeba/tatoeba2/issues/1670 but there is still some work to do. As much as we’d like to fix all the language names, each renaming needs to be studied with insight from native speakers and contributors. I suggest contacting sabretou if you want to help or inquire about it.

gillux, May 17, 2021 at 9:22:21 PM UTC

Thank you for helping us with this! I updated tatoeba.org today so your fixes are now online.

gillux, May 17, 2021 at 9:20:05 PM UTC

What a great idea! I am sure you will do great. You can inspire the audience with your love and dedication to the project.

gillux, May 17, 2021 at 6:43:23 PM UTC, edited May 17, 2021 at 6:43:36 PM UTC

This was also a temporary error. Thank you for letting us know!

gillux, May 13, 2021 at 7:39:48 PM UTC

I took note of your need.

gillux, May 11, 2021 at 12:12:42 PM UTC

The video on that page, though dated, can also give you some answers: https://tatoeba.org/fra/about

gillux, May 11, 2021 at 12:08:26 PM UTC

I looked into this problem. It happens because of a mistranslation in the Spanish UI. The date string is translated here: https://www.transifex.com/tatoe...urce/186665868 (this link requires a Transifex account). The string for the total number of sentences is here: https://www.transifex.com/tatoe...urce/163036559

You are welcome to improve the Spanish translation if you’d like. See the instructions here: https://en.wiki.tatoeba.org/art...ce-translation