gillux's Wall messages

gillux April 30, 2022 at 8:34:53 PM UTC

Very nice work :-)

gillux April 24, 2022 at 11:35:20 PM UTC, edited April 30, 2022 at 8:07:06 PM UTC

I checked our webserver logs to see which websites hotlink audio files from Tatoeba. To my surprise, I found 17 unique websites over the past two weeks, mostly online language learning platforms. Many of them do not give proper attribution, and some also violate the non-commercial clause of some audio licenses, as well as the "no offsite use" clause.
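As an illustration of this kind of log check, here is a minimal sketch. It assumes the logs use the common "combined" access log format, that audio files end in ".mp3", and that the site's own domain is tatoeba.org; the function name and these parameters are hypothetical and not taken from Tatoeba's actual setup.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Combined log format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def hotlinking_sites(lines, own_domain="tatoeba.org", audio_ext=".mp3"):
    """Count audio requests per external referring host (assumed log format)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        path = urlsplit(match.group("path")).path
        if not path.endswith(audio_ext):
            continue  # not an audio file
        host = urlsplit(match.group("referer")).hostname
        if host is None:
            continue  # no referer was sent ("-" or empty)
        if host == own_domain or host.endswith("." + own_domain):
            continue  # request came from the site itself
        counts[host] += 1
    return counts
```

Calling `len(hotlinking_sites(open("access.log")))` would then give the number of unique external sites, where "access.log" is a placeholder file name. Referer headers can be missing or forged, so such a count is only an estimate.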

On the one hand, I could contact each of them and ask them to give proper attribution or stop commercial use. As a developer and sysadmin of Tatoeba, I could even prevent hotlinking and effectively break their platforms.

On the other hand, the general goal of Tatoeba is to spread knowledge and I don’t want to mess around just for the sake of some licence restrictions. After all, only the member who created the audio can legally complain.

So should Tatoeba do something about that?

On a side note, I’d like to point out that many of our audio contributors use a Creative Commons licence with a non-commercial restriction. While this intuitively makes sense (nobody wants someone else to make a profit out of something they put a lot of effort into), in practice the line between commercial and non-commercial use is quite blurry, and there are many use cases involving money that are perfectly fair. For example, I would not be able to include such audio in a presentation given at a convention or conference (like Polyglot Conference Global), just because there is an entrance fee. More about that: https://kefletcher.blogspot.com...ommercial.html

gillux February 13, 2022 at 10:43:41 PM UTC

How sad… It is always the best ones who leave first. :'-(

gillux October 15, 2021 at 1:37:24 PM UTC

I’m interested too. :-)

gillux September 14, 2021 at 11:29:17 PM UTC

You are right. We chose to default the license to "no offsite use" at the time we were gathering information about past audio contributions. In the early days of the project, audio used to be contributed without thinking about licenses and without even properly attributing contributors. Around 2016, we worked on adding audio metadata and proper attribution. We contacted past audio contributors and searched through the history of the project. For some audio recordings, though, we weren’t able to find attribution (or the contributor just didn’t choose a license), which is why the legally safe default "no offsite use" was set. However, later on, on the 20th of January 2019, we introduced new Terms of Use (see the link in the footer) that explicitly default audio contributions to CC-BY. Consequently, we should also update the default audio license for all users who accepted these latest terms. @TRANG Can you confirm?
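The defaulting rule proposed above could be sketched as follows. This is only an illustration of the logic under discussion, not Tatoeba's actual code: the function name, the parameters and the license strings used as return values are assumptions.

```python
from typing import Optional

NO_OFFSITE_USE = "No license for offsite use"  # current fallback
CC_BY = "CC-BY"  # default named in the 2019 Terms of Use


def effective_audio_license(chosen_license: Optional[str],
                            accepted_2019_terms: bool) -> str:
    """Return the license to display for a contributor's audio (hypothetical rule)."""
    if chosen_license:
        # An explicit choice by the contributor always wins.
        return chosen_license
    if accepted_2019_terms:
        # Proposed change: users who accepted the new terms default to CC-BY.
        return CC_BY
    # Legally safe default when nothing else is known.
    return NO_OFFSITE_USE
```

Contributors who never accepted the 2019 terms would keep the restrictive default, which matches the reasoning that the safe default exists for recordings whose licensing could not be established.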

gillux September 14, 2021 at 1:21:45 AM UTC

I confirm it is just the audio you have searched for.
It could be a user choice, or it could also be a lack of choice.
Audio recordings contributed by users who did not select a license default to "No license for offsite use".

gillux September 14, 2021 at 1:18:57 AM UTC

These were erroneous lines. I fixed the problem. Thanks!

gillux September 13, 2021 at 11:30:23 PM UTC

Note that, for technical reasons, each query is reported twice, so you need to divide the numbers by two.
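A minimal sketch of that correction, assuming the logged queries have already been loaded as a list of strings (the function name is hypothetical, and how the queries are read from the file is left out):

```python
from collections import Counter


def corrected_query_counts(logged_queries):
    """Halve raw counts, since each query is assumed to be logged twice.

    Returns (counts, odd): counts maps each query to its corrected count,
    and odd lists queries whose raw count was not even, which would mean
    the "logged twice" assumption did not hold for them.
    """
    raw = Counter(logged_queries)
    counts = {}
    odd = []
    for query, n in raw.items():
        half, remainder = divmod(n, 2)
        counts[query] = half + remainder  # round up for odd raw counts
        if remainder:
            odd.append(query)
    return counts, odd
```

For example, a query that appears four times in the log would be counted as two real searches.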

gillux September 13, 2021 at 11:11:50 PM UTC

Yes, it would be better to schedule this task automatically. Please open an issue on GitHub so that we don’t forget to do it.

gillux September 11, 2021 at 2:02:44 PM UTC

I lost the command line I used at that time. I did it again today and uploaded the command to the repository. It is just a quick and dirty sed one-liner. We could possibly have Manticore produce the logs directly in a format usable by Tatominer, which would make things easier.

By the way, this new command does not produce records with empty queries, unlike before.

gillux September 11, 2021 at 1:55:19 PM UTC

I updated the queries.csv file today. You can download the same file again to get the updated version.

By the way, if you'd like I can also include the number of results each query returned at the time it was run.

gillux September 11, 2021 at 1:15:14 AM UTC

Well, some work has been done towards introducing that feature. When the issue was opened, in 2014, there was absolutely no information regarding audio recordings on Tatoeba except "there is audio for that sentence" (a boolean flag for each sentence). Later on, in 2016, I worked on indicating the contributors of audio: https://github.com/Tatoeba/tatoeba2/pull/1378 (audio metadata for each sentence). Technically speaking, this paved the way for multiple recordings per sentence, although nobody worked on that specifically afterwards. I understand, however, that from a user’s perspective it feels like very little has changed.
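To make the difference concrete, here is a purely illustrative sketch of the two data models. The class and field names are hypothetical and do not reflect the actual Tatoeba database schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SentenceWithFlag:
    """Situation in 2014: only a yes/no flag per sentence."""
    text: str
    has_audio: bool = False


@dataclass
class AudioRecording:
    """Per-recording metadata, as introduced around 2016 (fields assumed)."""
    contributor: str
    license: Optional[str] = None
    attribution_url: Optional[str] = None


@dataclass
class SentenceWithRecordings:
    """Metadata stored per recording, so several recordings become possible."""
    text: str
    recordings: List[AudioRecording] = field(default_factory=list)

    @property
    def has_audio(self) -> bool:
        # The old flag can be derived from the list of recordings.
        return bool(self.recordings)
```

Once metadata is attached to a recording rather than to the sentence, allowing a second recording for the same sentence is mostly a matter of appending to the list, which is the sense in which the 2016 work paved the way.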

Note that anyone is welcome to contribute code. If you have programming skills, or if you know somebody who does and can convince this person to help, more features could be added.

gillux August 14, 2021 at 1:15:31 PM UTC

I am a bit late to the party but I’d like to add something.

I can think of two ways near-duplicate sentences get in my way when I use the search.

1. I am looking for various uses and contexts for a given keyword, but results are cluttered with near-duplicates, so I need to scroll and click "next page" until I find a sufficiently different sentence. Note that the default sort order "Relevance" makes it all the more likely that near-duplicates are going to follow one another. As an experienced user, I know that I can use the "Random" sort order to get around that problem, but I really don’t think it’s obvious for newcomers.

2. To make matters worse, sentences showing in the results will likely have near-duplicates appearing as indirect translations too, which duplicates information all over. For example, given near-duplicate sentences A, B and C, all translations of X, showing in the results, we see:

- A
Translations
-- X
Translations of translations
-- B
-- C

- B
Translations
-- X
Translations of translations
-- A
-- C

- C
Translations
-- X
Translations of translations
-- A
-- B

The problem is that we are seeing A, B and C three times each. The information is duplicated all around, which feels messy and overloaded.

I don’t think near-duplicates are a problem per se, but the way we display search results sometimes makes me feel that they are getting in my way. But I think it’s just a matter of information display. If we had a way to automatically cluster near-duplicates (maybe with some AI), maybe we could somehow display them as groups, so that you can either browse into them or filter them out depending on your use case.
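As a rough illustration of the grouping idea, here is a minimal sketch that clusters near-duplicates using plain string similarity from Python's standard library instead of any AI model. The function name, the greedy strategy and the 0.8 threshold are assumptions chosen for the example, not a proposal for how Tatoeba should implement it.

```python
from difflib import SequenceMatcher


def cluster_near_duplicates(sentences, threshold=0.8):
    """Greedily group sentences whose similarity to a cluster's first
    member is at least `threshold` (0.0 = unrelated, 1.0 = identical)."""
    clusters = []
    for sentence in sentences:
        for cluster in clusters:
            representative = cluster[0]
            ratio = SequenceMatcher(
                None, sentence.lower(), representative.lower()
            ).ratio()
            if ratio >= threshold:
                cluster.append(sentence)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([sentence])
    return clusters
```

Search results could then show one representative per cluster, with the rest available on demand. Character-level similarity is crude (it ignores meaning and compares every sentence against every cluster), so this only shows the display idea, not a scalable method.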

gillux May 19, 2021 at 9:50:08 PM UTC

This problem (and especially wuu/Shanghainese) has been discussed and partially solved in this thread: https://github.com/Tatoeba/tatoeba2/issues/1670 There is still some work to do, though. As much as we’d like to fix all the language names, each renaming needs to be studied with insight from native speakers and contributors. I suggest contacting sabretou if you want to help or inquire about it.

gillux May 17, 2021 at 9:22:21 PM UTC

Thank you for helping us with this! I updated tatoeba.org today so your fixes are now online.

gillux May 17, 2021 at 9:20:05 PM UTC

What a great idea! I am sure you will do great. You can inspire the audience with your love and dedication to the project.

gillux May 17, 2021 at 6:43:23 PM UTC, edited May 17, 2021 at 6:43:36 PM UTC

This was also a temporary error. Thank you for letting us know!

gillux May 13, 2021 at 7:39:48 PM UTC

I took note of your need.

gillux May 11, 2021 at 12:12:42 PM UTC

The video on that page, though dated, can also give you some answers: https://tatoeba.org/fra/about

gillux May 11, 2021 at 12:08:26 PM UTC

I looked into this problem. It happens because of a mistranslation in the Spanish UI. The date is translated here: https://www.transifex.com/tatoe...urce/186665868 (this link requires a Transifex account). The total number of sentences is here: https://www.transifex.com/tatoe...urce/163036559

You are welcome to improve the Spanish translation if you’d like. See the instructions here: https://en.wiki.tatoeba.org/art...ce-translation

gillux's messages on the Wall (total 595)
