
gillux's messages on the Wall (total 381)

gillux
yesterday
I'm against it.

You say that "the first two cases can be considered incorrect, and the last two correct". I disagree, because unfortunately things are not that simple. The trouble is that (as far as I know) there is no normative body for typography. In other words, nobody has the authority to say "this is wrong" or "this is right". There are only conventions. For example, one can look at how the Imprimerie nationale does it, because they really know their stuff, but they are not the ones who dictate the rules. In fact, nobody dictates any rules; there are only usages and conventions, followed by some and not by others. For example, Quebecers prefer to omit the space before double punctuation marks:
http://www.guylabbe.ca/blog/reg...ec-france.html

In this context, it is difficult to assert that one thing is correct and another is not. By contrast, spelling and grammar are standardized in France by the Académie française, which publishes its dictionary and its texts in the Journal officiel, and one can rely on these to say whether a word is correct or not (even if not everyone agrees with them).

That is why, even though the idea of normalizing typography in the French corpus is tempting because it simplifies things (and as a developer of the site, I like it when things are simple), I oppose it. It would amount to imposing one way of writing French and ignoring the others, which are nevertheless in use and not incorrect. Yet I think that on Tatoeba, we rather aim to include all usages. It's a matter of principle. To me, it's a bit like trying to impose simplified characters in Mandarin (even though the case of Mandarin is much more contentious).

I am aware that the plurality of typographies hinders the detection of duplicates by Horus. But to me, the problem does not come from the corpus; it comes from Horus, or from the way we organize the corpus. We should not adapt the corpus to the tools; it is the tools that should adapt. I'm willing to give it a try if I have the time, but another problem with typography is that it is infinitely complex. It looks simple when we're talking about spaces before exclamation marks, but add just dialogues, nested quotations and italics, and it becomes a whole different ball game! I invite you to read Lacroux's Orthotypographie to convince yourself.
gillux
12 days ago
The search feature is back!
gillux
2018-07-19 19:26
Thank you for reporting the problem, PaulP. I added it on our bugtracker: https://github.com/Tatoeba/tatoeba2/issues/1614
gillux
2018-07-09 15:04
Sorry, my mistake! The problem should be fixed now.
gillux
2018-07-04 22:25 - 2018-07-04 22:29
Note that we currently ignore the Arabic question mark as well as other punctuation characters. The complete list of code points currently treated as part of words (not ignored) can be found here: https://github.com/Tatoeba/tato...ll.php#L80-L82

The meaning of a line like:

'U+621..U+63a', 'U+640..U+64a'

is: "treat characters from code point U+621 to U+63A as part of words, then ignore characters from U+63B to U+63F, and then treat characters from U+640 to U+64A as part of words."
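
To make the range notation concrete, here is a minimal Python sketch that reads such a line and tests whether a character falls inside one of the ranges. It is only an illustration: the function names `parse_ranges` and `is_word_char` are made up for this example and are not part of the Tatoeba code or of the search engine.

```python
def parse_ranges(spec):
    """Turn "'U+621..U+63a', 'U+640..U+64a'" into a list of (first, last) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip().strip("'")
        if not part:
            continue
        first, _, last = part.partition("..")
        last = last or first  # a lone code point is a one-element range
        # Drop the "U+" prefix and read the rest as hexadecimal.
        ranges.append((int(first[2:], 16), int(last[2:], 16)))
    return ranges


def is_word_char(ch, ranges):
    """True if the character's code point lies inside any of the ranges."""
    code_point = ord(ch)
    return any(first <= code_point <= last for first, last in ranges)


RANGES = parse_ranges("'U+621..U+63a', 'U+640..U+64a'")
```

With these ranges, `is_word_char(chr(0x628), RANGES)` is true, while `is_word_char(chr(0x63B), RANGES)` is false because U+63B sits in the gap between the two ranges.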
gillux
2018-07-04 22:16
> I think that the punctuation marks (especially the question mark) shouldn't be ignored.

I’m surprised. If the meaning of the Arabic question mark is the same as in English, it should be ignored. Let me explain in more detail the effect of ignoring a character in the search.

For each sentence, the search engine makes a list of all the words it contains. This way, when you look up a word, it can quickly find all the sentences that include it. This process of extracting all the words contained in a sentence is done by using two lists of characters. One list (let’s call it I) consists of all the characters a word can include. Another list (let’s call it E) consists of all the characters that are not part of any word. List E holds the characters that separate words (such as spaces and punctuation), while list I holds the ones that make up words (such as letters). Now let’s have a look at a concrete example:

List I = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
List E = .:;,?!'()[]- (plus the space character)
Sentence = I'm old-fashioned?

In that sentence, the following 4 characters are in list E: '-? (and space).
All the rest is in list I.
Thus, the search engine finds 4 words: I, m, old, fashioned.
It works as expected because people can find the sentence using one of these 4 words.

Now, let’s take the same sentence and remove the question mark from list E. The search engine now finds 4 words: I, m, old, fashioned?
Here, the question mark is treated as a normal character, just like a letter. It’s part of the word. Consequently, if you look up "fashioned", you won’t find the sentence. You have to look up "fashioned?" instead.
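
The two cases above can be reproduced with a small Python sketch. It is a simplification, not the actual search engine code: the function name `extract_words` is invented for this example, and every character that is not in list I is treated as a separator, which plays the role of list E.

```python
import string

# List I from the example: the characters a word can include.
LIST_I = set(string.ascii_letters)


def extract_words(sentence, word_chars):
    """Split a sentence into words; any character outside word_chars separates words."""
    words, current = [], []
    for ch in sentence:
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A separator ends the word being built.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

`extract_words("I'm old-fashioned?", LIST_I)` gives `['I', 'm', 'old', 'fashioned']`. Moving the question mark into list I, as in `extract_words("I'm old-fashioned?", LIST_I | {"?"})`, gives `['I', 'm', 'old', 'fashioned?']`, so a lookup of "fashioned" no longer matches.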

When we say "character X should be ignored", it means "character X should be in list E". If you (or anybody else) can use that Wikipedia article as a reference list to sort out which characters should be in list E or list I, we can improve the search of Arabic sentences.
gillux
2018-07-03 20:28
Thank you odexed. Your comment suggests that we shouldn’t blindly ignore all the diacritics.

> The best solution to this problem would be to make the search engine adjustable

I agree, but it’s not possible at the moment.
gillux
2018-07-03 14:53
Thank you for reporting this, OsoHombre. Yes, it’s possible to tweak the search engine to ignore Arabic diacritics. But we need your help because we don’t know Arabic, so we don’t know exactly what we’re doing when we try to make such changes.

First of all, you need to be aware that if we make the search engine ignore diacritics, visitors won’t be able to look up only "الشعب" or to look up only "الشّعْبُ", because these words would be treated as if they were the same word. In other words, searching for either of them would produce the same results, and these results would include sentences with both spellings. So we need to make sure that such a change doesn’t get in the way of the many people who might want to look up only "الشّعْبُ", for example. Now, if you’re sure that ignoring diacritics is fine, we need a complete list of all the diacritics that need to be ignored, and we’ll tweak the search engine accordingly.
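
To show what "treated as the same word" means, here is a rough Python sketch that removes combining marks before comparing two words. It assumes, purely for illustration, that the diacritics to ignore are exactly the characters in Unicode category Mn (nonspacing marks); whether that is the right set for Arabic is precisely the open question above. The function name `strip_diacritics` is made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Drop every nonspacing mark (Unicode category Mn) from the text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# The two spellings from the message, written as escapes to avoid display issues.
PLAIN = "\u0627\u0644\u0634\u0639\u0628"
VOCALIZED = "\u0627\u0644\u0634\u0651\u0639\u0652\u0628\u064f"
```

Here `strip_diacritics(VOCALIZED) == PLAIN` holds, so a search that normalizes both the query and the sentences this way can no longer tell the two spellings apart.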
gillux
2018-07-02 14:34
We ran into an update problem that prevented about half of the members from logging in normally for the past 23 hours. The problem should be fixed now. Sorry for the inconvenience.

When trying to log in, members affected by the problem were being told that their password was wrong, even though it was correct. One way to circumvent the problem was to use the "Forgot password" feature to reset your password. If you did so, don’t forget to change it after successfully logging in.
gillux
2018-06-29 21:32
Hello BRW and welcome,

It’s always good to get opinions from new contributors because you’re not yet too "used" to the quirks of this website. :-)

I think you have a point here. Indentation could be used to better show the relationships between sentences. However, our structure is not a tree but a graph. In your example, let’s say there is a sentence C that is a translation of both A and B. It makes C a "translation of translation" of XYZ, but then where should we show it? Under A or B? Or maybe under both of them?
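
The difficulty can be sketched in a few lines of Python. The sentence names XYZ, A, B and C come from the example above; the dictionary layout and the function name `indirect_translations` are assumptions made for this illustration and do not reflect how Tatoeba stores its links.

```python
# Undirected translation links: each sentence maps to its direct translations.
LINKS = {
    "XYZ": {"A", "B"},
    "A": {"XYZ", "C"},
    "B": {"XYZ", "C"},
    "C": {"A", "B"},
}


def indirect_translations(sentence, links):
    """Map each translation of a translation to the direct translations leading to it."""
    direct = links[sentence]
    paths = {}
    for via in sorted(direct):
        for target in sorted(links[via]):
            if target != sentence and target not in direct:
                paths.setdefault(target, []).append(via)
    return paths
```

`indirect_translations("XYZ", LINKS)` returns `{'C': ['A', 'B']}`: C has two parents, so an indented tree would have to show it twice or pick one of them arbitrarily.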

About your other suggestion, the content of a sentence is already shown when hovering over a sentence number (for example in your initial post in this thread). About showing the number when hovering over text, can you explain in which kind of situation it would be useful to you?
gillux
2018-06-18 16:52
This would definitely be a useful feature. Unfortunately, it’s not easy to implement technically, because of the way the search is currently performed. I created an issue on our bugtracker to keep track of your suggestion: https://github.com/Tatoeba/tatoeba2/issues/1576
gillux
2018-06-02 07:24
Thanks Trang for hiring me! It’s an honor for me to be given the opportunity to work for Tatoeba.

For those who don’t know me, I volunteered to help improve the website, especially in 2014 and 2015. I mainly worked on improving the sentence search functionality and the integration of alternative scripts and transcriptions (for languages that have several writing systems, like Chinese, Japanese, Uzbek etc.). I also participated in the maintenance of the server, and to a lesser extent, I worked on the audio recordings side, especially to give more credit on the website to contributors who record themselves. Finally, as a member of the website, I am also a modest contributor to the French corpus.

In the beginning of 2016, I stopped contributing to Tatoeba to focus on other activities, and I’m now back as an employee. I was hired as part of a collaboration between Tatoeba and Mozilla for their Common Voice project[1]. Mozilla wants to be able to use Tatoeba's sentences, but a lot of work has to be done to make this technically and legally possible. I will work with Trang, who can now also devote more time to Tatoeba. Of course, any voluntary contribution is also welcome.

I think this collaboration with Mozilla can bring a lot to Tatoeba. Although some projects do use our corpus[2], in practice it’s rather difficult and impractical to use, because of numerous technical (and sometimes legal) obstacles. If we can make our corpus easier for Mozilla to use, then all other projects that use it or would like to use it will benefit from these improvements. Tatoeba could thus become a better-known and more widely used resource, and this would push us to be more demanding of ourselves. We also want to eventually improve the quality of the sentences, and I think this will make much more sense when we have more people asking for such quality.

1. https://tatoeba.org/wall/show_m...#message_29186
2. http://a4esl.org/temporary/tatoeba/links.html
gillux
2017-02-17 03:50
I strongly believe that we should not change our way of writing sentences for technical reasons. Programs should adapt to languages, not the opposite.

How about relating near-duplicate sentences with a fuzzy matching algorithm? So that for example, on a given sentence page, one could see a list of near-duplicates, along with their translations. I believe such an algorithm could be quite effective, even if it can’t be perfect.
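
As a rough idea of what such fuzzy matching could look like, here is a small Python sketch based on `difflib.SequenceMatcher` from the standard library. It is one possible technique, not an existing Tatoeba feature; the function names and the 0.9 threshold are arbitrary choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, ignoring letter case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def near_duplicates(sentence, corpus, threshold=0.9):
    """Return the sentences of the corpus that are close to the given one, best first."""
    scored = [
        (similarity(sentence, other), other)
        for other in corpus
        if other != sentence
    ]
    return [other for score, other in sorted(scored, reverse=True) if score >= threshold]
```

For the corpus `["Quelle surprise !", "Quelle surprise!", "Il pleut."]`, `near_duplicates("Quelle surprise !", corpus)` returns `['Quelle surprise!']`. Comparing every sentence with every other one this way would be far too slow on a large corpus, so a real implementation would need some kind of index to narrow down the candidates first.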
gillux
2017-02-13 16:48
Are you using the new address https://tatoeba.org/audio/import ?
gillux
2016-12-26 06:39
*** Improving search for Chinese (Mandarin), Cantonese and Uzbek sentences ***

TL;DR: If you are knowledgeable in Mandarin or Cantonese, I’d appreciate it if you had a look at this ongoing work: https://github.com/Tatoeba/tatoeba2/pull/1379

On Tatoeba, Chinese sentences can be written using either simplified or traditional characters. While this allows members to use the characters they prefer, it makes it hard to look up sentences, because searching using traditional or simplified characters will only show sentences written as such. So currently, in order to find all the sentences, one has to perform one search using simplified characters and another search using the equivalent traditional characters. Uzbek, which can be written in either Latin or Cyrillic, suffers from the same problem.

Following my previous work on editable transcriptions, I am now trying to address this problem by making it possible to find Chinese and Uzbek sentences regardless of their script.

Additionally, this will make it possible to find sentences by their transcriptions. That is to say, Japanese sentences may be found using kanji readings in kana, Chinese sentences using Pinyin and Cantonese using Jyutping. Regarding this particular point, I’d like to hear the opinion of whoever’s knowledgeable in Chinese and Cantonese about the problems I mentioned there: https://github.com/Tatoeba/tatoeba2/pull/1379
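
The general idea of script-independent search can be sketched in Python as follows: convert both the query and the sentences to one canonical script before comparing them. This is only an illustration and not necessarily how the pull request works. The three-entry table `TRAD_TO_SIMP` is a toy stand-in for a real conversion table, and real traditional/simplified conversion is not always a one-to-one character mapping.

```python
# Toy conversion table (traditional -> simplified); a real one is much larger.
TRAD_TO_SIMP = {"們": "们", "國": "国", "學": "学"}


def to_canonical(text, mapping=TRAD_TO_SIMP):
    """Rewrite the text in the canonical script, leaving unknown characters as-is."""
    return "".join(mapping.get(ch, ch) for ch in text)


def search(query, sentences):
    """Find the sentences containing the query, whatever script each one uses."""
    key = to_canonical(query)
    return [s for s in sentences if key in to_canonical(s)]
```

With `sentences = ["我們是學生。", "我们是学生。"]`, both `search("學生", sentences)` and `search("学生", sentences)` return the two sentences.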
gillux
2016-12-16 05:40
On dev.tatoeba.org, I can see 524 public lists, and the drop-down only includes these 524 lists, under the Other lists section.

I think you got that, but I’m just clarifying: unlisted lists don’t show up in the drop-down.

Selecting a list is just as bad as selecting a language, but people seem to live with that.
gillux
2016-12-16 05:28
> what about making the sentences in an unknown language searchable until a proper language for them has been implemented?

While I think it is technically possible, note that it won’t show any results for languages that use a script that is not yet used in any other language included in Tatoeba.
gillux
2016-12-11 08:05
Hello kamitoki,

> can i get the audio files in one download?

No, we don’t provide such functionality. But if you know a bit of scripting, it’s rather easy to automate the download of the files by using the list of sentences with audio from the Downloads page.
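
As a starting point for such a script, here is a hedged Python sketch that turns the list of sentences with audio into a list of URLs to fetch. Two things are assumptions made for this example: that the downloaded list is tab-separated with the sentence number in the first column, and the URL pattern, which is a placeholder (`example.org`) to be replaced with the real location of the audio files.

```python
import csv
import io

# Placeholder pattern, NOT the real location of the audio files.
URL_TEMPLATE = "https://example.org/audio/{sentence_id}.mp3"


def audio_urls(tsv_text, template=URL_TEMPLATE):
    """Build one URL per row, taking the sentence number from the first column."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    urls = []
    for row in reader:
        # Skip empty rows and rows whose first column is not a number.
        if row and row[0].strip().isdigit():
            urls.append(template.format(sentence_id=row[0].strip()))
    return urls
```

The resulting list can then be handed to any download tool, ideally with a pause between requests so as not to overload the server.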

I’m curious, though, about the reason you wish to download all the audio files at once, despite them being in various languages. What do you want to accomplish? Maybe one could come up with a different solution for your problem if you describe it.
gillux
2016-12-02 12:54
When I say something is a feature, I mean it has been programmed this way on purpose. I’m not saying it’s good or bad. (The emoticon in my previous message was rather sarcastic, as “It’s not a bug, it’s a feature” is a popular line among developers.)

The other problem you’re describing is a feature too. On Tatoeba, on any page you open, you’ll always see the latest keywords you looked up in the search bar. Apparently, this feature was implemented by Trang in the early days of Tatoeba (beginning of 2009) and it’s still there: https://github.com/Tatoeba/tato...c71c7e3ac31dd9
gillux
2016-11-29 12:46
> Problem:
> Tab #1 now shows From: Turkish To: Malay, taking the language settings from tab #2!

It’s not a bug, it’s a feature. :-)

You have a point, though. I like to assign a type of search for each tab, too. For the time being, you may use Firefox’s private browsing. It allows you to open one more session simultaneously with the non-private one.