
gillux's messages on the Wall (total 390)

gillux
19 days ago - 19 days ago
Note to contributors: I’ve improved the language autodetection feature, so it should work better now. It should also become more accurate over time.

Long story:

For those who don’t know, when you add a new sentence and select "autodetect" for the language, a tool called Tatodetect guesses the language of your sentence. Tatodetect works by running a statistical analysis of the Tatoeba corpus to learn which words are used in which languages. So basically, the more sentences there are in a given language, the more accurately Tatodetect can autodetect it.
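
To give a rough idea of what "learning which words are used in which languages" can look like, here is a minimal sketch in Python. It is not Tatodetect's actual code: the function names, the scoring rule (sum of relative word frequencies) and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Count how often each word appears in each language.

    corpus: iterable of (language_code, sentence) pairs.
    """
    counts = defaultdict(Counter)
    for lang, sentence in corpus:
        counts[lang].update(sentence.lower().split())
    return counts

def guess_language(sentence, counts):
    """Return the language whose known words best cover the sentence."""
    words = sentence.lower().split()
    scores = {}
    for lang, word_counts in counts.items():
        total = sum(word_counts.values())
        # Sum of the relative frequencies of the sentence's words in this language.
        scores[lang] = sum(word_counts[w] / total for w in words)
    return max(scores, key=scores.get)

# Hypothetical toy corpus, not real Tatoeba data.
toy_corpus = [
    ("eng", "the cat is on the table"),
    ("eng", "I like the book"),
    ("fra", "le chat est sur la table"),
    ("fra", "j'aime le livre"),
]
model = train(toy_corpus)
guess = guess_language("the book is on the table", model)
```

With such a scheme, every sentence added to the corpus changes the word counts, which is why the model only improves when the analysis is run again.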

However, there was a limitation: Tatodetect cannot learn from new sentences unless it performs a new (costly) analysis of the corpus. As a result, we had to manually start a new analysis of the corpus every now and then, so that Tatodetect could learn from newly added sentences. The last analysis was from June 2017. I ran a new one today and I automated this process. The corpus is now going to be re-analysed on a weekly basis.
gillux
30 days ago
Thanks for the improvements, it feels quite usable already.

About the profile languages. How about just bringing them to the top of the list, like in the current dropdown? This way, I can still use the mouse or tap on a touchscreen to easily select one of my profile languages, while the person in your example won’t be confused by seeing only two options. You could also use a different background color for the profile languages, to make them stand out from the rest of the list.

Other than that, I find the interline space a bit too large inside the list. After clicking on the field, the drop down shows "Any language" + 4 languages (the last one slightly truncated), while I think there is enough space for 6 or 7 languages there. This would be a significant improvement if you implement what I said about the profile languages.
gillux
2018-09-14 07:46
That’s a good point. This could make that new dropdown harder to use on devices without a physical keyboard, for example.
gillux
2018-09-13 06:41
Great! It is definitely more comfortable to use.

A few comments:

The highlight is only shown when the language starts with the entered text. For example, using the English UI, typing "rus" highlights "Rus" in "Russian" and "Rusyn" but not in other entries, like "Belarusian".

The sorting of the suggested values could be improved. In the above example, I think "Russian" should show up above "Belarusian".
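
To illustrate the two comments above, here is a small Python sketch of what I have in mind. It is only an assumption about the desired behaviour, not the code of the actual dropdown; the function names and the `*` marker are made up for the example.

```python
def rank_suggestions(query, names):
    """Return the names containing query: prefix matches first, then alphabetical."""
    q = query.lower()
    matches = [n for n in names if q in n.lower()]
    # False sorts before True, so names starting with the query come first.
    return sorted(matches, key=lambda n: (not n.lower().startswith(q), n.lower()))

def highlight(query, name, mark="*"):
    """Wrap the first case-insensitive occurrence of query in name with marks."""
    pos = name.lower().find(query.lower())
    if pos < 0 or not query:
        return name
    end = pos + len(query)
    return name[:pos] + mark + name[pos:end] + mark + name[end:]

languages = ["Belarusian", "Russian", "Rusyn", "French"]
ranked = rank_suggestions("rus", languages)
marked = [highlight("rus", name) for name in ranked]
```

Here "Russian" and "Rusyn" are listed before "Belarusian", and the matched part is highlighted in all three, wherever it occurs in the name.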

I can type anything that is not a language name and press the search button. The result is that whatever wasn’t a language name is treated as "any language". This is quite misleading. I think the form shouldn’t allow clicking the search button without a properly selected value as language.

On the search bar, the keyword field, the language dropdowns and the search button used to have a consistent height. Now, the dropdowns are taller than the keyword field and the button.
gillux
2018-09-10 15:36
Not that I want to argue about whether we should implement this feature or not, but I’m curious about the way you proofread sentences. I am not a corpus maintainer, so I don’t know what it takes to proofread many many sentences.

As a native speaker of French, I almost only add French sentences, but it doesn’t mean they are free of errors. I regularly get comments about mistakes here and there. It’s mostly more about orthography than naturalness, but still. This makes me think that the amount of trust I’d put in a sentence has more to do with the number and quality of proofreads than the nativeness of the author.

So my point is: shouldn’t sentences be equally checked whether they are from native speakers or not?
gillux
2018-09-10 15:21
As a general rule, as long as you can listen to something, it can be downloaded. It’s just a matter of whether the website makes it user-friendly or not. On Tatoeba, it isn’t user-friendly (yet), and the reasons include what Guybrush88 and deniko said.
gillux
2018-09-06 18:27
I totally agree with what deniko said.

I think that formality is just one of the many aspects of a language that can be confusing for learners the first time they see it. But once you get it, it’s not a problem any more. Correct me if I’m wrong, but what you said can apply to, say, future tense. It’s confusing for beginners who only know about the present tense to be shown sentences in future tense, so let’s separate sentences by tense (actually, some people are doing this already, using tags like https://tatoeba.org/jpn/tags/sh...ith_tag/6704).

For more information about how to add tags, see https://en.wiki.tatoeba.org/art...w-to-add-tags.

Personally, I wouldn’t make too many assumptions about how my sentences are going to be used and by whom. I don’t like the idea of restraining or changing the way I write sentences just because a non-native speaker might not understand. Quite the contrary, I think Tatoeba is a good place to add colloquial sentences, because there are certainly enough textbooks out there full of formal sentences. Consider the following guidelines, from https://en.wiki.tatoeba.org/art...ow/guidelines:

• We don't want the awkward, unnatural-sounding translations seen in textbooks to help students understand how another language is constructed.
• We want sentences that a native speaker would actually use.
• Remember that others will be using the translation that you make into your own language to study your language.

If you’re still unsure, you can also ask @Silja’s opinion since she’s the corpus maintainer of Finnish.
gillux
2018-09-03 04:09
On WhatsApp, you let the whole group know about your personal phone number by just joining it. I believe some people are not okay with that.
gillux
2018-08-30 06:15
Your arguments are convincing. In particular, you distinguish between writing French and editing French, and I find that way of looking at things interesting. So it would be more about changing the presentation of the content than the content itself, even if in practice this requires modifying the content.

> Would that seem fairer and more acceptable to you?

Yes, but I think it is important to pin down the problem we are trying to solve right now, and not just standardize for the sake of standardizing. If the problem is deduplication, then there is no need to modify the sentences; we can adapt Horus. If the problem is that double punctuation marks end up alone on a new line, then we could change the way they are displayed on the website (on-the-fly replacement). If the problem is that the French corpus loses credibility because of heterogeneous typography, then we should collectively decide on conventions and perhaps adapt the website to make them easier to apply. What exactly are we trying to do, and why?

Note that the corpus is not only used on the website, but also by other projects that download the sentences to reuse them: http://a4esl.org/temporary/tatoeba/links.html
gillux
2018-08-19 05:45
I am against it.

You say that "the first two cases can be considered incorrect, and the last two correct". I disagree, because things are unfortunately not that simple. The trouble is that there is (to my knowledge) no normative authority for typography. In other words, nobody has the authority to say "this is wrong" or "this is right". There are only conventions. We can, for example, look at how the Imprimerie nationale does it, because they really know their stuff, but they are not the ones who dictate the rules. In fact, nobody dictates any rule; there are only usages and conventions, followed by some and not by others. For example, Quebecers prefer to omit the space before double punctuation marks:
http://www.guylabbe.ca/blog/reg...ec-france.html

In this context, it is hard to assert that this is correct and that is not. By contrast, spelling and grammar are standardized in France by the Académie française, which publishes its dictionary and its texts in the Journal officiel, and we can rely on those to say whether a word is correct or not (even if not everyone agrees with them).

That is why, even though the idea of normalizing typography in the French corpus is tempting because it simplifies things (and as a developer of the website, I like it when things are simple), I am opposed to it. It would amount to imposing one way of writing French and ignoring the others, which are nevertheless in use and not incorrect. Yet I think that on Tatoeba, we aim rather to include all usages. It is a matter of principle. To me, it is a bit like trying to impose simplified characters in Mandarin (even though the case of Mandarin is much more controversial).

I am aware that the plurality of typographies hinders duplicate detection by Horus. But to me, the problem does not come from the corpus; it comes from Horus, or from the way we organize the corpus. We should not adapt the corpus to the tools; it is up to the tools to adapt. I am willing to give it a try, if I have time, but another problem with typography is that it is infinitely complex. It looks simple when we talk about spaces before exclamation marks, but add just dialogues, nested quotations and italics to that, and it becomes a whole different ball game! I invite you to read Orthotypographie by Lacroux to convince yourself.
gillux
2018-08-07 21:39
The search feature is back!
gillux
2018-07-19 19:26
Thank you for reporting the problem, PaulP. I added it on our bugtracker: https://github.com/Tatoeba/tatoeba2/issues/1614
gillux
2018-07-09 15:04
Sorry, my mistake! The problem should be fixed now.
gillux
2018-07-04 22:25 - 2018-07-04 22:29
Note that we currently ignore the Arabic question mark as well as other punctuation characters. The complete list of code points currently treated as part of words (not ignored) can be found here: https://github.com/Tatoeba/tato...ll.php#L80-L82

The meaning of a line like:

'U+621..U+63a', 'U+640..U+64a'

is: "treat characters from code point U+621 to U+63A as part of words, then ignore characters from U+63B to U+63F, and then treat characters from U+640 to U+64A as part of words".
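
If it helps, here is a small Python sketch of how such a line can be read. It only illustrates the notation; the function names are made up, and this is not how the search engine itself is implemented.

```python
def parse_ranges(specs):
    """Turn strings like 'U+621..U+63a' into (start, end) code point pairs."""
    ranges = []
    for spec in specs:
        start, end = spec.split("..")
        # Drop the 'U+' prefix and read the rest as hexadecimal.
        ranges.append((int(start[2:], 16), int(end[2:], 16)))
    return ranges

def is_word_char(ch, ranges):
    """True if the character's code point falls inside one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)

ranges = parse_ranges(["U+621..U+63a", "U+640..U+64a"])

alef_is_word_char = is_word_char("\u0627", ranges)           # ARABIC LETTER ALEF
question_mark_is_word_char = is_word_char("\u061f", ranges)  # ARABIC QUESTION MARK
```

The Arabic question mark (U+61F) falls outside both ranges, so it is not treated as part of a word.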
gillux
2018-07-04 22:16
> I think that the punctuation marks (especially the question mark) shouldn't be ignored.

I’m surprised. If the meaning of the Arabic question mark is the same as in English, it should be ignored. Let me explain in more detail the effect of ignoring a character in the search.

For each sentence, the search engine makes a list of all the words it contains. This way, when you look up a word, it can quickly find all the sentences that include it. This process of extracting all the words contained in a sentence is done by using two lists of characters. One list (let’s call it I) consists of all the characters a word can include. Another list (let’s call it E) consists of all the characters that are not part of any word. List E contains the characters that separate words (like space, punctuation) while list I contains the ones that make up words (like letters). Now let’s have a look at a concrete example:

List I = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
List E = .:;,?!'()[]- (plus the space character)
Sentence = I'm old-fashioned?

In that sentence, the following 4 characters are in list E: '-? (and space).
All the rest is in list I.
Thus, the search engine finds 4 words: I, m, old, fashioned.
It works as expected because people can find the sentence using one of these 4 words.

Now, let’s take the same sentence and remove the question mark from list E. The search engine now finds 4 words: I, m, old, fashioned?
Here, the question mark is treated as a normal character, just like a letter. It’s part of the word. Consequently, if you look up "fashioned", you won’t find the sentence. You have to look up "fashioned?" instead.
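
For those who prefer code, the example above can be sketched in Python as follows. This is a simplified model of what I described, not the actual search engine; the function name is made up, and any character that is not in list E is simply treated as part of a word.

```python
# List E from the example: the characters that separate words.
LIST_E = set(".:;,?!'()[]- ")

def extract_words(sentence, separators):
    """Split a sentence into words, cutting at every character in separators."""
    words, current = [], []
    for ch in sentence:
        if ch in separators:
            # A separator ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

sentence = "I'm old-fashioned?"
with_question_mark_in_e = extract_words(sentence, LIST_E)
without_question_mark_in_e = extract_words(sentence, LIST_E - {"?"})
```

The first call gives I, m, old, fashioned. The second one gives I, m, old, fashioned? because the question mark is no longer a separator.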

When we say "character X should be ignored", it means "character X should be in list E". If you (or anybody else) can use that Wikipedia article as a reference list to sort out which characters should be in list E or list I, we can improve the search of Arabic sentences.
gillux
2018-07-03 20:28
Thank you odexed. Your comment suggests that we shouldn’t blindly ignore all the diacritics.

> The best solution to this problem would be to make the search engine adjustable

I agree, but it’s not possible at the moment.
gillux
2018-07-03 14:53
Thank you for reporting this, OsoHombre. Yes, it’s possible to tweak the search engine to ignore Arabic diacritics. But we need your help because we don’t know Arabic, so we don’t know exactly what we’re doing when we try to make such changes.

First of all, you need to be aware that if we make the search engine ignore diacritics, visitors won’t be able to look up only "الشعب" or only "الشّعْبُ", because these words would be treated as if they were the same word. In other words, searching for either of them would produce the same results, and these results would include sentences with both spellings. So we need to make sure that such a change doesn’t get in the way of the many people who would like to look up only "الشّعْبُ", for example. Now, if you’re sure that ignoring diacritics is fine, we need a complete list of all the diacritics that need to be ignored, and we’ll tweak the search engine accordingly.
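
To show the effect, here is a rough Python sketch. It assumes that "diacritics" means every character Unicode classifies as a combining mark, which is only a guess on my part; the real list has to come from people who know Arabic. The two strings are written as escapes and stand for a plain and a vocalized spelling like the ones above.

```python
import unicodedata

def strip_diacritics(text):
    """Drop every character that Unicode classifies as a combining mark."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Plain spelling: alef, lam, sheen, ain, beh.
plain = "\u0627\u0644\u0634\u0639\u0628"
# Same letters with shadda (U+0651), sukun (U+0652) and damma (U+064F) added.
vocalized = "\u0627\u0644\u0634\u0651\u0639\u0652\u0628\u064f"

same_after_stripping = strip_diacritics(vocalized) == strip_diacritics(plain)
```

Once the marks are stripped, both spellings become the same string, so the search engine can no longer tell them apart.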
gillux
2018-07-02 14:34
We ran into an update problem that prevented about half of the members from logging in normally for the past 23 hours. The problem should be fixed now. Sorry for the inconvenience.

When trying to log in, members affected by the problem were being told that their password was wrong, even though it was correct. One way to circumvent the problem was to use the "Forgot password" feature to reset your password. If you did so, don’t forget to change it after successfully logging in.
gillux
2018-06-29 21:32
Hello BRW and welcome,

It’s always good to get opinions from new contributors because you’re not yet too "used" to the quirks of this website. :-)

I think you have a point here. Indentation could be used to better show the relationships between sentences. However, our structure is not a tree but a graph. In your example, let’s say there is a sentence C that is a translation of both A and B. It makes C a "translation of translation" of XYZ, but then where should we show it? Under A or B? Or maybe under both of them?
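
Here is a small Python sketch of the situation, to make the problem concrete. The sentence names follow your example plus my hypothetical sentence C; the data structure and the function name are made up for illustration and are not how the website stores links.

```python
# Each sentence maps to the set of its direct translations (links go both ways).
translations = {
    "XYZ": {"A", "B"},
    "A": {"XYZ", "C"},
    "B": {"XYZ", "C"},
    "C": {"A", "B"},
}

def indirect_translations(sentence, graph):
    """Map each translation of translation to the direct translations leading to it."""
    via = {}
    for direct in sorted(graph[sentence]):
        for indirect in sorted(graph[direct]):
            # Skip the sentence itself and anything already directly linked to it.
            if indirect != sentence and indirect not in graph[sentence]:
                via.setdefault(indirect, []).append(direct)
    return via

reached = indirect_translations("XYZ", translations)
```

C is reached through both A and B, so an indented tree would have to either pick one parent arbitrarily or show C twice.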

About your other suggestion, the content of a sentence is already shown when hovering over a sentence number (for example in your initial post in this thread). About showing the number when hovering over text, can you explain in which kind of situation it would be useful to you?
gillux
2018-06-18 16:52
This would definitely be a useful feature. Unfortunately, it’s not technically easy to implement it, because of the way the search is currently performed. I created an issue on our bugtracker to keep track of your suggestion: https://github.com/Tatoeba/tatoeba2/issues/1576