
gillux's messages on the Wall (total 442)

gillux
2019-05-20 10:40
Thank you for the numbers, that’s valuable information.

This shows that 90% of the visitors making a search are using the "simple search" (top bar or front page), and 10% the advanced search (advanced search page or "more search criteria" block).

> It seems that when given the choice, people choose in majority to order by words.

However, it’s not a fair comparison, because you can’t tell apart the visitors who actively made a choice (clicking on the dropdown, examining the options and picking one) from those who didn’t (glancing over the dropdown or not even seeing it, and using the default value). Since order=words is the default, I believe it is overrepresented.

I find the number for order=random surprisingly high.
gillux
2019-05-16 14:28
Thanks. I reverted it back to normal.
gillux
2019-05-15 12:35
Yes, we definitely need randomness to be reproducible (and unpredictable, to avoid "rank boost threats") if this is the direction we’re taking. If I give you a search result URL, I expect that you see the same results as me, and that it stays more or less the same for a little time. I believe it is technically possible to produce a random, deterministic and unpredictable order.
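
To illustrate what a random, deterministic and unpredictable order could look like, here is a minimal sketch. It assumes results are identified by integer sentence ids and that the server holds a secret key; `SECRET_KEY`, `shuffle_key` and `stable_random_order` are hypothetical names, not part of Tatoeba's code.

```python
import hashlib
import hmac

# Hypothetical server-side secret: without it, the order cannot be predicted.
SECRET_KEY = b"replace-with-a-real-secret"


def shuffle_key(query: str, sentence_id: int, key: bytes = SECRET_KEY) -> bytes:
    """Return a pseudo-random but reproducible sort key for one result."""
    message = f"{query}\x00{sentence_id}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).digest()


def stable_random_order(query: str, sentence_ids: list[int],
                        key: bytes = SECRET_KEY) -> list[int]:
    """Sort results by keyed hash: same query and same key give the same order."""
    return sorted(sentence_ids, key=lambda sid: shuffle_key(query, sid, key))
```

Two people opening the same search URL would get the same order as long as the key stays the same. Rotating the key from time to time would be one way to have the order change only occasionally.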

I am concerned that boosting sentences having a number of words close to average is going to be detrimental to diversity, because it’s an incentive for contributors to produce standard-sized sentences. Isn’t there a risk of uniformization of the corpus? Or do we actually want more example sentences that are "efficient" and "standard"? I’d like to know @TRANG's opinion on that matter.

The idea of showing newest sentences first (after sorting by exact matches and LCS) is interesting. It surely adds some randomness, but it’s also an incentive to produce new sentences, and it gives more exposure to new or active users.

Since there is no consensus on an alternative to sorting by number of words, for the moment I’m going to change the default search ranking the way I described on the first post of this thread. We may further improve it later on.
gillux
2019-05-14 14:17 - 2019-05-14 14:20
@Thanuir @AlanF_US @CK

Thank you for your feedback. I agree about the relative uselessness of having very short sentences shown first. The idea of randomizing the results within a category, like Thanuir said, is appealing (provided the order is deterministic), but I’m afraid it could be a little bit confusing. I temporarily set up https://dev.tatoeba.org/ like that, please let me know what you think.

Or, if we are to rank using the number of words, what would be the ideal number? Not too long and not too short. It depends on the language, of course. Here are some stats about the average number of words per sentence in every language on Tatoeba: https://gist.github.com/jiru/81...5917dc18325fc2

I wonder if we could use these numbers to boost the ranking of sentences having a number of words close to the average, with a formula like rank = –abs(average – words)
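
As a rough sketch of that formula: the snippet below assumes words can be counted by splitting on whitespace (which does not hold for languages such as Chinese or Japanese), and the average used in the example is a made-up value, not one taken from the stats above. `length_rank` is a hypothetical name.

```python
def length_rank(sentence: str, average_words: float) -> float:
    """rank = -abs(average - words): the closer to the average, the higher."""
    words = len(sentence.split())  # assumption: whitespace-separated words
    return -abs(average_words - words)


# Example with a made-up average of 6 words per sentence.
sentences = [
    "Go.",
    "I will go there tomorrow morning.",
    "He said that he would go there as soon as the rain stopped falling.",
]
ranked = sorted(sentences, key=lambda s: length_rank(s, 6.0), reverse=True)
```

With this ordering, both very short and very long sentences sink towards the bottom.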
gillux
2019-05-14 06:39
I’m using Firefox. I do notice a delay of one to two seconds between the moment the empty green banner shows and the moment the search fields appear. I think we can work on reducing that delay. I already wrote down some ideas in a Github issue: https://github.com/Tatoeba/tatoeba2/issues/1891
gillux
2019-05-13 10:32
Dear Tatoebians,

Recently I have been working on improving the export capabilities of Tatoeba. I created the necessary code base to provide customized exports (as opposed to our generic exports of the Downloads page). In the future, I plan to allow various kinds of exports, like all sentences of a specific language, of a specific user, having a specific license, sentence pairs… But for now, I just started with implementing list exports.

You can check this out on our development website https://dev.tatoeba.org/. Log in with the same credentials as here, and then go to the page https://dev.tatoeba.org/exports/index. From there, you should be able to export and download any list you have access to, no matter how big it is.

Feedback is very welcome.
gillux
2019-05-13 05:05
I’m not familiar with natural language processing. What are NMT models? Can you elaborate on what you’re trying to achieve, as opposed to the features you’d like to have? Do you want to download the search results? If so, what filtering criteria are you using?

Note that because the advanced search is limited to 1000 results, it will always give you a partial view of the corpus, so I don’t think it’s a good way to go.
gillux
2019-05-12 14:15
What is an ML experience?

I am currently working on removing this 100-sentence limit so that any list can be downloaded. You can follow my progress on this page: https://github.com/Tatoeba/tatoeba2/issues/927

The lists can be downloaded as CSV files (the same file format as the files on the Downloads page). I would like to know if this format is okay for your use case.
gillux
2019-05-06 09:50
Thanks for letting us know, you’re right that it doesn’t work like it used to. You can follow the progress of the resolution of that issue here: https://github.com/Tatoeba/tatoeba2/issues/1873
gillux
2019-04-26 06:55 - 2019-04-27 02:46
Recently I’ve been working on improving the language autodetection feature.

When you submit a new sentence or translation, Tatoeba can autodetect the language. But sometimes it guesses it wrong, so you need to manually fix the language flag. I improved the accuracy of the algorithm, so Tatoeba should now be better at guessing the language.

The algorithm learns from existing sentences to guess the language. So the more sentences we have for a given language, the better the algorithm becomes at guessing new sentences of that language.

Note that some languages are more difficult to guess than others. For example, the algorithm sometimes confuses Russian with Ukrainian, or Berber with Kabyle, or other languages. Feedback is welcome.
gillux
2019-04-26 06:41
Thanks for your feedback. I am confident about the effectiveness of the detection for the other languages, so I have already installed the new version of the algorithm on tatoeba.org.
gillux
2019-04-24 02:33
The 10% of the corpus is only there to let me work with a volume of data small enough to be analyzed quickly, yet large enough to be meaningful, while I run trials and evaluations.

The algorithm currently installed on dev.tatoeba.org, on the other hand, was trained on all the sentences of tatoeba.org, and I would like to install it as is on tatoeba.org. So I invite you to try it out.

Of course, the model will be refreshed regularly.
gillux
2019-04-23 14:11
As promised, I looked into the problem.

I built a dataset by randomly extracting 10% of the current corpus and used it as a working base. I split this base into two parts: 90% to train the model, and 10% to test the trained model. With the current detection algorithm, I measured a success rate of 94%, which is not too bad. I still rewrote the algorithm, because it seemed a bit poorly put together to me. After quite a lot of fine-tuning, I reached a success rate of 97%. I installed it on https://dev.tatoeba.org/, and I invite you to test it.

Keep in mind that from our human point of view, the algorithm can seem rather stupid when it gets something wrong, but that does not mean it is bad.

If we see that it fails to correctly detect "La stupidité n'est pas une excuse", we may logically conclude that it is rather bad. If it fails on a sentence that is so "easy" (in our eyes), how will it fare with a more ambiguous sentence?

However, failing on an easy sentence does not mean it will also fail on a difficult one. The algorithm does not look at words; everything relies on simple statistics of character co-occurrences (no "deep learning" yet, sorry ;-)). For the algorithm, the hardest sentences are not the most ambiguous ones, but those containing character sequences for which its statistics are poor.
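
To give an idea of what detection based on character statistics can look like, here is a minimal sketch. It is not the algorithm running on Tatoeba: it assumes character trigrams and a naive Bayes score with add-one smoothing, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Split a sentence into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples: list[tuple[str, str]], n: int = 3) -> dict[str, Counter]:
    """Count n-grams per language from (language, sentence) pairs."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for lang, sentence in samples:
        counts[lang].update(char_ngrams(sentence, n))
    return dict(counts)


def detect(text: str, model: dict[str, Counter], n: int = 3) -> str:
    """Return the language whose n-gram statistics best explain the text."""
    vocabulary = len({gram for c in model.values() for gram in c})

    def score(lang: str) -> float:
        counts = model[lang]
        total = sum(counts.values())
        # Add-one smoothing so that unseen n-grams do not zero out the score.
        return sum(math.log((counts[gram] + 1) / (total + vocabulary))
                   for gram in char_ngrams(text, n))

    return max(model, key=score)


def accuracy(model: dict[str, Counter], test_samples: list[tuple[str, str]]) -> float:
    """Share of held-out sentences whose language is guessed correctly."""
    hits = sum(detect(sentence, model) == lang for lang, sentence in test_samples)
    return hits / len(test_samples)
```

Training on 90% of a sample and calling `accuracy` on the remaining 10% mirrors the kind of evaluation described above. Note how a sentence is judged only by its character sequences, never by its words.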

In short, to get an idea of the quality of the algorithm, you have to look at how it performs overall, and across all languages.

Looking at where the algorithm struggles, I noticed that Berber is often confused with Kabyle (and vice versa), which makes these languages relatively poorly recognized by the algorithm, despite the amount of data we have for them. I wonder how close they are. The same goes for Russian, which is sometimes confused with Ukrainian (and vice versa); there too, I wonder how close these languages are. There are also the Romance languages, which are sometimes recognized as Interlingua, which is not so surprising given that Interlingua is directly based on the Romance languages.
gillux
2019-04-18 15:34
Dear Seael,

Gracias for reporting the problem to us. I am touched by all the research you did to help us identify the problem. Thank you.

I think I solved the problem now. If it happens again, please let us know.

Note that new sentences never immediately appear in search results. Under normal circumstances, new sentences become searchable within 15 minutes (but this may change in the future). However, you may have to wait a bit longer when the server is too busy, or when we are working on something related to search.

Modifications of metadata (like translations, ownership, tags, audio, etc.) of already searchable sentences are instantly taken into account by the search.
gillux
2019-04-18 09:28
Anybody is welcome to submit a pull request on our Github.
gillux
2019-04-06 17:59 - 2019-04-06 18:24
I’ve been playing around with our default search ranking algorithm. I insist on the "default" part because that’s what the vast majority of visitors use. I also focus on searches that do not use double quotes or any special trick. Just plain words. Again because that’s what the vast majority of visitors use.

Our current way of ranking results is pretty basic: it searches for sentences that include all the words (possibly stemmed) and sorts them by the total number of words in the sentence.

A problem with this approach is that the order of the words is ignored. The top result of searching for "you go there" is "There you go!" because it’s a shorter sentence than "You may go there."

Ignoring word order is especially catastrophic for languages without word boundaries, like Chinese, because the searched characters are randomly reordered into something totally unrelated. For example, the results for "可不可" in Chinese are cluttered by irrelevant "不可something". Same for kana words in Japanese.

In order to address this problem, I tentatively tweaked the default ranking algorithm on https://dev.tatoeba.org/ into something that prioritizes, in the following order:

1. sentences that contain an exact match (as if searching for ="you go there")
2. sentences having the "longest common subsequence" (LCS, [1])
3. sentences having the fewest words

[1] https://docs.manticoresearch.co...anking-factors
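
The three criteria above can be sketched as a sort key. This is only an approximation for illustration: it ignores stemming, splits words with a simple regular expression, and approximates the LCS factor as the longest run of consecutive query words found in the same order in the sentence, which may differ from how Manticore actually computes it. All function names are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words (no stemming)."""
    return re.findall(r"\w+", text.lower())


def has_exact_match(query_words: list[str], sentence_words: list[str]) -> bool:
    """True if the query words appear contiguously and in order."""
    n = len(query_words)
    return any(sentence_words[i:i + n] == query_words
               for i in range(len(sentence_words) - n + 1))


def longest_common_run(query_words: list[str], sentence_words: list[str]) -> int:
    """Length of the longest run of consecutive words shared in the same order."""
    best = 0
    previous = [0] * (len(sentence_words) + 1)
    for query_word in query_words:
        current = [0] * (len(sentence_words) + 1)
        for j, sentence_word in enumerate(sentence_words, start=1):
            if query_word == sentence_word:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def rank_key(query: str, sentence: str) -> tuple[bool, int, int]:
    """Sort key: exact match first, then longest run, then fewest words."""
    q, s = tokenize(query), tokenize(sentence)
    return (not has_exact_match(q, s), -longest_common_run(q, s), len(s))


candidates = ["There you go!", "You may go there.", "Did you go there?"]
results = sorted(candidates, key=lambda s: rank_key("you go there", s))
```

Sentences that tie on the first two criteria still fall back to the number of words, as in the third criterion of the list.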

However, I don’t know if this new ranking suits everyone out there. What do you think?

You can compare the search results on https://tatoeba.org/ (old ranking) and https://dev.tatoeba.org/ (new ranking). You can run a search on tatoeba.org, and then add "dev." in the URL bar and press alt+return to open a new tab.
gillux
2019-04-06 17:27
I am aware that it is frustrating, but rest assured that we have not forgotten this problem; it is recorded on Github [1]. But thank you for reminding us!

It is still the same algorithm by sysko that detects the languages, so the problem must come from the database it relies on. I had tried to update it, but that did not solve the problem. I will investigate this soon and will let you know if I need your help with testing.

[1] https://github.com/Tatoeba/tatoeba2/issues/1731
gillux
2019-04-05 22:01
We upgraded our search engine to the latest version of Manticore. Manticore is a fork of Sphinx. You shouldn’t notice anything new because the search functionality remains the same. It just improves performance a little bit and paves the way for future improvements.

That said, while we were at it, we added stemming support for four additional languages:

• Danish
• Hungarian
• Romanian
• Norwegian (Bokmål)

Have a look at this page if you wonder what stemming is about: https://en.wiki.tatoeba.org/art...h#more-details
gillux
2019-04-04 11:44
The page is: https://tatoeba.org/licensing/switch_my_sentences

I updated the wiki page, thank you.

Beware that this feature is still under development. Feedback is welcome.
gillux
2019-04-03 15:33
Yes. Since this feature is still under development, it is for now only accessible to certain people who wish to publish their sentences under the CC0 license. If you would also like to have access to it, ask an administrator.