
gillux's messages on the Wall (total 401)

2014-09-17 21:06 - 2014-09-17 21:18

I recently worked on improving the furigana for Japanese sentences. The furigana are now displayed in hiragana instead of katakana. In addition, they are no longer attached to words that are already written in kana. (Actually, it’s not perfect: when a word contains a mix of kana and kanji, the whole word, including the kana parts, is displayed in the furigana.)

In other words, we now have (#3501384):
言い訳[いいわけ] ばっか すん な よ 。
Instead of:
言い訳[イイワケ] ばっか[バッカ] すん[スン] な[ナ] よ[ヨ] 。[。]

Last but not least, the furigana should contain fewer errors than they used to. For instance, 来ない is now correctly read as こない instead of *きない. But beware, the furigana are still not 100% accurate.
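
To make the two display rules concrete, here is a minimal sketch in Python. It is not the actual furigana software: it assumes (hypothetically) that an analyzer has already split the sentence into (surface, katakana reading) pairs, and the helper names are made up for illustration. Readings are converted to hiragana, and only tokens containing kanji get a reading attached.

```python
# Sketch only: the token list and helper names are hypothetical,
# not taken from Tatoeba's furigana generation software.

def katakana_to_hiragana(text):
    """Shift each katakana letter (U+30A1..U+30F6) to its hiragana counterpart."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def has_kanji(word):
    """True if the word contains a CJK ideograph or the iteration mark 々."""
    return any("\u4e00" <= ch <= "\u9fff" or ch == "々" for ch in word)

def annotate(tokens):
    """Attach a hiragana reading only to tokens that contain kanji."""
    parts = []
    for surface, reading in tokens:
        if has_kanji(surface):
            parts.append(f"{surface}[{katakana_to_hiragana(reading)}]")
        else:
            parts.append(surface)  # kana and punctuation need no reading
    return " ".join(parts)

tokens = [("言い訳", "イイワケ"), ("ばっか", "バッカ"), ("すん", "スン"),
          ("な", "ナ"), ("よ", "ヨ"), ("。", "。")]
print(annotate(tokens))  # 言い訳[いいわけ] ばっか すん な よ 。
```

As in the real output, a mixed word such as 言い訳 still gets the reading of the whole word, kana parts included.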

EDIT: On a side note, I’d like to mention that deploying the updated version of our (terrible) furigana generation software was a piece of cake, thanks to the work of pallavshah, one of the GSoC students who worked on Tatoeba this summer. In other words, he saved us hours of tedious work, and we can now develop faster and more safely.
2014-09-16 13:57
I got your point. I agree that we should allow multiple readings, but then things get complicated.

If one can come up with a neat way to display these readings for Latin-based languages (like a tooltip or something), how are we going to display multiple readings for Japanese?

What will be the order of the readings (which one we put on the top or bottom of the list)?

What reading should we use for transliterations (like romanization)? We just can’t say that one of the readings is “the main one” because people are likely not to agree.

And there may be other issues I’m not thinking about yet.

Alternatively, we could say that we only allow one reading, which is up to the owner. It doesn’t mean it’s the only way to read, but it’s how the owner would read it.
2014-09-16 12:23
> I think every language would require some kind of reading aid at least partially to show how to read, say, "2014" or "Louis XIV".

That makes sense. Actually I have the feeling that the way furigana are implemented is kinda wrong, because they are treated as an alternative script, just like e.g. the romanization of Chinese. As a result, Japanese sentences are displayed twice (once with furigana and once without), which is a total waste of space and a bad way of presenting sentences. Implementing reading aids as a different concept would both solve this bad presentation and make them available for any language.

On the other hand, although reading aids may help with pronunciation, they are nonexistent in Latin-based languages, so attaching reading aids wouldn’t show these languages as they really are. This could e.g. trick Japanese learners into thinking that English can actually have reading aids, just like Japanese uses furigana.

> You need also to keep in mind that sometimes multiple readings are possible (eg. 何 and 明日 in Japanese). Quality of reading aids should be controlled much the same way as the sentences, since there are some readings that are theoretically possible but unlikely.
Do you mean we should allow multiple readings? I’m not sure it’s a good idea.
2014-09-08 19:33
I actually started to draft a possible implementation in that ticket:

It would be great if we could list all the languages that could benefit from having editable alternative scripts, so we can implement a solution that could easily be ported to other languages. We basically need to know:
— the language;
— its alternative script(s), how they derive from the main script and what they are used for (a link to the Wikipedia page should be enough);
— whether the script(s) can be computer-generated with 100% accuracy;
— if not, which tools out there can generate a partially accurate script.
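
For illustration, the information asked for above could be collected in a small record like the following Python sketch. The class and field names are hypothetical and not part of Tatoeba’s code; the only entry filled in is the Japanese one, using what is said on this Wall (furigana are not 100% accurate).

```python
from dataclasses import dataclass, field

@dataclass
class AlternativeScript:
    """One alternative script of a language (hypothetical structure)."""
    language: str                   # language code or name
    name: str                       # name of the alternative script
    purpose: str                    # what the script is used for
    reference_url: str = ""         # e.g. a Wikipedia page, left empty here
    fully_generatable: bool = False # can software produce it with 100% accuracy?
    tools: list = field(default_factory=list)  # tools giving partial accuracy

    def needs_manual_editing(self):
        """Scripts that cannot be generated perfectly must stay editable."""
        return not self.fully_generatable

furigana = AlternativeScript(
    language="jpn",
    name="furigana",
    purpose="reading aid for kanji",
)
print(furigana.needs_manual_editing())  # True
```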
2014-08-30 11:31 - 2014-08-30 11:41
#1590115#1854139 を誰かリンクしてくれませんか?
Can someone link #1590115 and #1854139?
Quelqu’un pourrait-il lier #1590115 et #1854139 ?
2014-08-14 19:17
I’m thinking about a way to solve this and I’ve explained it in a ticket.
2014-08-12 18:16 - 2014-08-12 18:21
> Is it possible to sort the results differently? Now if I want to find for example "Japanese", I type in "ja", I get several results that are alphabetically before "ja" but just contain "ja" somewhere in the middle (like Chinyanja, Gujanti). I don't think any one who wants to find Chinyanja types in "ja", but rather "chsomething". So, is it possible to get first in the filtered list the ones that actually begin with the search string?

Edit: so just like Trang I misread your comment, sorry. So yes, I think it’s possible, with a bit of hacking into Chosen.
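
The requested ordering itself is simple. Here is a minimal sketch in Python, only to show the idea; it is not Chosen’s API (Chosen is a JavaScript plugin), and the function name and language list are made up for the example. Names that begin with the search string come first, then the names that merely contain it.

```python
def filter_languages(names, query):
    """Keep names containing the query; prefix matches are listed first."""
    q = query.casefold()
    matches = [name for name in names if q in name.casefold()]
    # False sorts before True, so names starting with the query lead the list;
    # within each group the order is alphabetical.
    return sorted(
        matches,
        key=lambda name: (not name.casefold().startswith(q), name.casefold()),
    )

names = ["Chinyanja", "French", "Gujarati", "Japanese", "Javanese"]
print(filter_languages(names, "ja"))
# ['Japanese', 'Javanese', 'Chinyanja', 'Gujarati']
```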
2014-08-08 20:39
Tatoeba uses gettext to localize text. The problem you describe can probably be solved by using so-called contexts [1]. Contexts make it possible to have different translations of the same original string depending on where it appears. For instance, "French" in the context of a language list may have a different translation than "French" in the context of "French example sentences". Impersonator, do you think this would solve the problem?
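
To illustrate how contexts behave, here is a small Python sketch. It does not use Tatoeba’s actual setup: the toy catalog class, the context names and the French translations are all assumptions made for the example. In a real PO file, the context would be given with a `msgctxt` line above the `msgid`.

```python
import gettext

class DictTranslations(gettext.NullTranslations):
    """Toy in-memory catalog keyed by (context, msgid); for illustration only."""

    def __init__(self, entries):
        super().__init__()
        self._entries = entries

    def pgettext(self, context, message):
        # Fall back to the untranslated string when the pair is unknown.
        return self._entries.get((context, message), message)

# Hypothetical contexts and translations.
catalog = DictTranslations({
    ("language list", "French"): "français",
    ("example sentences", "French"): "françaises",
})

print(catalog.pgettext("language list", "French"))      # français
print(catalog.pgettext("example sentences", "French"))  # françaises
```

The same original string "French" gets two different translations, which is exactly what a single context-free entry cannot express.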

2014-08-06 05:21
Can you give us more details about this problem? Does the redirect only happen when you click on the login button? Does it always happen? What is the sentence page you’re getting redirected to? What was the time and date? What’s your browser? What’s your IP address? (You can get it by going here

Have you tried clearing your browser cache? (
2014-08-03 04:15
Yes, we figured there’s something wrong. We’re working to solve the problem.
2014-07-21 12:20
The French Wikipedia features something like this:

Click on the search field, then click on the keyboard icon that drops down. I don't know anything about Esperanto, but I've been able to input diacritics with the "Esperanto q sistemo" by typing "s q" "u q".

gleki, does something like this fulfill your needs? What about the other Esperanto keyboards available on Wikipedia?
2014-07-17 13:30
You learned about the leaky abstraction problem the hard way. I think it’s a problem every developer (or even anyone creating things in order to ease others’ work) should be aware of. Here is a nicely written article about leaky abstractions

Good luck for the future of pytoeba.
2014-07-16 16:27
The more I think about this, the more it seems to me like an essential feature that Tatoeba has needed more than anything for a long time. I think people involved in Tatoeba all have their own goals, things they care about and things they don’t. A forum-like structure would bring together people with similar goals. Goals like “make more English-Japanese pairs”, “add more sentences in language X with rare words”, “grow the corpus of language X”, “fix mistakes in sentences of language Y” etc. People who want to ask specialists of a given field (like X-Y translators, natives of language X…) could post in a specific topic to reach them efficiently.

Instead of this, Tatoeba feels to me more like a place where everyone is working on their own, and communicating mostly privately with users they got to know incidentally through sentence authorship or comments. This is not working great. (Please correct me if you don’t share that feeling.)

I wasn’t here when this Wall system was installed, but I wonder what the reasons were for building our own thing like this, instead of the very popular forum structure with topics, which has been implemented hundreds of times already.
2014-06-30 01:52
For the record, language names were previously sorted with PHP’s natcasesort(), which is English-centered. They are now sorted using the ICU library, which sorts words using language-specific comparison rules. For more information, visit and look for “collation.” ICU is supposed to do a good job, but as we developers don’t know every language’s collation rules, feel free to comment on this.
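
To show why plain code-point order falls short, here is a rough Python sketch. It is not ICU and it ignores language-specific rules: it is only a standard-library approximation that strips accents and case before comparing. The sample names and the helper name are made up for the example.

```python
import unicodedata

def fold_key(name):
    """Crude sort key: drop combining accents and case (not real collation)."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original name is kept as a tie-breaker for otherwise equal keys.
    return (stripped.casefold(), name)

names = ["Zulu", "Éwé", "espéranto", "Anglais"]

print(sorted(names))
# ['Anglais', 'Zulu', 'espéranto', 'Éwé']   (code-point order)
print(sorted(names, key=fold_key))
# ['Anglais', 'espéranto', 'Éwé', 'Zulu']   (accents and case ignored)
```

A real collation library goes further than this, for example by applying different rules depending on the language of the interface.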
2014-05-30 10:02
Hi Impersonator,

As far as I understand, the original French just says “it is sufficient to give attribution in the edit summary, which is recorded in the sentence history when importing the text.” It doesn’t say “Where such credit is commonly given through sentence comments.”
2014-04-26 14:33
Silja, thank you for reporting this! It’s now fixed. But you probably need to clear your browser’s cache to get the favicon (in Firefox, press Ctrl+Shift+Del, tick “Cache” and press the delete button).
2014-02-24 19:06
Good news, everyone! This year, Tatoeba has been accepted to participate in the Google Summer of Code. This means students are going to contribute to Tatoeba by coding during the summer. For more information about the Google Summer of Code:
2014-02-22 10:45

I’m thinking about importing French sentences from the 8th edition of the Dictionnaire de l’Académie française (released in 1932). This dictionary is in the public domain (so licence-compatible with Tatoeba), and it includes quite a few excellent example sentences for any word. For more details about the relevance of the sentences and the feasibility of the import, see the longer version of this message below.

I’d also like to know how mass sentence imports are usually performed on Tatoeba.



I have in mind to import French sentences from the 8th edition of the Dictionnaire de l’Académie française (released in 1932). This dictionary is in the public domain (and therefore compatible with Tatoeba’s licence), and it contains a good number of excellent example sentences for any given word. A digitized version is available on the CNRTL website. See for instance the entry for the word “mobile”:

The interesting example sentences are in italics in the entry. From such an entry, we could import sentences such as:
* L'aiguille aimantée est mobile sur son pivot.
* Cette roue n'est pas assez mobile.
* Un mobile imprime une partie de son mouvement à un autre mobile qu'il rencontre.
* L'amour de la gloire est le mobile de grandes actions.
* L'appât du gain est son unique mobile.
* Il n'y a eu dans sa conduite aucun mobile intéressé.

As you can see, the sentences are short and natural, and they show clearly how to use the word “mobile”. A godsend for a project like Tatoeba!

On the other hand, there are also verbless phrases such as:
* Un mobile. Les mobiles bretons. Les mobiles de l'Oise. La surface mobile des eaux.

I don’t think these verbless phrases are great, and they shouldn’t be imported. I don’t know to what extent sorting them out can be automated (for instance, by assuming that they contain no verb). For now, I’m simply presenting my idea and the first problems I can see. If manual checking is needed, it would take a rather colossal collaborative effort, since there are about 32,000 entries in this dictionary, so easily more than 100,000 sentences.

I’d also like to know how mass sentence imports are usually carried out on Tatoeba.

Technically, this dictionary has already been imported into the French Wiktionary. I even believe that is the basis it started from. The Wiktionarians’ goal was different, but we could reuse their tools and know-how:
2014-02-01 20:09
Welcome to the real world.
2014-02-01 19:24

Frankly, Pharamp, this is quite shocking. I am strongly against “solving” this non-problem.

As long as natives do write numbers using both digits and words in real life, I see no reason not to allow both. Languages are not that consistent, and Tatoeba shouldn’t be either. Researchers and learners should embrace languages as they actually are, and not change them in order to match their own specific needs.

Furthermore, there are plenty of contexts where numbers are almost always written with digits, or almost always written in words. You just can’t mass-edit one way or the other for the sake of consistency. Of course every spelling is possible, but conventions exist and should be reflected in Tatoeba’s content. Just to name a few: I was born in 1952. There are 1,952 sentences. The book costs $19.52. I got a 404 error. My phone number is 0123456789. I’ve got a 32-bit processor, and a Nintendo 64. The next train arrives at 7:50 p.m. She killed two birds with one stone. One should know that. Remember that two wrongs don’t make a right. Give me five! He can talk French twenty to the dozen. Le weekend du 15 août. Je me suis mis sur mon trente-et-un. Appliquons la règle de trois. Vingt-deux, voilà les flics. C’est trois fois rien. Mille mercis. Les mille et une merveilles du monde. (Now I’ve got to add all these to Tatoeba.)

To me, the reading problem that others mentioned is a different problem that should be solved with a different solution. Like adding audio, or adding readings (like Japanese already has, though it’s broken at the moment).