Thread #25034 - Tatoeba

I spent a little time fixing furigana mistakes. Since I am in most cases not allowed to remove the "furigana mistake" tag, I marked the corrected sentences with "furigana fixed", so you can easily find them here: https://tatoeba.org/eng/tags/sh..._with_tag/6998

I found a few small bugs with the new furigana system, though.

Apparently the full width question mark (？) is seen as a character that can receive furigana, which means it shares the furigana with any character following it, such as in the sentence #2929493. The same problem exists with other full width characters, as can be seen here on the test server: https://dev.tatoeba.org/eng/sentences/show/3837896
I had to remove the ｛｝ after ＝ and ｔｅｓｔ, because I got errors saying that those strings lacked furigana.
Note that full width characters don't "share" furigana with the following characters in the automatically generated example: https://dev.tatoeba.org/eng/sentences/show/3837895 (click the black かな button to view)

Since this behavior doesn't happen with the Japanese comma, full stop, and quotation brackets (、 and 。 and 「」), I assume they are in one way or another categorized as exceptions. I would in that case add several other reading signs, such as （）？！ to the same list.

Most other signs, such as ％ and letters and numbers should probably not be added to that list and be able to take furigana, for sentences such as #117779 and #161874. However, for sentences such as #408303 it would be useful if they are allowed to have an empty furigana value. At the moment the furigana system does not accept either of the following:

そうです。日本語｛にほんご｝ではウェートレスは英語｛えいご｝の"waitress｛｝"と"weightless｛｝"にも該当｛がいとう｝する。でも"waitress｛｝"という意味｛いみ｝が普通｛ふつう｝だね。

そうです。日本語｛にほんご｝ではウェートレスは英語｛えいご｝の"waitress"と"weightless"にも該当｛がいとう｝する。でも"waitress"という意味｛いみ｝が普通｛ふつう｝だね。

due to this error:

The following characters lack furigana: w, a, i, t, r, e, s, s, w, e, i, g, h, t, l, e, s, s, w, a, i, t, r, e, s, s.

There's also a related problem with numbers. The current automatic furigana system does not assign furigana to numbers, so the string ８才 gets assigned ８才｛さい｝. This displays correctly in the preview before the automatic furigana is accepted as correct, but once accepted, the system reads it as [８才](さい) (furigana shared by both characters) instead of ８[才](さい) (furigana only on the kanji).
(I made an example of this on the test server: https://dev.tatoeba.org/eng/sen.../show/3837897)

If numbers are made to accept a null string this could be fixed by automatically adding a null furigana value to numbers, which can optionally be filled manually. So then the system would automatically assign ８｛｝才｛さい｝ instead of ８才｛さい｝, which can then manually be changed to ８｛はっ｝才｛さい｝ or ８才｛はっさい｝ when desired.
However, the disadvantage of this would be that strings like 10分 would most likely get assigned 10｛｝分｛ふん｝, which doesn't take into account the consonant change from ふ to ぷ caused by the preceding "10". This is easily missed, which would most likely lead to a number of cases of not entirely wrong but certainly misleading furigana getting accepted by well-meaning users.
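
To make the proposal concrete, here is a minimal sketch of the "add an empty furigana value to numbers" step. It is not the actual Tatoeba code: the function name add_empty_readings and the regular expression are assumptions made up for illustration, and it only handles ASCII and full width digits.

```python
import re

# Hypothetical helper, not part of Tatoeba: a run of ASCII or full width
# digits that is not already followed by a ｛…｝ reading.
# The lookahead also rejects a following digit so that the regex cannot
# backtrack and split a number such as "10" in the middle.
DIGITS_WITHOUT_READING = re.compile(r"([0-9０-９]+)(?![0-9０-９｛])")


def add_empty_readings(text):
    """Append an empty ｛｝ to every digit run that has no reading yet."""
    return DIGITS_WITHOUT_READING.sub(r"\1｛｝", text)


print(add_empty_readings("８才｛さい｝"))            # ８｛｝才｛さい｝
print(add_empty_readings("10分｛ふん｝"))            # 10｛｝分｛ふん｝
print(add_empty_readings("10｛じゅっ｝分｛ぷん｝"))  # unchanged
```

The second call shows the drawback described above: the empty value is added mechanically, so the ふん after "10" still has to be corrected to ぷん by hand.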

A bit of a long post, but hopefully it was useful feedback.

gillux, December 8, 2015 at 2:12:37 AM UTC

Thank you for your feedback.

> Since this behavior doesn't happen with the Japanese comma, full stop, and quotation brackets (、 and 。 and 「」), I assume they are in one way or another categorized as exceptions. I would in that case add several other reading signs, such as （）？！ to the same list.

Yes, you guessed right. I expected some bugs like furigana expanding leftward, since the implementation isn’t great. The root of the problem is the bracket syntax. It’s easy for humans to edit but hard for computers to parse, because it’s not clear which characters the furigana belongs to. Furigana is actually stored internally using the computer-friendly syntax [漢字|かんじ], which is why autogenerated furigana does not expand oddly. You can input furigana directly using that syntax if you want to work around expansion bugs, but of course the goal is not to have to use it.
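
The ambiguity can be illustrated with a small sketch of a bracket-to-internal converter. This is not the real implementation: the function brackets_to_internal, the is_kana range check, and the rule "a reading attaches to every preceding character that is neither kana nor on the exception list" are all assumptions chosen to reproduce the behavior reported in this thread.

```python
# Hypothetical exception list: characters that never receive furigana.
NO_FURIGANA = set("、。「」")


def is_kana(ch):
    # Hiragana and katakana blocks (U+3040 to U+30FF), including ー.
    return "\u3040" <= ch <= "\u30ff"


def brackets_to_internal(text, exceptions=NO_FURIGANA):
    """Convert 漢字｛かんじ｝ into [漢字|かんじ] under the assumed rule."""
    out = []
    pending = []  # characters still waiting for a reading
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "｛":
            end = text.index("｝", i)  # raises ValueError if unclosed
            reading = text[i + 1:end]
            base = "".join(pending)
            pending = []
            # An empty ｛｝ (or a reading with nothing before it)
            # leaves the base text without furigana.
            out.append(f"[{base}|{reading}]" if base and reading else base)
            i = end + 1
            continue
        if is_kana(ch) or ch in exceptions:
            # Kana and listed punctuation end the current run.
            out.append("".join(pending))
            pending = []
            out.append(ch)
        else:
            pending.append(ch)
        i += 1
    out.append("".join(pending))
    return "".join(out)


print(brackets_to_internal("８才｛さい｝"))      # [８才|さい]
print(brackets_to_internal("８｛｝才｛さい｝"))  # ８[才|さい]
text = "本当｛ほんとう｝？日本語｛にほんご｝"
print(brackets_to_internal(text))
# [本当|ほんとう][？日本語|にほんご]
print(brackets_to_internal(text, NO_FURIGANA | set("（）？！")))
# [本当|ほんとう]？[日本語|にほんご]
```

Under these assumptions, adding （）？！ to the exception list is enough to stop the reading from spreading over the question mark, and an empty ｛｝ after a number keeps the following reading on the kanji only.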

I don’t know if we should enforce furigana over every word that is not Japanese. I’m tempted to say that we should, but I’d like to have the opinion of Japanese contributors.

About ８才, as you said cases like 10｛｝分｛ふん｝ may be misleading. Since さい expanded over the whole ８才 is easier to spot, I think I’ll keep things the way they are now.

tommy_san, December 8, 2015 at 4:06:47 AM UTC, edited December 8, 2015 at 2:39:57 PM UTC

> I don’t know if we should enforce furigana over every word that is not Japanese.

I think that depends on how we pronounce it. When I see the word "Twitter" in a Japanese text, I pronounce it as ツイッター, and you should do so too, otherwise you might not be understood, so we definitely need furigana here. On the other hand, I'd rather pronounce the English words in #236357 as in English, so I don't feel like adding furigana to them.

Something seems to be wrong with this sentence, by the way. I get the error "The provided sentence differs from the original one near “と”." when I try to edit the furigana. I guess it's because of the space before "heir". See also #4720072.

gillux, December 14, 2015 at 5:42:45 PM UTC

I see. I’d like to generate furigana with empty brackets after English words, as is the case at the moment, so that the user needs to either fill them or remove them. This should also work for languages other than English. However, it’s a bit harder than it sounds.

Currently, the validation rule is of the form “require furigana on anything except kana and punctuation”. Since I can’t think of all the classes of characters that should have furigana, I wanted to start with a strict rule so that we can soften it as we find exceptions. But if we allow foreign words, we instead need to explicitly list all the characters that require furigana, and the rule becomes “require furigana on kanji, numbers” and what else? I can’t think of everything. What about percent and other math symbols?
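
As a rough illustration of the strict rule, here is a sketch that works on the internal [漢字|かんじ] syntax. It is an assumption-laden approximation, not the real validator: the names chars_lacking_furigana and error_message are invented, and "punctuation" is approximated by whitespace plus the Unicode P* categories, which may not match the list actually used.

```python
import re
import unicodedata

# Hypothetical pattern for one [base|reading] group in the internal syntax.
GROUP = re.compile(r"\[[^|\]]+\|[^\]]*\]")


def is_kana(ch):
    # Hiragana and katakana blocks (U+3040 to U+30FF), including ー.
    return "\u3040" <= ch <= "\u30ff"


def is_punctuation(ch):
    # Assumption: whitespace and Unicode punctuation categories (P*).
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def chars_lacking_furigana(internal):
    """Strict rule: everything except kana and punctuation needs furigana."""
    bare = GROUP.sub("", internal)  # drop text that already has a reading
    return [ch for ch in bare if not is_kana(ch) and not is_punctuation(ch)]


def error_message(chars):
    return "The following characters lack furigana: " + ", ".join(chars) + "."


missing = chars_lacking_furigana('[英語|えいご]の"waitress"と')
print(error_message(missing))
# The following characters lack furigana: w, a, i, t, r, e, s, s.
```

Switching to an explicit list would mean replacing the two "not" conditions with a single positive test such as "is kanji or a digit", which is exactly where the open question about percent and math symbols comes in.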

tommy_san, December 15, 2015 at 12:35:31 AM UTC

Do you need to require anything at all? Wouldn't it be enough to show an alert message like "Are you sure you don't want furigana on the characters ...?" and just leave these characters without furigana if the user wants it that way? If you make a special page for corpus maintainers that lists such sentences, we'll be able to easily correct the sentences that actually lack necessary furigana.
