Tatoeba
Demetrius, June 9, 2010 at 5:58:33 PM UTC

If the okurigana is incorrect, can we just file these as bugs in MeCab?

Demetrius, June 9, 2010 at 6:43:11 PM UTC

*furigana

blay_paul, June 9, 2010 at 6:57:14 PM UTC

In theory. However, they are not really bugs in MeCab but problems with the dictionary used with MeCab. The dictionary can be selected (from a very short list ;-) and can also be altered or supplemented with user-defined dictionaries.

So really what it needs is someone familiar with MeCab to find the best dictionary available and to add fixes for the problems noted.

However, it will probably never be possible to be 100% accurate, so manual corrections will be needed at some point to get the best results.
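One way the manual-correction step could look is a thin override layer on top of the automatic output. This is only a rough sketch: the `Reading` type, the `apply_corrections` function, and the idea of keying corrections by token position are all assumptions made for illustration, not how Tatoeba stores anything, and the "automatic" reading is typed by hand rather than produced by MeCab.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one (surface, furigana) pair per token.
Reading = List[Tuple[str, str]]


def apply_corrections(auto: Reading, corrections: Dict[int, str]) -> Reading:
    """Return a copy of `auto` with the furigana replaced at corrected positions."""
    return [
        (surface, corrections.get(index, kana))
        for index, (surface, kana) in enumerate(auto)
    ]


# Illustrative automatic output for 黒板の訳 (written by hand for this sketch).
auto = [("黒板", "こくばん"), ("の", "の"), ("訳", "わけ")]

# A human reviewer records that token 2 should read やく.
fixed = apply_corrections(auto, {2: "やく"})
```

Keeping the corrections separate from the automatic output means the readings can be regenerated with a better dictionary later without losing the human fixes.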

JimBreen, June 10, 2010 at 8:44:49 AM UTC

It's not that simple.

Consider the sentence: 君たちの訳文と黒板の訳を比較しなさい。 MeCab suggests わけ as the reading of the solo 訳, whereas we all know it's やく. MeCab's usual dictionary (NAIST-JDIC) has both versions of 訳, and no amount of adding dictionaries is going to "fix" it.

MeCab uses some very sophisticated AI to segment sentences, and the dictionaries have parameters derived from training on hand-segmented texts. The trouble is that ...の訳を... could be either, and you need the context of the whole sentence to decide which is which. In fact, the weightings for 訳/わけ and 訳/やく as solo lexemes are the same. You could probably fiddle the weights on 訳 to make it produce やく, but most of the solo appearances of 訳 in Tatoeba are in fact わけ, so that change would break more readings than it fixes.
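To make the weight problem concrete, here is a toy model of it. Everything in it is invented for illustration: the cost numbers, the `pick_reading` and `count_correct` helpers, and the three "gold" labels are assumptions, not values from NAIST-JDIC or from Tatoeba. Real MeCab also scores the connections between neighbouring tokens, which this sketch ignores entirely.

```python
from typing import Dict, List, Tuple

# Candidate readings per surface form with a toy cost (lower wins).
CostTable = Dict[str, List[Tuple[str, int]]]

# Equal costs for both readings, mirroring the tie described above.
READINGS: CostTable = {"訳": [("わけ", 5000), ("やく", 5000)]}


def pick_reading(surface: str, table: CostTable) -> str:
    """Pick the lowest-cost reading; on a tie the first listed entry wins."""
    return min(table[surface], key=lambda entry: entry[1])[0]


def count_correct(table: CostTable, gold: List[str]) -> int:
    """Count how many gold readings of a solo 訳 the table reproduces."""
    return sum(1 for wanted in gold if pick_reading("訳", table) == wanted)


# Made-up gold labels for three sentences containing a solo 訳:
# one should read やく, two should read わけ.
GOLD = ["やく", "わけ", "わけ"]

before = count_correct(READINGS, GOLD)

# "Fiddle the weights": make やく cheaper so it always wins.
tweaked: CostTable = {"訳": [("わけ", 5000), ("やく", 4000)]}
after = count_correct(tweaked, GOLD)
```

With a context-free cost per reading, every solo 訳 gets the same answer, so the tweak fixes the やく sentence only by breaking the わけ ones. Telling them apart needs information from the rest of the sentence.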

There is a whole research field of Word Sense Disambiguation (WSD) working on problems related to this, but I don't think there are any packaged solutions for Japanese that can be plugged into Tatoeba. Just be grateful we have MeCab - 20 years ago automatic Japanese segmenters were thought to be impossible to build.