Tatoeba
blay_paul, February 26, 2010 at 2:15:19 PM UTC

Romaji generator.

Are the tool and data used available online anywhere?

sysko, February 28, 2010 at 12:03:05 AM UTC

Yep, it's from the kakasi project:
http://kakasi.namazu.org/
As said before, the project seems more than dead :(

blay_paul, February 28, 2010 at 9:12:54 AM UTC

Oooh yes. I remember this now.

The bad news is that kakasi probably isn't really fixable. I think you'd need to rewrite the code in a major way, not just add a few lines to the dictionary, to fix it.

The good news? Removing the line
ぜつ 絶
from the file 'kakasidict' may correct one romaji error in generated romaji.
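
A minimal sketch of that dictionary edit, for anyone who wants to script it. It assumes each entry in 'kakasidict' is a line of whitespace-separated fields with the reading first and the kanji second, as in the line quoted above, and that the file is stored as EUC-JP. Both are assumptions to check against the actual file, and the function names are made up for illustration.

```python
def drop_entry(lines, reading, kanji):
    """Return the dictionary lines without the entry `reading kanji`.

    Assumes (not verified against kakasi itself) that each entry is
    whitespace-separated with the reading first and the kanji second.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if fields[:2] == [reading, kanji]:
            continue  # skip the offending entry
        kept.append(line)
    return kept


def rewrite_dictionary(path, reading, kanji, encoding="euc_jp"):
    """Rewrite the dictionary file in place without the given entry.

    The EUC-JP default is an assumption about how the file is stored.
    """
    with open(path, encoding=encoding) as handle:
        lines = handle.read().splitlines()
    with open(path, "w", encoding=encoding) as handle:
        handle.write("\n".join(drop_entry(lines, reading, kanji)) + "\n")
```

Something like rewrite_dictionary("kakasidict", "ぜつ", "絶") would then drop that one line; keep a backup of the file first.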

sysko, February 28, 2010 at 12:21:50 PM UTC

I think we can also try to find out whether there are people motivated to start a project for automatic romanization of Japanese, or look for an existing project of that kind in its early stages and see how we can help.

contour, February 28, 2010 at 5:54:22 PM UTC

For now, if it were possible to enter the romaji explicitly, and if manually entered and automatically generated romaji were kept separate, the manual entries would make a good test set for evaluating different methods of automatic generation.

I think that ideally one would start with a mature project, and automatically add corrections to the training set.
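
To make the test-set idea concrete, here is a rough sketch of how such an evaluation could look. Everything in it is hypothetical: the converter is any function from a Japanese sentence to romaji, the test set is a list of (sentence, manually entered romaji) pairs, and the comparison ignores case and spacing differences, which may or may not be what one wants.

```python
def normalize(romaji):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(romaji.lower().split())


def evaluate(converter, test_set):
    """Score a converter against manually entered romaji.

    converter: function taking a Japanese sentence and returning romaji.
    test_set:  list of (sentence, manual_romaji) pairs.
    Returns (accuracy, mismatches), where each mismatch is a
    (sentence, expected, produced) tuple.
    """
    mismatches = []
    for sentence, expected in test_set:
        produced = converter(sentence)
        if normalize(produced) != normalize(expected):
            mismatches.append((sentence, expected, produced))
    total = len(test_set)
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches
```

Running the same test set through each candidate tool would give directly comparable numbers, and the mismatch list shows where each one goes wrong.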

blay_paul, February 28, 2010 at 7:20:08 PM UTC

> I think that ideally one would start with a mature
> project, and automatically add corrections to the
> training set.

There are six main approaches that could be taken.
1. Drop romaji support.
2. Allow manual correction of romaji.
3. Develop romaji generation code that uses the WWWJDIC index line.
4. Further develop kakasi.
5. Look for alternative romaji conversion software.
6. Develop romaji conversion software from scratch.

I would recommend 1, 2, or 3.

Option 4 could be done, but I think you would soon reach the limits of what is achievable.

sysko, March 1, 2010 at 12:09:20 AM UTC

(I don't speak Japanese at all, so excuse me if I'm talking nonsense.)

Nemo talked about replacing kakasi with JUMAN, which can output kana.
Wouldn't kana be better? If the "reading" part is restricted to kana characters, we're sure that people who can't write Japanese won't "accidentally" mess up the "romanization", and we're also sure that everyone uses the same convention, since there's only one kana per "sound".
(Trang always talks about the different ways to write romaji.)

What do you think? Trang?

Nemo, March 1, 2010 at 2:51:08 AM UTC

If JUMAN's kana/categorization output is accurate, it can be used to produce 100% perfect romaji output. Kana represent how something is said, and JUMAN supplies the syntactic category alongside. There are ambiguities in kana, but JUMAN gives enough information that pronunciation and syntax can be reconciled to provide a perfect, phonetic romanization.

Nemo, March 1, 2010 at 3:11:15 AM UTC

I should give a little more information than I have in my past posts, I think, because there seems to have been little progress. I don't want to come off as harsh, but the reality is that kakasi is a lost cause. Whoever coded the program did so in a very naive way, and using sed to correct its errors would take an inordinate amount of both human and CPU time; even in the best case, it would put such a load on the server as to make Tatoeba unusable.

I've gotten the impression that kakasi was chosen with little to no consideration of other options (cf. below), despite the fact that there are ways to accurately dissect Japanese text into parseable units, which could then be converted into romaji. Kakasi is nowhere near mature enough to produce accurate results, and as an abandoned project there is little hope of it reaching that maturity. Its output will never get any better than it is.

In contrast, JUMAN seems to be near-perfect, though I will admit that I have not tried the other romanizers suggested in the blog post, nor have I done extensive testing of JUMAN. Regardless, JUMAN seems to be acceptable, even optimal. Kakasi falls so far short of the mark that I'm not sure why it is even in use. I would even go so far as to say that if kakasi remains the method of conversion, then by the time Tatoeba becomes popular, Greasemonkey scripts will have been written to correct the romaji by some other means, if that's even feasible. (Here's the blog post I referenced: http://blog.tatoeba.org/2009/02...anization.html )

TRANG, March 1, 2010 at 8:37:21 PM UTC

Yes, to be honest, KAKASI was chosen with no consideration of other options. It was the first one I found when I searched for a romaji converter, so I picked it.

Only later did I write this blog post, where I actually searched for and listed other solutions. Solutions that I should have explored but never had the time to =/
I completely agree with you that KAKASI is not the long-term solution.

Anyway, considering you have been taking the time to write all these posts, I will take a look at JUMAN ;). But if you can just tell me quickly what command to use to get a Japanese sentence parsed and converted into kana, that would save me some time going through the documentation. Ah, but does JUMAN support UTF-8...?

contour, March 1, 2010 at 10:19:40 PM UTC

From a quick look, it looks like you have to convert to and from EUC-JP. Piping a sentence through "juman -b -c" gives one line per word, with readings in the second of the space-separated columns.
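
A rough sketch of how that pipeline might be wrapped, going only on the description above: encode to EUC-JP, pipe through "juman -b -c", decode, and take the second space-separated column of each line. It assumes juman is installed and on the PATH, and that the output ends with an "EOS" marker line that should be skipped; neither has been checked here, and the function names are made up.

```python
import subprocess


def readings_from_juman_output(output):
    """Collect the second space-separated column (the reading) of each line.

    Blank lines and a terminating "EOS" line are skipped; the EOS marker
    is an assumption about the output format.
    """
    readings = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        columns = line.split(" ")
        if len(columns) >= 2:
            readings.append(columns[1])
    return readings


def juman_readings(sentence, command=("juman", "-b", "-c")):
    """Pipe one sentence through juman, converting to and from EUC-JP."""
    result = subprocess.run(
        list(command),
        input=sentence.encode("euc_jp") + b"\n",
        stdout=subprocess.PIPE,
        check=True,
    )
    return readings_from_juman_output(result.stdout.decode("euc_jp"))
```

Joining the returned readings with spaces would give the sentence in spaced kana. Characters that cannot be encoded in EUC-JP will raise an error at the encode step, so real input would need some handling for that.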

Nemo, March 2, 2010 at 2:03:07 AM UTC

There's a PowerPoint tutorial; I'll look at it when I have time and translate it. The translated user guide focuses on the whole idea behind the system and why and how it was developed, and then when it comes to the syntax, it's just a bunch of "I don't know this word" and "If you break this down, it would mean something like..."

TRANG, March 1, 2010 at 10:00:51 PM UTC

My ideal approach would be to use WWWJDIC indices, combined with better software for conversion into romaji or kana.

As for making romaji editable, if we were to make anything editable, I'd rather it be kana, like what sysko suggested.

If the purpose is to provide something useful for learning, then it's obviously better to have a sentence in kana, with spaces so that the learner knows how the sentence is composed. And of course we can use the sentence in kana to generate correct romaji.
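
As a very rough illustration of that last step, here is a sketch that turns a sentence in spaced hiragana into Hepburn-style romaji. It is deliberately simplified and not a complete romanizer: it covers only basic hiragana, the small ゃ/ゅ/ょ combinations and the small っ; it writes long vowels letter by letter (no macrons); it does not handle ん before a vowel; and it assumes that a word consisting only of は or へ is the particle (read "wa" or "e"), which relies on the spaces being placed correctly. All names are made up.

```python
# Basic hiragana rows mapped to Hepburn-style romaji.
ROWS = {
    "あいうえお": "a i u e o",
    "かきくけこ": "ka ki ku ke ko",
    "さしすせそ": "sa shi su se so",
    "たちつてと": "ta chi tsu te to",
    "なにぬねの": "na ni nu ne no",
    "はひふへほ": "ha hi fu he ho",
    "まみむめも": "ma mi mu me mo",
    "やゆよ": "ya yu yo",
    "らりるれろ": "ra ri ru re ro",
    "わをん": "wa o n",
    "がぎぐげご": "ga gi gu ge go",
    "ざじずぜぞ": "za ji zu ze zo",
    "だぢづでど": "da ji zu de do",
    "ばびぶべぼ": "ba bi bu be bo",
    "ぱぴぷぺぽ": "pa pi pu pe po",
}
KANA = {}
for kana_row, romaji_row in ROWS.items():
    KANA.update(zip(kana_row, romaji_row.split()))

SMALL_Y = {"ゃ": "a", "ゅ": "u", "ょ": "o"}
# Assumption: a standalone は or へ is the particle.
PARTICLES = {"は": "wa", "へ": "e"}


def word_to_romaji(word):
    """Romanize one space-delimited kana word."""
    if word in PARTICLES:
        return PARTICLES[word]
    out = []
    double_next = False
    i = 0
    while i < len(word):
        ch = word[i]
        if ch == "っ":
            double_next = True  # doubles the next consonant
            i += 1
            continue
        syllable = KANA.get(ch, ch)  # unknown characters pass through
        has_small_y = i + 1 < len(word) and word[i + 1] in SMALL_Y
        if has_small_y and len(syllable) > 1 and syllable.endswith("i"):
            stem, vowel = syllable[:-1], SMALL_Y[word[i + 1]]
            # shi/chi/ji combine without "y": sha, cho, ju, ...
            if stem in ("sh", "ch", "j"):
                syllable = stem + vowel
            else:
                syllable = stem + "y" + vowel
            i += 1
        if double_next:
            # Hepburn writes a doubled "ch" as "tch".
            prefix = "t" if syllable.startswith("ch") else syllable[:1]
            syllable = prefix + syllable
            double_next = False
        out.append(syllable)
        i += 1
    return "".join(out)


def kana_to_romaji(sentence):
    """Romanize a sentence written in hiragana with spaces between words."""
    return " ".join(word_to_romaji(word) for word in sentence.split())
```

For example, kana_to_romaji("きょう は がっこう に いきます") gives "kyou wa gakkou ni ikimasu". Since the kana are the only editable part, a correction to the kana automatically fixes the romaji too.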