Nemo's messages on the Wall (total 13)

Nemo, March 2, 2010 at 2:03:07 AM UTC

There's a PowerPoint tutorial; I'll look at it when I have time and translate it. The translated user guide focuses on the whole idea behind the system and on why and how it was developed. When it comes to the syntax, it's just a bunch of "I don't know this word" and "If you break this down, it would mean something like..."

Nemo, March 1, 2010 at 3:54:10 AM UTC

I've looked through a lot of contributions, and I've come to the realization that there are a LOT of contributions in English made by non-native speakers. I assume the same is the case for other languages, especially Japanese. There needs to be some sort of indicator for each sentence on whether or not the last editor was a native speaker. I've seen a lot of English sentences that are perfectly grammatical, with no errors at all, that I have never in my entire life heard someone utter -- correct or not, a native speaker would never say them.

Nemo, March 1, 2010 at 3:11:15 AM UTC

I should give a little more information than I have in my past posts, I think, because there seems to have been little progress. I don't really want to come off as harsh, but the reality is that KAKASI is a lost cause. Whoever coded the program did so in a very naive way, and using sed to correct its errors would take an inordinate amount of both human and CPU time; even in the best-case scenario, it would put so much load on the server as to make Tatoeba unusable. I've gotten the impression that KAKASI was chosen with little to no consideration of other options (see the blog post linked below), despite the fact that there are ways to accurately dissect Japanese text into parseable units, which could then be converted into romaji. The reality is that KAKASI is nowhere near mature enough to produce accurate results, and as an abandoned project there is little hope of it reaching that maturity: its output will never get any better than it is. In contrast, JUMAN seems to be near-perfect, though I will admit that I have not tried the other romanizers suggested in the blog post, nor have I done extensive testing of JUMAN. Regardless, JUMAN seems to be acceptable, even optimal. KAKASI falls so far short of the mark that I'm not sure why it is even in use. I would even go so far as to say that if KAKASI remains the method of conversion, then by the time Tatoeba becomes popular, Greasemonkey scripts will have been produced that correct the romaji by some other means, if that's even feasible. (Here's the blog post I referenced: http://blog.tatoeba.org/2009/02...anization.html )

Nemo, March 1, 2010 at 2:51:08 AM UTC

If JUMAN's kana/categorization output is accurate, 100% perfect romaji can be produced from it. The kana represent how something is said, and they come with a syntactic category for each word. There are ambiguities in kana, but JUMAN gives enough information that the pronunciation and syntax can be reconciled to provide a perfect phonetic romanization.

Nemo, February 23, 2010 at 1:23:11 AM UTC

I was kinda thinking a phpBB install or a Google Group would be effective. I'd limit access, though.

Nemo, February 21, 2010 at 8:33:34 PM UTC

We need post editing, haha. JUMAN does exactly what you need. It converts from kanji to hiragana, and labels each word with its part of speech. So, if it says は is a 助詞 (particle), you can output wa, and the same goes for all of the others. I'm not sure that it outputs romaji (the sample setup does not), but with kana and part of speech, romaji is just a lookup table away.
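
The "lookup table" idea can be sketched roughly as follows. This is a minimal illustration, not JUMAN's actual interface: the `(kana, part_of_speech)` token format, the function names, and the tiny kana table are all assumptions made for the example, and a real table would also need digraphs (きゃ), the sokuon (っ), and long vowels.

```python
# Hypothetical token format: (kana_reading, part_of_speech).
# This is a stand-in for "kana plus word class", not JUMAN's real output.

PARTICLE = "助詞"

# Deliberately tiny Hepburn-style table covering only the demo sentence.
KANA_TO_ROMAJI = {
    "わ": "wa", "た": "ta", "し": "shi", "ま": "ma", "ち": "chi",
    "い": "i", "き": "ki", "す": "su", "は": "ha", "へ": "he", "を": "wo",
}

# Kana whose pronunciation changes when the word is tagged as a particle.
PARTICLE_READINGS = {"は": "wa", "へ": "e", "を": "o"}


def romanize_token(kana, pos):
    """Romanize one token, using the part of speech to resolve は/へ/を."""
    if pos == PARTICLE and kana in PARTICLE_READINGS:
        return PARTICLE_READINGS[kana]
    return "".join(KANA_TO_ROMAJI[ch] for ch in kana)


def romanize(tokens):
    """Join romanized tokens with spaces, one word per token."""
    return " ".join(romanize_token(kana, pos) for kana, pos in tokens)


# 私は町へ行きます, already split into kana readings and word classes by hand.
tokens = [
    ("わたし", "名詞"), ("は", PARTICLE), ("まち", "名詞"),
    ("へ", PARTICLE), ("いきます", "動詞"),
]
print(romanize(tokens))               # watashi wa machi e ikimasu
print(romanize_token("はし", "名詞"))  # hashi (は is not a particle here)
```

The point of the part-of-speech check is that the same kana は comes out as "wa" only when it is tagged as a particle, which is exactly the information plain kana alone does not carry.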

Nemo, February 21, 2010 at 8:24:15 PM UTC

My whole post is a waste of time, lol. The software you are using has a kana output mode, which would not be subject to the pitfalls that romaji is. I suggest we use that. Kana is not that difficult to learn, and there's no sense in learning grammar/sentences before kana anyway.

Nemo, February 21, 2010 at 8:13:59 PM UTC

I'm for having all the romaji on the site be accurate. If the best way to do that is deleting all of the romaji, then I'd say do that. If you really want an accurate romaji representation, the converter will probably need to be written ad hoc. I don't think this should be too hard though, so long as it is written for this project specifically, and it is done soon. This site is currently made up mostly of the Tanaka Corpus, so far as I am aware, so almost every word in the Japanese examples should also be present in EDICT, which gives the kana reading of every word it contains. If there are multiple readings, I would just make the output something like:
僕は市場へ行った
*** boku wa (shijyou | ichiba) e itta

That way the edge cases could be fixed. It's still a lot of work, but it's doable. (In this case, the difference is irrelevant, but in many cases it could be relevant.) You could then dump every entry in the database that begins with *** into a text file. I believe EDICT even has the readings listed in order of frequency, so if you wanted to, you could have it just guess the first one every time, and fixing the few that got put in incorrectly would not be a huge ordeal. I would recommend keeping some automatic conversion in place, and storing things in the database as:
僕は市場へ行った
ぼくはしじょうへいった
and having the conversion from kana to romaji take place on the fly. Also, force those editing the romaji to use kana. Basically, introduce a learning curve that will discourage those who don't know better from thinking they do. Changes in the romanization scheme could also be implemented very easily this way. I personally use wapuro romaji whenever I do use romaji, which is still rare, but I know this is less than ideal for learning.
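
The `***` marking described in this post could look roughly like the sketch below. The candidate readings are written out by hand to match the example sentence; they are placeholders for whatever a dictionary lookup would return, not real EDICT queries, and the function name `render` and its `guess_first` flag are made up for illustration.

```python
def render(candidates, guess_first=False):
    """Render one sentence from per-word lists of candidate readings.

    candidates  -- list of lists; each inner list holds the possible
                   readings of one word, most likely reading first
                   (hypothetical data, standing in for dictionary lookups).
    guess_first -- if True, silently pick the first reading of every word
                   instead of flagging the sentence.
    """
    parts = []
    ambiguous = False
    for readings in candidates:
        if len(readings) == 1 or guess_first:
            parts.append(readings[0])
        else:
            ambiguous = True
            parts.append("(" + " | ".join(readings) + ")")
    line = " ".join(parts)
    # Prefix flagged sentences so they are easy to find in a dump.
    return "*** " + line if ambiguous else line


# 僕は市場へ行った, with the two readings of 市場 from the example above.
candidates = [["boku"], ["wa"], ["shijyou", "ichiba"], ["e"], ["itta"]]

print(render(candidates))                    # *** boku wa (shijyou | ichiba) e itta
print(render(candidates, guess_first=True))  # boku wa shijyou e itta
```

With flagged lines prefixed this way, collecting everything that still needs a human decision is a matter of filtering for lines that start with `***`.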

Nemo, February 21, 2010 at 5:13:45 PM UTC

Well, by "problem" I meant that the ruby can't be edited. So it does then? I understand that the mitigation of the problem is better for Chinese, but it's still read-only, correct? I also know that it's easier for Chinese, since the reading variations are more predictable, less common, and just plain easier to get right.

Nemo, February 21, 2010 at 5:04:49 PM UTC

My point is, not everyone knows Chinese/Japanese. In fact, the vast majority cannot read a single character. So when they arrive at a page, seeing "中文" at the top is of little or no help, nor is "日本語". It's not a problem for you, not a problem for me, but sit some random people down at the page and tell them to navigate it.

Nemo, February 21, 2010 at 4:29:00 PM UTC

"Also, what if a Japanese sentence is wrong? The romaji will be completely different, with added, subtracted, or moved words."

I wrote this before I discovered where the romaji was coming from, and forgot to go back and remove it before I posted. Oops.

Nemo, February 21, 2010 at 4:21:11 PM UTC

Something this thread brings up: the user interface language should be stored in a cookie, not as part of the URL. French is easy enough, but what if I click http://tatoeba.org/chi/sentences/show/366507/ and I'm new and/or don't read Chinese? I'm stuck and have to go back, close the window, or start over at the main page. A quick and dirty fix would be to change all of the language names to something like "English - English; Francais - French; zhongwen - Chinese; nihongo - Japanese" etc. (I'm not suggesting they be romanized, I just don't feel like dealing with my IME.) I'm not trying to be Anglocentric, but most people can decipher language names in English.
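
Reading the interface language from a cookie rather than from the URL might look something like this. It is only a sketch using Python's standard `http.cookies` module; the cookie name `ui_lang`, the set of language codes, and the English default are assumptions for the example and say nothing about how the site is actually built.

```python
from http.cookies import SimpleCookie

# Hypothetical values: the cookie name and the supported codes are
# illustrative only.
COOKIE_NAME = "ui_lang"
SUPPORTED = {"eng", "fre", "chi", "jpn"}
DEFAULT = "eng"


def ui_language(cookie_header):
    """Pick the interface language from a raw Cookie header.

    Falls back to DEFAULT when the cookie is missing or holds an
    unsupported value, so a shared link never forces a language
    on the person who clicks it.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get(COOKIE_NAME)
    if morsel is not None and morsel.value in SUPPORTED:
        return morsel.value
    return DEFAULT


print(ui_language("ui_lang=fre; session=abc"))  # fre
print(ui_language(""))                          # eng
print(ui_language("ui_lang=xx"))                # eng
```

Under this scheme a link such as /sentences/show/366507/ would carry no language at all, and each visitor would see the page in whatever language their own cookie says.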

Nemo, February 21, 2010 at 3:55:49 PM UTC

You say that as if we are now at 99.98% accuracy. If you intend this project to be a serious resource for people to learn from, please hide the romaji if it's not going to be fixable. Currently, much of it is an abomination of on/kun mixups and idiosyncratic spacing. Any beginners will just be confused by it, and anyone who knows how wrong it is will be frustrated by it. Also, what if a Japanese sentence is wrong? The romaji will be completely different, with added, subtracted, or moved words. If nothing else, add romaji as a "Language". Long-term, however, we're going to need some solution. I assume Chinese has the same issue, and Shanghainese likely does or will, too (I haven't seen any Shanghainese with ruby text yet). Eventually we'll be swimming in meta-languages. I'd say just nuke all romanization until it's done right. Also, doesn't WWWJDIC already have every example sentence broken down into kana? I've used it in the past, but it was far too long ago for me to remember.