Wall (6,341 threads)

Tips

Before asking a question, make sure to read the FAQ.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

contour, March 23, 2010 at 3:00:21 AM UTC

A proposal for the Japanese romanisation:
I think both romaji and kana readings should be shown on the site. While there's some agreement that serious students of the language should be reading kana, having the romaji makes the site more accessible for silly things like learning to say "I love you" in twenty languages.

As for how to generate it, I'd suggest using the B lines where possible. If there's no B line, or if the B line does not match the text (for instance because of names in the sentence), generate the reading with MeCab, which looks pretty solid.
This may require names and other unindexed items to be added to a B line if the romanisation needs to be corrected. Just the reading will do.

Does this sound reasonable?

I'm motivated to work on this if necessary, but it will probably be a little while before I have the time. First up would be to create the B line to reading converter, and then use it to test MeCab's accuracy on our data.

There are entries like '|1' in front of most parentheses that aren't described at http://www.edrdg.org/wiki/index.php/Tanaka_Corpus. I'm guessing they're indices for the reading?
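A first step toward the B line to reading converter mentioned above might be a tokenizer for index entries. The sketch below assumes the entry layout described on the Tanaka Corpus wiki page (headword, optional reading in parentheses, optional sense number in brackets, optional sentence form in braces, optional trailing ~) and also tolerates a '|n' marker in front of the parentheses. The function name and the sample line are hypothetical and hand-written for illustration.

```python
import re

# Assumed layout of one B-line entry: head[|n][(reading)][[sense]][{form}][~]
ENTRY_RE = re.compile(
    r"(?P<head>[^(\[{~|\s]+)"
    r"(?:\|(?P<marker>\d+))?"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
    r"(?P<checked>~)?"
)

def parse_b_line(line):
    """Split a B line on whitespace and parse each entry into a dict."""
    entries = []
    for token in line.split():
        match = ENTRY_RE.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised index entry: {token!r}")
        entries.append(match.groupdict())
    return entries

# Hand-written sample, not taken from the corpus.
sample = "彼(かれ)[01] は 忙しい 生活 を 送る{送っている}"
for entry in parse_b_line(sample):
    print(entry["head"], entry["reading"], entry["form"])
```

Entries without a reading in parentheses come back with `reading` set to `None`, so a converter would still need a dictionary lookup or MeCab for those.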

JimBreen, March 23, 2010 at 6:37:19 AM UTC

Much as I love those indices in the B-lines (I invented them years ago), I think it might be better to go straight to MeCab. There are some tricks you would need to apply, e.g. where MeCab says a particle is "助詞,格助詞" you would leave it with spaces around it, and where it is "助詞,接続助詞" you would attach it to the preceding word.

Those "|1" are an artifact of the days when Paul Blay was maintaining the indices in MS Access and needed a way of disambiguating some words. They are not carried through to the B lines in WWWJDIC (I didn't know what they were until Trang explained them to me.)

If you are using MeCab, use the IPADIC version.

The B-lines would be necessary if you were to create links to WWWJDIC, as MeCab breaks up expressions/compound nouns/etc.
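The particle rule described above could be sketched roughly as follows. This is only an illustration: it assumes each token is available as a (reading, feature) pair, where the feature string starts with the part-of-speech labels MeCab reports, and the sample tokens are hand-written rather than real MeCab output.

```python
def join_readings(tokens):
    """Join token readings, attaching conjunctive particles to the previous word.

    tokens: list of (reading, feature) pairs, where feature is the
    comma-separated part-of-speech string reported by MeCab.
    """
    words = []
    for reading, feature in tokens:
        if feature.startswith("助詞,接続助詞") and words:
            words[-1] += reading      # attach to the preceding word
        else:
            words.append(reading)     # case particles and other words stand alone
    return " ".join(words)

# Hand-written tokens for 本を読んで (illustrative, not real MeCab output).
tokens = [
    ("ホン", "名詞,一般"),
    ("ヲ", "助詞,格助詞"),
    ("ヨン", "動詞,自立"),
    ("デ", "助詞,接続助詞"),
]
print(join_readings(tokens))
```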

sysko, March 23, 2010 at 8:52:19 AM UTC

One of us has begun looking into replacing kakasi, but we have never used MeCab, so if you want, I can give you his email address so that he can ask you how to use and configure MeCab.

JimBreen, March 23, 2010 at 12:01:01 PM UTC

Sure. I have only used it as a command-line tool (and in shell scripts), but I see it has bindings for Python, Perl, Ruby and Java. I just installed it with apt-get (Debian). I use the ipadic (mecab-ipadic) rather than the default juman dictionary.

You need to make sure you get the utf-8 configuration. Mine is euc-jp.

It's simple to use: "echo 日本語の分節 | mecab".

Feel free to ask.
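For use from a script, the command's output can be parsed line by line. The sketch below assumes MeCab's default output format (surface form, a tab, then comma-separated features, with EOS closing each sentence) and that the IPADIC dictionary puts the katakana reading in the eighth feature field. The sample text is hand-written in that format, not captured from a real run, and the function name is hypothetical.

```python
def parse_mecab_output(text):
    """Return (surface, reading) pairs from MeCab's default output format."""
    pairs = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, feature = line.partition("\t")
        fields = feature.split(",")
        # IPADIC puts the katakana reading in the 8th field; unknown words may lack it.
        reading = fields[7] if len(fields) > 7 and fields[7] != "*" else surface
        pairs.append((surface, reading))
    return pairs

# Hand-written sample in the assumed format, not captured from a real run.
sample = (
    "日本語\t名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS\n"
)
print(parse_mecab_output(sample))
```

The input text could come from running the `mecab` command through Python's `subprocess` module, provided the UTF-8 configuration mentioned above is in place.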

sysko, March 27, 2010 at 4:59:56 PM UTC

The output seems to contain only katakana. Is there a way to get hiragana?

JimBreen, March 27, 2010 at 11:26:13 PM UTC

Only by doing your own conversion - those morphological analysis systems don't really care whether it's one or the other.

In EUC-JP and in raw Unicode the conversion is simple, e.g. あ is 3042 and ア is 30A2 and so on. It's a little more messy in UTF-8 but quite doable with a simple algorithm. Of course where it is katakana in the original text, you would leave it that way.
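A minimal sketch of that conversion, working on Unicode code points: the katakana block sits 0x60 above the hiragana block, so ア (U+30A2) maps to あ (U+3042). The function name is hypothetical, and the long-vowel mark ー is deliberately left untouched because it has no hiragana counterpart at that offset.

```python
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
OFFSET = 0x60            # distance between the katakana and hiragana blocks

def katakana_to_hiragana(text):
    """Shift katakana code points down to hiragana; leave everything else alone."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )

print(katakana_to_hiragana("ニホンゴ"))
```

To keep words that were katakana in the original sentence, a caller would apply this only to readings whose surface form is not already katakana.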

sysko, March 27, 2010 at 11:36:07 PM UTC

OK, that's what we did while waiting for your answer, so we will keep it :)
So it's highly probable that it will be included in the next release.

JimBreen, March 21, 2010 at 6:11:23 AM UTC

Traditional and Simplified Chinese

I saw the comment about converting hanzi on-the-fly. Be very cautious about that, as there are many cases where it simply doesn't work. Proper Traditional<->Simplified conversion needs to work at the lexeme level and in some cases needs some context for disambiguation.

Jack Halpern wrote a very good paper about this about 10 years ago:
http://www.cjk.org/cjk/c2c/c2cbasis.htm
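A small illustration of why a character table is not enough: Simplified 发 corresponds to Traditional 發 in 发现 (discover) but to 髮 in 头发 (hair). The sketch below uses tiny hand-made tables and a greedy longest-match lookup. It is only a toy under those assumptions, not the method described in the paper or the converter used on the site.

```python
# Tiny hand-made tables for illustration only; real conversion needs a full lexicon.
CHAR_MAP = {"头": "頭", "发": "發", "现": "現"}
WORD_MAP = {"头发": "頭髮", "发现": "發現"}

def convert_by_char(text):
    """Convert one character at a time, ignoring context."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

def convert_by_word(text):
    """Greedy longest match against WORD_MAP, falling back to CHAR_MAP."""
    out, i = [], 0
    max_len = max(len(w) for w in WORD_MAP)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += size
                break
        else:
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)

print(convert_by_char("头发"))  # the character table picks 發, which is wrong here
print(convert_by_word("头发"))
```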

PS: how do I make a comment on another posting?

JimBreen, March 21, 2010 at 6:40:28 AM UTC

OK, I worked out how to do a follow-on. I'd clicked "reply" but it hadn't worked. Now it does.

sysko, March 21, 2010 at 11:10:11 AM UTC

The traditional to simplified Chinese conversion is not done character by character; it tries to decompose the sentence first (you can see how the sentence has been segmented by looking at the pinyin).
As I've said, I'm in contact with the guy who develops it, so don't hesitate to report any bad segmentation and I will pass it on to him.

blay_paul, March 2, 2010 at 11:46:10 AM UTC

WWWJDIC index line.

I suggest adding links from words in the Japanese sentence to WWWJDIC entries using the information in the index line. That would be a useful 'first step' towards adding furigana to the sentence.

The basic set-up is relatively straightforward, but there is one complication - namely 'deliberately non-indexed text'. Punctuation, English words, place names and other proper nouns are not generally included in EDICT and so do not have entries in the Index line. Jim Breen should have a 'no index' field that includes all non-indexed text (although it may not be up to date). In order to parse a sentence properly you need both the index line and the non-indexed text.

Adding furigana to place names etc. should probably be left for later.

JimBreen, March 21, 2010 at 6:39:18 AM UTC

Word-by-word links based on the Japanese index words would be good, and not too hard to implement, I think.

At present I am pulling the sentences and indices into WWWJDIC once a week, and I put them through a utility which matches the text and the index contents, and reports if there is a mismatch (which usually means that someone has changed a sentence.) To get around the problem of "deliberately non-indexed text" I have a file of words which I ignore if they are not in the index. You can see this list of words at http://www.csse.monash.edu.au/~...amplestopwords (in EUC-JP). Most are names. Some look a bit odd as they are two or more names which had been separated by punctuation (which I ignore.)
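A check of this kind might look roughly like the sketch below. It is not the actual WWWJDIC utility: the function name, the argument shapes and the sample data are assumptions made for illustration. It walks through the sentence, skips punctuation, consumes index forms in order, and falls back to the list of ignored words.

```python
import unicodedata

def find_unindexed(sentence, index_forms, ignore_words):
    """Return (leftover characters, unused index forms) for one sentence.

    index_forms: surface forms from the index line, in sentence order.
    ignore_words: deliberately non-indexed words (names and so on).
    """
    leftovers = []
    pending = list(index_forms)
    pos = 0
    while pos < len(sentence):
        ch = sentence[pos]
        # Skip punctuation and whitespace, which the index does not cover.
        if unicodedata.category(ch)[0] in "PZ":
            pos += 1
            continue
        if pending and sentence.startswith(pending[0], pos):
            pos += len(pending.pop(0))
            continue
        ignored = next((w for w in ignore_words if sentence.startswith(w, pos)), None)
        if ignored:
            pos += len(ignored)
            continue
        leftovers.append(ch)
        pos += 1
    return leftovers, pending

# Hand-written example data.
print(find_unindexed("トムは学生です。", ["は", "学生", "です"], {"トム"}))
```

Both returned lists are empty when the text and the index agree; anything left in either one suggests the sentence was edited after it was indexed.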

blay_paul, March 20, 2010 at 7:20:37 PM UTC

Translation suggestions

There are now 100 translation suggestions waiting to be checked at
https://translations.launchpad..../ja/+translate

I would urge people who understand Japanese to check them and either confirm or correct them.

blay_paul, March 19, 2010 at 6:02:52 PM UTC

Source Code?

In order to better determine possible translations for
https://translations.launchpad..../ja/+translate
it would help if I could view the source code.

Also, some translation items require code reworking as well. e.g. "linked to" should probably be "linked to » %s" (where %s is the sentence number) so that the Japanese could be something like "%s とつながる".
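The placeholder idea can be shown in a few lines. This is a sketch in Python rather than the site's PHP, and the catalogue and function name are hypothetical; the point is only that the sentence number is substituted wherever each translation places %s.

```python
# Hypothetical message catalogue keyed by language code.
MESSAGES = {
    "en": "linked to » %s",
    "ja": "%s とつながる",
}

def linked_to(lang, sentence_id):
    """Format the message with the sentence number wherever the translation puts it."""
    return MESSAGES[lang] % sentence_id

print(linked_to("en", 1234))
print(linked_to("ja", 1234))
```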

TRANG, March 20, 2010 at 1:09:59 PM UTC

Normally the path of the file should be enough of a hint for you to figure out on which page of the website the string can be found.

If the path is something like /views/<something>/file.ctp, then you would usually (not always) need to go to http://tatoeba.org/<something>/file

If the path is /controllers/<something>_controller.php, then the string is a bit harder to find, but it can be found somewhere in the pages that start with http://tatoeba.org/<something>/
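The two rules of thumb above can be written as a small helper. This is only a sketch: the function name and the example path are hypothetical, and as noted the mapping does not always hold.

```python
import re

BASE_URL = "http://tatoeba.org"

def guess_page(path):
    """Guess where a string appears on the site from its source file path."""
    view = re.fullmatch(r"/views/([^/]+)/([^/]+)\.ctp", path)
    if view:
        return f"{BASE_URL}/{view.group(1)}/{view.group(2)}"
    controller = re.fullmatch(r"/controllers/([^/]+)_controller\.php", path)
    if controller:
        # Only the prefix is known: the string is on some page under it.
        return f"{BASE_URL}/{controller.group(1)}/"
    return None

# Hypothetical path, used only to show the mapping.
print(guess_page("/views/sentences/show.ctp"))
```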

I don't know how comfortable you are looking at source code, but it could be simpler if you just translated what you can first. We have a "test" version of Tatoeba where we test things before we update the "real" version of Tatoeba. As soon as you have your translations done (even partially), we can update the "test" version and you can then browse around in there to check if the translations fit or not. I'll give you the link in a private message.

Other than that, the source code can be found here:
http://subversion.assembla.com/...ba2/trunk/app/
Just note that the strings in Launchpad are not always exactly synchronized with the source code.

blay_paul, March 20, 2010 at 2:34:06 PM UTC

> I don't know how comfortable you are looking at source code

Reasonably. I'm familiar with Visual Basic, JavaScript and Visual Basic - PHP is like the bastard offspring of all of those.

Without looking at the code it's very difficult to correctly translate things like

<b>Share</b> your knowledge.

because they are handled as _two strings_ and the order needs to be reversed in Japanese.

<b>知識</b>を共有する
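The difference can be sketched as follows, in Python rather than the site's PHP. The tiny dictionary and the function names are hypothetical; the fragment translations are only there to show what happens when the pieces are glued together in English order.

```python
# Hypothetical Japanese catalogue containing both the fragments and the whole sentence.
JA = {
    "Share": "共有する",
    "your knowledge.": "知識を",
    "<b>Share</b> your knowledge.": "<b>知識</b>を共有する",
}

def translate(text):
    return JA.get(text, text)

def render_split():
    """Two separately translated fragments glued in English order."""
    return "<b>" + translate("Share") + "</b> " + translate("your knowledge.")

def render_whole():
    """One translatable string, so the translator controls order and emphasis."""
    return translate("<b>Share</b> your knowledge.")

print(render_split())
print(render_whole())
```

With the split version the verb ends up first and bolded, which is the ordering problem described above; the whole-string version lets the translation reorder freely.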

TRANG, March 20, 2010 at 2:40:22 PM UTC

Ah yes, forgot to mention, like you noted, there are some strings that we forgot to make more "compliant" for internationalization.

You can send me an email to list those you find. I'll fix it in the code and update the strings in Launchpad.

saeb, March 18, 2010 at 7:21:25 PM UTC

Question, which places more strain on the server: generating sentences using a keyword query or using the random sentence generator?

sysko, March 19, 2010 at 9:45:52 AM UTC

Random sentences for sure; MySQL doesn't like random at all ^^ We're working on making it faster.
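One common reason random selection is slow in SQL databases is a query such as ORDER BY RAND() LIMIT 1, which has to assign a random value to every row and sort them all. Whether the site used that query is not stated here, and the sketch below is not the fix the team adopted. It only illustrates one widely used alternative, picking a random id and taking the next existing row, using Python's built-in sqlite3 as a stand-in for MySQL. The table and column names are hypothetical.

```python
import random
import sqlite3

def random_sentence(conn):
    """Pick a random row by id instead of sorting the whole table randomly."""
    low, high = conn.execute("SELECT MIN(id), MAX(id) FROM sentences").fetchone()
    target = random.randint(low, high)
    # Gaps in the id sequence are handled by taking the next existing id.
    return conn.execute(
        "SELECT id, text FROM sentences WHERE id >= ? ORDER BY id LIMIT 1",
        (target,),
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [(1, "Hello."), (5, "Bonjour."), (9, "こんにちは。")],
)
print(random_sentence(conn))
```

The trade-off is that rows following a gap in the id sequence are picked more often than others, so the result is only approximately uniform.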

jaystarkey, March 15, 2010 at 9:58:23 PM UTC

Would be really cool if we could add audio someday to the example sentences ;-)

sysko, March 16, 2010 at 11:38:59 AM UTC

We plan to do so. You will have more details and maybe a proof of concept at the beginning of April :)

Swift, March 28, 2010 at 11:30:09 AM UTC

Looking forward to that!

Wolf, March 15, 2010 at 12:25:42 AM UTC

Did you change something with the database dump? This Saturday's jpn_indices contain invalid utf8 characters and the affected lines seem to be truncated.

The following sentence ids have problems: 83767, 91272, 140460, 146080, 152054, 190707, 195118, 199753, 205628, 211131, 213530, 235850

TRANG, March 15, 2010 at 6:46:14 PM UTC

Ah, indeed, indeed. I had changed the 'text' field from varchar to varbinary, but kept the length at 500. That's why those entries were truncated. I've fixed it and done a new export of the jpn_indices.
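The effect can be reproduced in a few lines, assuming the varchar length was counted in characters while the varbinary length is counted in bytes. Japanese text usually takes three bytes per character in UTF-8, so a 500-byte limit can cut a line in the middle of a character and leave invalid UTF-8 behind. The function names and the sample line are made up for illustration.

```python
def truncate_bytes(text, limit):
    """Cut the UTF-8 encoding of text at limit bytes, as a too-short binary column would."""
    return text.encode("utf-8")[:limit]

def is_valid_utf8(data):
    """Report whether the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

line = "あ" * 200              # 200 characters, 600 bytes in UTF-8
stored = truncate_bytes(line, 500)
print(len(line), len(line.encode("utf-8")), is_valid_utf8(stored))
```

A check like `is_valid_utf8` run over each line of an export is also one way to find affected sentence ids.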


> This Saturday's jpn_indices [...]

How do you know about that by the way? I don't remember making it official yet that the download files would be updated on Saturdays. (or did I? o.o)

saeb, March 13, 2010 at 5:34:18 PM UTC

1000+ sentences in Arabic.

I'd like 2 thank everyone that has ever thanked everyone. On behalf of all of us you've thanked I say thank u for thanking us.

lol Dane Cook is brilliant :D

MUIRIEL, March 14, 2010 at 11:51:42 AM UTC

:D thank you^^.

saeb, March 13, 2010 at 5:30:57 PM UTC

I believe in ghosts. I believe in aliens. But theres no way u will ever persuade me into believing in alien ghosts. Ridiculous.

I believe in the sentence method. I believe in language websites. But theres no way u will ever persuade me into believing in sentence websites. Ridiculous

yay! first tatoeba joke :P (hmm I wonder if I can consider this a wall abuse..)

saeb, March 13, 2010 at 5:49:21 PM UTC

TRANG says:

omg you're so funny, stop "abusing" the wall :D

TRANG, March 14, 2010 at 12:04:42 PM UTC

I should just mention I never said that :P
But I do think it. Well, especially the "abusing the wall" part, because now I'm working on figuring out how to paginate this wall. Certainly there will be more abuse.