Wall (6,206 threads)



blay_paul, April 23, 2010 at 9:25:22 PM UTC

Unique identification of a JMDICT entry.

This is technical stuff, only really of interest for people who want to deal with the 'index' field of Japanese sentences.

The obvious way to identify a dictionary entry as used in WWWJDIC + Edict is by the entry number (duh!). However that's not how it's done in the index field. Why? Because 2147630 is not 'human friendly' for whoever is creating and editing the index fields. (i.e. you can't look at 2147630 and know what word it refers to)

You could identify by the headword of the dictionary entry - あっという間に will only match one record. However there are around 3,000 dictionary entries where that will not be enough. 前(まえ) is not the same entry as 前(ぜん). So, for ambiguous kanji headwords, you include the reading of the word as well.

You've now reached the basic method used by Jim when he links WWWJDIC to example records. However that is not enough to identify 100% of the entries in JMdict uniquely. There are kana headwords like 'は' that are present in more than one entry.

はあ; は (int) (1) yes; indeed; well; (2) ha!; (3) what?; huh?; (4) sigh

は (prt) (1) (pronounced わ in modern Japanese) topic marker particle; (2) indicates contrast with another option (stated or unstated); (3) adds emphasis; (P)

Note that the second [EX] link in the はあ; は entry is actually for the particle は. There are a few like this that can only be fixed by re-doing the indexing system that Jim uses (in a way that would be more complicated and slower than he wants to deal with).

Now we come to the notation used in Tatoeba (and that I used 'at home' when maintaining the Tanaka example collection). Every headword that is not unique has the notation |1, |2, |3, etc. added.

So instead of 前(さき) 前(ぜん) 前(まえ)
you might have 前|1(さき) 前|2(ぜん) 前|3(まえ)

Some points to note:
* The numbers are assigned in order by JMDict entry ID. So 前|1(さき) (Entry 1387210) comes before 前|2(ぜん) (Entry 1392570) and 前|3(まえ) (Entry 1392580)
* Because I used Access (and Excel) some characters are treated as 'equivalent' by Microsoft that are not actually identical. So ヽ|1 ヾ|2 ゝ|3 ゞ|4 and 々|5 all need to be distinguished by the numerical notation.
* The order of headwords / readings is significant in JMDict - most common headwords are supposed to come first, least common last. When indexing the preference is to use the first headword / reading when possible.
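
The numbering rule in the first point can be sketched in a few lines of Python. This is only an illustration, assuming a hypothetical list of (entry ID, headword, reading) tuples; the function name number_headwords is made up here, and the sketch ignores the Access/Excel character-equivalence issue from the second point.

```python
from collections import defaultdict

def number_headwords(entries):
    """Map each entry ID to its index label.

    entries: iterable of (entry_id, headword, reading) tuples, where
    reading may be None for kana headwords. This input shape is an
    assumption for illustration, not the real JMdict format.
    """
    by_headword = defaultdict(list)
    for entry_id, headword, reading in entries:
        by_headword[headword].append((entry_id, reading))

    labels = {}
    for headword, group in by_headword.items():
        if len(group) == 1:
            # Unique headwords need no |n notation.
            labels[group[0][0]] = headword
            continue
        # Ambiguous headwords are numbered in order of entry ID.
        ordered = sorted(group, key=lambda item: item[0])
        for n, (entry_id, reading) in enumerate(ordered, start=1):
            label = f"{headword}|{n}"
            if reading is not None:
                label += f"({reading})"
            labels[entry_id] = label
    return labels

# The three entries for 前 quoted above, deliberately out of order.
labels = number_headwords([
    (1392580, "前", "まえ"),
    (1387210, "前", "さき"),
    (1392570, "前", "ぜん"),
])
print(labels[1387210], labels[1392570], labels[1392580])
```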

NOTE that the point of the index data is to uniquely identify a dictionary entry, NOT to reflect 100% accurately the dictionary form and reading of the word being indexed.

e.g. If you used 挙がります in a sentence the index would include
上がる{挙がります}
it would not include
挙がる{挙がります}

Apart from anything else this ensures that there is only one [EX] link from dictionary entries.

Finally the square brackets note the sense of the word
上がる[02] = Second sense of the word 上がる
and the curly brackets show the /exact match/ for the indexed word in the sentence.

僕は学校の成績が上がった。
僕|2(ぼく)[01] は|1 学校 乃{の} 成績 が 上がる{上がった}
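
For anyone who wants to process the index field programmatically, here is a minimal parsing sketch. It assumes each space-separated token has the parts in the order seen in the examples in this post: headword, optional |n, optional (reading), optional [sense], optional {form in sentence}. The names IndexToken and parse_index are made up for illustration.

```python
import re
from typing import List, NamedTuple, Optional

# One index token: headword, then optional |n, (reading), [sense], {form}.
TOKEN_RE = re.compile(
    r"^(?P<headword>[^|(\[{\s]+)"
    r"(?:\|(?P<number>\d+))?"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?$"
)

class IndexToken(NamedTuple):
    headword: str
    number: Optional[int]    # the |n disambiguation number
    reading: Optional[str]   # the (reading), if given
    sense: Optional[int]     # the [nn] sense number
    form: Optional[str]      # the {exact form} used in the sentence

def parse_index(index: str) -> List[IndexToken]:
    tokens = []
    for raw in index.split():
        match = TOKEN_RE.match(raw)
        if match is None:
            raise ValueError(f"unrecognised index token: {raw!r}")
        number = match.group("number")
        sense = match.group("sense")
        tokens.append(IndexToken(
            headword=match.group("headword"),
            number=int(number) if number is not None else None,
            reading=match.group("reading"),
            sense=int(sense) if sense is not None else None,
            form=match.group("form"),
        ))
    return tokens

for token in parse_index("僕|2(ぼく)[01] は|1 学校 乃{の} 成績 が 上がる{上がった}"):
    print(token)
```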

*Psst* Trang - maybe this should be on a page somewhere?

blay_paul, April 23, 2010 at 8:44:54 PM UTC

Busy signal

Just to note that I'm going to be really busy with fixing lots and lots of stuff with the Japanese Index field for the next few days.

If you left a comment on a sentence you'd like me to read, please PM me because I won't have time to deal with it now.

CK, April 22, 2010 at 3:58:04 PM UTC, edited December 3, 2018 at 11:18:33 PM UTC

[removed by CK, since this no longer applies]

blay_paul, April 22, 2010 at 4:17:43 PM UTC

Just looking at a Google search on
"と言った(の)も同然だ"
it looks like the version with の (the second one) should be got rid of.

CK, April 22, 2010 at 3:14:59 PM UTC, edited October 25, 2019 at 8:02:50 AM UTC

[not needed anymore- removed by CK]

JimBreen, April 22, 2010 at 3:22:44 PM UTC

Hmmm. "He adopted a war orphan and is bringing her up as a foster daughter."

Comment on the sentence itself and Francis will get a message.

blay_paul, April 18, 2010 at 3:48:07 PM UTC

Bump.

Examples.gz, how often is it updated?

On the Monash FTP Archive page it says:

[...] It is updated daily from the server site.
* examples.gz (8371412 bytes) the file.

But I think that is no longer accurate. The one I've just downloaded hasn't changed since a week ago. Also, could the ID numbers given in examples.gz be from the Japanese sentences, not the English sentences? The English ones aren't unique so the IDs are pretty much useless.

TRANG, April 18, 2010 at 5:06:17 PM UTC

You'll have to ask Jim about this, because I don't think anyone else here has the answer ^^
It may be faster to simply send him an email...

blay_paul, April 18, 2010 at 5:14:21 PM UTC

Could you make available a download with the stuff Jim uses for WWWJDIC? i.e.
a) All Japanese sentences that have an Index field linked. (With sentence number of Japanese sentences)
b) All English fields that are mentioned in the 'Meaning' field. (With sentence number of English sentences)
c) All index fields. (With 'Meaning' field).

TRANG, April 18, 2010 at 5:20:36 PM UTC

I haven't had time to update the "downloads" page yet, but the data file Jim uses can be downloaded here:

http://tatoeba.org/app/webroot/files/downloads/
(wwwjdic.csv)

The fields are:
jpn_sentence_id, eng_sentence_id, jpn_text, eng_text, jpn_index
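
As a rough sketch of how these five fields might be carried around in a script, assuming a row has already been split into five strings (the WwwjdicRow type and row_from_fields helper are made-up names for illustration, not part of any Tatoeba tooling):

```python
from typing import NamedTuple, Sequence

class WwwjdicRow(NamedTuple):
    jpn_sentence_id: int
    eng_sentence_id: int
    jpn_text: str
    eng_text: str
    jpn_index: str

def row_from_fields(fields: Sequence[str]) -> WwwjdicRow:
    """Build a typed row from the five column values, in file order."""
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    jpn_id, eng_id, jpn_text, eng_text, jpn_index = fields
    # The two ID columns are converted to integers; the rest stay as text.
    return WwwjdicRow(int(jpn_id), int(eng_id), jpn_text, eng_text, jpn_index)

row = row_from_fields([
    "4923",
    "1512",
    "「信用して」と彼は言った。",
    '"Trust me," he said.',
    "信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}",
])
print(row.jpn_sentence_id, row.eng_sentence_id, row.eng_text)
```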

blay_paul, April 19, 2010 at 8:57:30 PM UTC

Just one more question - how often do you update the files on the download page?

TRANG, April 20, 2010 at 9:46:02 AM UTC

On this page:
http://tatoeba.org/app/webroot/files/downloads/
Once a week. On Saturdays around 9AM France time.

On the download page that you can access from the link at the bottom, never. I have to update that page though, to link to the files in http://tatoeba.org/app/webroot/files/downloads/.

blay_paul, April 18, 2010 at 6:05:52 PM UTC

That should do nicely. Thanks.

The other thing is that I'd like to do a complete check and revamp of the index field. To be certain of not losing any data I'd need you to lock the index field so I can download / fix up / upload without any changes happening on your side.

I'm still working on things so I probably won't be ready for a week or so.

JimBreen, April 19, 2010 at 7:48:06 AM UTC

Can you keep me in the loop? I change the odd index when making a correction to a Japanese sentence. Also, when I do the weekly download I run it through a utility that checks that the index and sentence agree. That way I can detect when others have changed a sentence. I usually have to update the index and occasionally add to the list of names to be ignored (e.g. ムーリエル and 赤ずきん this week.)
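
A check of this kind could look roughly like the sketch below. This is not the actual utility, whose workings are not described here; it simply assumes that each index token's form in the sentence is the text in curly brackets, or the headword when there are none, and that the forms must occur in the sentence in order. The function names and the altered test sentence are made up for illustration.

```python
import re

# Captures the headword and, if present, the {form} at the end of a token.
SURFACE_RE = re.compile(r"^([^|(\[{\s]+)[^{]*(?:\{([^}]*)\})?$")

def surface_form(token: str) -> str:
    """Return the text this index token is expected to match in the sentence."""
    match = SURFACE_RE.match(token)
    if match is None:
        raise ValueError(f"unrecognised index token: {token!r}")
    return match.group(2) or match.group(1)

def index_matches_sentence(sentence: str, index: str) -> bool:
    """True if every indexed form occurs in the sentence, in index order."""
    position = 0
    for token in index.split():
        form = surface_form(token)
        found = sentence.find(form, position)
        if found == -1:
            return False
        position = found + len(form)
    return True

index = "僕|2(ぼく)[01] は|1 学校 乃{の} 成績 が 上がる{上がった}"
print(index_matches_sentence("僕は学校の成績が上がった。", index))
# A sentence edited after indexing (made-up example) no longer agrees.
print(index_matches_sentence("僕は学校の成績が下がった。", index))
```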

In addition, I have a list of words from Collin McCulley which had mismatches between the index and dictionary. I have cleaned up most of them, but still have ~100. We need some way of tracking when they get out of kilter, mainly when dictionary entries need qualifying.

TRANG, April 19, 2010 at 7:14:59 PM UTC

@Paul, yes, I can easily block the access to the indices to everyone.

JimBreen, April 19, 2010 at 7:38:17 AM UTC

I download the file Trang sets up once a week. I check it over then set it up in the WWWJDIC system. At that stage the "examples.gz" file, etc. is rebuilt.

JimBreen, April 19, 2010 at 7:59:36 AM UTC

I missed the bit about the ID numbers. I use the English sentence number because 90%+ of the corrections coming from WWWJDIC users are to the English sentence, so it makes sense for WWWJDIC to link there.

As discussed on another forum, I could put in both, e.g. #ID=375963_12345. I have a major change half done in WWWJDIC which is blocking other changes. Once it is clear (maybe a week or so) I can make that sentence number change. I'll enable WWWJDIC users to select whether they want to link to the Japanese or English.

blay_paul, April 18, 2010 at 9:22:46 PM UTC

Example export system.

Japanese sentences with multiple English translations are (sometimes?) being exported with both versions.

For example:

4924 73899 「これが探していたものだ」と彼は叫んだ。 "This is what I was looking for," he exclaimed. 此れ[01]{これ} が 探す{探していた} 物|1(もの)[01]{もの} だ と|1 彼|2(かれ)[01] は|1[01] 叫ぶ{叫んだ}
4924 1513 「これが探していたものだ」と彼は叫んだ。 "This is what I was looking for!" he exclaimed. 此れ[01]{これ} が 探す{探していた} 物|1(もの)[01]{もの} だ と|1 彼|2(かれ)[01] は|1[01] 叫ぶ{叫んだ}

On the sentence annotations page only 73899 is given as the 'meaning' field with the Index field. So either a) The meaning field in the sentence annotations page isn't used, or b) There can be two or more 'meaning' field values, but only one is shown on the sentence annotations page.
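
To find every case like this in the export, one could group the rows by Japanese sentence ID. The sketch below assumes the first two columns are the Japanese and English sentence IDs, as in the field list given for wwwjdic.csv; the function name is made up for illustration.

```python
from collections import defaultdict

def multiple_translations(pairs):
    """Return {jpn_id: [eng_id, ...]} for Japanese IDs exported more than once.

    pairs: iterable of (jpn_sentence_id, eng_sentence_id) tuples.
    """
    eng_by_jpn = defaultdict(list)
    for jpn_id, eng_id in pairs:
        eng_by_jpn[jpn_id].append(eng_id)
    return {jpn_id: ids for jpn_id, ids in eng_by_jpn.items() if len(ids) > 1}

# IDs taken from the rows quoted in this thread.
pairs = [(4923, 1512), (4924, 73899), (4924, 1513)]
print(multiple_translations(pairs))
```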

JimBreen, April 19, 2010 at 7:52:34 AM UTC

I've noticed that, and I have assumed it wasn't used. When I notice cases such as that one, I change the English so they become identical, and hope it will lead to the removal of one of them.

blay_paul, April 19, 2010 at 9:13:38 AM UTC

That's OK for WWWJDIC, and for cases where one is mistaken or they are both very similar.

It's not good 'Tatoeba practice' though.

JimBreen, April 19, 2010 at 9:27:20 AM UTC

In the case you quoted I thought it *was* the Tatoeba practice. They only differ by an exclamation mark.

Where they differ in more substantial ways, e.g. choice of personal pronoun where the Japanese has none, I guess there is a case for both being kept, and that's a situation where an index is best tied to a sentence-pair rather than just to the Japanese.

Tying to sentence pairs has another problem. The number of sentences in German and French is getting to the stage where it would be nice to include them in WWWJDIC. I'd be looking to have "examples_de" and "examples_fr" extracts along with indices.

blay_paul, April 19, 2010 at 9:38:16 AM UTC

> In the case you quoted I thought it *was* the Tatoeba
> practice. They only differ by an exclamation mark.

The case I quoted, yes. There are, however, at least a handful where both alternatives are valid, significantly different, and illustrate something about the English language.

blay_paul, April 18, 2010 at 8:22:37 PM UTC

.csv format in downloads.

Just a note for those using the downloads. You use \ as the escape character.

This is a line from your csv file:

"4923";"1512";"「信用して」と彼は言った。";"\"Trust me,\" he said.";"信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

This is how it appears when loaded into Excel.

4923 1512 「信用して」と彼は言った。 \Trust me,\" he said." 信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

Excel uses double " marks when escaping quotes. The same line in csv for Excel would be...

"4923";"1512";"「信用して」と彼は言った。";"""Trust me,"" he said.";"信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

Which imports to Excel as follows:

4923 1512 「信用して」と彼は言った。 "Trust me," he said. 信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

I think the 'escaping with extra quote mark' may be the more standard version ...
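
For anyone who needs to convert between the two styles, here is a minimal sketch using Python's standard csv module. It assumes the download is semicolon-delimited with every field quoted and backslash-escaped quotes, as in the sample line above; the function name is made up for illustration.

```python
import csv
import io

def backslash_to_doubled_quotes(text: str) -> str:
    """Rewrite backslash-escaped CSV text using doubled quotes instead."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=";",
        quotechar='"',
        escapechar="\\",    # read \" as a literal quote
        doublequote=False,  # input does not use "" for quotes
    )
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        delimiter=";",
        quotechar='"',
        quoting=csv.QUOTE_ALL,  # keep every field quoted, as in the download
        lineterminator="\n",
    )
    # The writer's default (doublequote=True) emits "" for embedded quotes.
    writer.writerows(reader)
    return out.getvalue()

line = r'"4923";"1512";"「信用して」と彼は言った。";"\"Trust me,\" he said."' + "\n"
print(backslash_to_doubled_quotes(line), end="")
```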

kellenparker, April 18, 2010 at 3:29:03 PM UTC

Right. So. I'm not 13 years old. It was an honest mistake. Here's the problem:

I was wondering if Tatoeba had any sort of resistance to profanity. I thought something like "damnit" would be a common enough thing. So I MEANT to SEARCH for "damn". Turns out I added it as a sentence instead. Same for "fuck" because it took me two tries to realise I was using the wrong text box.

So those can be deleted outright. I didn't see any way to do it so I abandoned the sentences instead in case there is such a way and someone else wants to adopt them to get them deleted. 380292 and 380290.

Apologies.

TRANG, April 18, 2010 at 3:40:42 PM UTC

Hahaha, it's fine. It's also our mistake; it means we need to change the form to make it clearer that it adds a new sentence.

I will delete your entries. There's currently no way for users to delete sentences, only admins can. The only solution when you want to "delete" a sentence is to replace it by a sentence that you actually want to keep.

As for profanity, we don't have anything against it, but we'd rather avoid it until we set up a mechanism to filter out sentences that are "not safe" for kids.

kellenparker, April 18, 2010 at 3:44:33 PM UTC

Good to know. And actually it's still pretty much entirely my fault. I searched at the top but then I think I stopped paying attention so when it sent me to the page saying "Nope, but you can add a sentence:", I thought I was searching again. It's not the form's issue. It's my attention span's issue.

sysko, April 18, 2010 at 4:48:26 PM UTC

And as for profanities, we have some "colorful" sentences (spoiler: "search XXX in the search engine")

sysko, April 18, 2010 at 2:39:06 PM UTC

Stemming should be working again for most languages when using the search engine,
i.e. searching for "think" should also return "thinking", "thought", etc. The same goes for French / Spanish / Italian / Russian etc.

By the way, it will not work with Ukrainian, but I was wondering if using the Russian stemmer would produce a "better than nothing" result? Demetrius, Dorenda?
Still looking for Arabic and Georgian stemmers.

Dorenda, April 18, 2010 at 3:02:30 PM UTC

Probably it will. But maybe there is a way to adapt the Russian stemmer into a Ukrainian one (or at least something more fit to Ukrainian)? I have no idea how those things work or how much work it would be, but if it's feasible, I could help with that.

sysko, April 18, 2010 at 3:19:34 PM UTC

How the stemmer works for Russian is explained in general terms here: http://snowball.tartarus.org/al...n/stemmer.html
I admit I haven't read it entirely, as I know no Russian (and moreover they provide something which works out of the box for this).

So I dunno how "easy" it is to adapt this to Ukrainian.

Dorenda, April 18, 2010 at 5:08:13 PM UTC

It looks doable. I'd just have to adapt it to the Ukrainian alphabet, change the endings into their Ukrainian counterparts and add/remove some endings that either of the two languages doesn't have.

So I'd have to just change that piece of script on the blue background, right?

sysko, April 18, 2010 at 10:55:00 PM UTC

yep this one http://snowball.tartarus.org/al...em_Unicode.sbl to be more precise :) thanks

Dorenda, April 24, 2010 at 3:55:48 PM UTC

Okay, I adapted it. The results won't always be right, though, cause sometimes it's just not possible to see from the form of a word what type of word it is and thus what belongs to the ending. For example, "koromyslo" is a noun, so only "o" should be removed, but the script will think it's a past tense verb and remove "lo". I tried to choose the least bad options...
Anyway, is there some way to test it? And where should I send it?

And one more question. How can I make the thing also remove the superlative prefix '{n}{a}{i'}' from the beginning of words?
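
The "koromyslo" problem above comes from stripping endings by form alone. A toy sketch shows the effect; note that this is not the Snowball algorithm or the adapted script, just a longest-ending-first stripper with a made-up two-item ending list in Latin transliteration.

```python
def strip_ending(word, endings):
    """Remove the longest listed ending that the word finishes with."""
    for ending in sorted(endings, key=len, reverse=True):
        # Never strip the whole word away.
        if word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word

# With both a noun ending "o" and a past-tense ending "lo" in the list,
# the longer one wins, so the noun loses too much.
print(strip_ending("koromyslo", ["o", "lo"]))
# With only the noun ending available, just "o" is removed.
print(strip_ending("koromyslo", ["o"]))
```

Without knowing the part of speech, a purely form-based rule has no way to tell which of the two results is wanted.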

sysko, April 24, 2010 at 4:43:31 PM UTC

Send the file to our email address team [at] tatoeba [dot] org, and I will see how to integrate it.
To be honest, I don't really know how it works (A), but at least I will contact the guys from this project to see what we can do :)
But it's already great that you have adapted it to Ukrainian.

saeb, April 17, 2010 at 8:48:46 PM UTC

Congrats on the new server! I can already feel the site is 100x faster. oh and I'm in love with the new inbox, great update!...now are we cool or are we cool :)

Dorenda, April 17, 2010 at 11:05:49 PM UTC

You're cool. :)
It's so much faster, great! :D
(And I just loved that note we got while the site didn't work. :))