Wall - Tatoeba

Wall (threads)

Tips

Before asking a question, please search the FAQ for it.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

blay_paul, April 23, 2010 at 9:29:31 PM UTC

Work in progress

There are currently lots of index records that need altering because words have been added / removed / changed in Edict.

For example:
* Headwords that were unique may no longer be unique (so readings need to be added).
* Entries that were once two separate dictionary records may have been merged into one (so the indexing needs to be changed to use one keyword only as well).
* Adding, removing and merging words may leave the |1, |2, ... notations needing an update.
* There is also a great deal of general checking to do.

JimBreen, April 24, 2010 at 12:46:43 AM UTC

The first two of these can be checked automatically, although it's not a small task. Re the "|1, |2, ... notations", are they still needed?

blay_paul, April 24, 2010 at 3:46:56 AM UTC

> The first two of these can be checked automatically,
> although it's not a small task.

Probably best not done automatically. Checking like that always brings up a bunch of things to check/correct in Edict. (c.f. today's, and yesterday's submissions ;-)

It also familiarises me with the words/readings in use in the examples (all 28,000-odd of them :-P)

> Re the "|1, |2, ... notations", are they still needed?
I still need them (or something equivalent). For sanity checking, as much as anything. Perfectionists (and people using Microsoft products) still need them. (No comments about the intersect being a null set, please)

blay_paul, April 23, 2010 at 9:25:22 PM UTC

Unique identification of a JMdict entry.

This is technical stuff, only really of interest for people who want to deal with the 'index' field of Japanese sentences.

The obvious way to identify a dictionary entry as used in WWWJDIC + Edict is by the entry number (duh!). However that's not how it's done in the index field. Why? Because 2147630 is not 'human friendly' for whoever is creating and editing the index fields. (i.e. you can't look at 2147630 and know what word it refers to)

You could identify by the headword of the dictionary entry - あっという間に will only match one record. However there are around 3,000 dictionary entries where that will not be enough. 前(まえ) is not the same entry as 前(ぜん). So, for ambiguous kanji headwords, you include the reading of the word as well.

You've now reached the basic method used by Jim when he links WWWJDIC to example records. However that is not enough to identify 100% of the entries in JMdict uniquely. There are kana headwords like 'は' that are present in more than one entry.

はあ; は (int) (1) yes; indeed; well; (2) ha!; (3) what?; huh?; (4) sigh

は (prt) (1) (pronounced わ in modern Japanese) topic marker particle; (2) indicates contrast with another option (stated or unstated); (3) adds emphasis; (P)

Note that the second [EX] link in the はあ; は entry is actually for the particle は. There are a few like this that can only be fixed by re-doing the indexing system that Jim uses (in a way that would be more complicated and slower than he wants to deal with).

Now we come to the notation used in Tatoeba (and that I used 'at home' when maintaining the Tanaka example collection). Every headword that is not unique has the notation |1, |2, |3, ... added.

So instead of 前(さき) 前(ぜん) 前(まえ)
you might have 前|1(さき) 前|2(ぜん) 前|3(まえ)

Some points to note:
* The numbers are assigned in order by JMdict entry ID. So 前|1(さき) (Entry 1387210) comes before 前|2(ぜん) (Entry 1392570) and 前|3(まえ) (Entry 1392580).
* Because I used Access (and Excel), some characters are treated as 'equivalent' by Microsoft that are not actually identical. So ヽ|1 ヾ|2 ゝ|3 ゞ|4 and 々|5 all need to be distinguished by the numerical notation.
* The order of headwords / readings is significant in JMdict - most common headwords are supposed to come first, least common last. When indexing, the preference is to use the first headword / reading when possible.
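
The numbering rule in the first point can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the tool actually used: `number_headwords` is a hypothetical helper, each entry is reduced to a single (entry ID, headword, reading) triple, the Microsoft 'equivalent character' issue is ignored, and the ID 1000000 is made up.

```python
from collections import defaultdict

def number_headwords(entries):
    """Map entry ID -> index label, adding |1, |2, ... only to shared headwords.

    entries: iterable of (entry_id, headword, reading) triples (assumed shape).
    """
    by_headword = defaultdict(list)
    for entry_id, headword, reading in entries:
        by_headword[headword].append((entry_id, reading))

    labels = {}
    for headword, group in by_headword.items():
        if len(group) == 1:
            # Unique headword: no number and no reading needed.
            labels[group[0][0]] = headword
            continue
        # Shared headword: number in ascending order of entry ID.
        for n, (entry_id, reading) in enumerate(sorted(group), start=1):
            suffix = f"({reading})" if reading else ""
            labels[entry_id] = f"{headword}|{n}{suffix}"
    return labels

entries = [
    (1392580, "前", "まえ"),
    (1387210, "前", "さき"),
    (1392570, "前", "ぜん"),
    (1000000, "あっという間に", "あっというまに"),  # hypothetical entry ID
]
labels = number_headwords(entries)
print(labels[1387210], labels[1392570], labels[1392580])
```

Sorting by entry ID is what keeps the numbers stable until entries are added, removed or merged, which is exactly when the notations need rechecking.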

NOTE that the point of the index data is to uniquely identify a dictionary entry, NOT to reflect 100% accurately the dictionary form and reading of the word being indexed.

e.g. If you used 挙がります in a sentence the index would include
上がる{挙がります}
it would not include
挙がる{挙がります}

Apart from anything else this ensures that there is only one [EX] link from dictionary entries.

Finally the square brackets note the sense of the word
上がる[02] = Second sense of the word 上がる
and the curly brackets show the /exact match/ for the indexed word in the sentence.

僕は学校の成績が上がった。
僕|2(ぼく)[01] は|1 学校乃{の} 成績が上がる{上がった}
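
As a rough illustration of the notation, a single index token could be taken apart like this. It is a sketch only: `parse_index_token` and the regular expression are my own assumptions based on the examples in this post (headword, then optional |n, (reading), [sense] and {form}, in that order), and it deliberately handles one token at a time rather than splitting a whole index field.

```python
import re

# Assumed token layout: headword |number (reading) [sense] {form}
TOKEN_RE = re.compile(
    r"(?P<headword>[^|(\[{]+)"
    r"(?:\|(?P<number>\d+))?"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<form>[^}]*)\})?"
)

def parse_index_token(token):
    """Split one index token into its parts; absent parts are None."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"unrecognised index token: {token!r}")
    parts = match.groupdict()
    # Convert the numeric parts so that "[01]" becomes sense 1.
    for key in ("number", "sense"):
        if parts[key] is not None:
            parts[key] = int(parts[key])
    return parts

print(parse_index_token("僕|2(ぼく)[01]"))
print(parse_index_token("上がる{上がった}"))
```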

*Psst* Trang - maybe this should be on a page somewhere?

blay_paul, April 23, 2010 at 8:44:54 PM UTC

Busy signal

Just to note that I'm going to be really busy with fixing lots and lots of stuff with the Japanese Index field for the next few days.

If you left a comment on a sentence you'd like me to read, please PM me because I won't have time to deal with it now.

CK, April 22, 2010 at 3:58:04 PM UTC, edited December 3, 2018 at 11:18:33 PM UTC

[removed by CK, since this no longer applies]

blay_paul, April 22, 2010 at 4:17:43 PM UTC

Just looking at a Google search on
"と言った(の)も同然だ"
it looks like the version with の (the second one) should be got rid of.

CK, April 22, 2010 at 3:14:59 PM UTC, edited October 25, 2019 at 8:02:50 AM UTC

[not needed anymore- removed by CK]

JimBreen, April 22, 2010 at 3:22:44 PM UTC

Hmmm. "He adopted a war orphan and is bringing her up as a foster daughter."

Comment on the sentence itself and Francis will get a message.

blay_paul, April 18, 2010 at 3:48:07 PM UTC

Bump.

Examples.gz, how often is it updated?

On the Monash FTP Archive page it says:

[...] It is updated daily from the server site.
* examples.gz (8371412 bytes) the file.

But I think that is no longer accurate. The one I've just downloaded hasn't been updated from a week ago. Also could the ID numbers given in examples.gz be from the Japanese sentences, not the English sentences? The English ones aren't unique so the IDs are pretty much useless.

TRANG, April 18, 2010 at 5:06:17 PM UTC

You'll have to ask Jim about this, because I don't think anyone else here has the answer ^^
It may be faster to simply send him an email...

blay_paul, April 18, 2010 at 5:14:21 PM UTC

Could you make available a download with the stuff Jim uses for WWWJDIC? i.e.
a) All Japanese sentences that have an Index field linked. (With sentence number of Japanese sentences)
b) All English fields that are mentioned in the 'Meaning' field. (With sentence number of English sentences)
c) All index fields. (With 'Meaning' field).

TRANG, April 18, 2010 at 5:20:36 PM UTC

I haven't had time to update the "downloads" page yet, but the data file Jim uses can be downloaded here:

http://tatoeba.org/app/webroot/files/downloads/
(wwwjdic.csv)

The fields are:
jpn_sentence_id, eng_sentence_id, jpn_text, eng_text, jpn_index

blay_paul, April 19, 2010 at 8:57:30 PM UTC

Just one more question - how often do you update the files on the download page?

TRANG, April 20, 2010 at 9:46:02 AM UTC

On this page:
http://tatoeba.org/app/webroot/files/downloads/
Once a week. On Saturdays around 9AM France time.

On the download page that you can access from the link at the bottom, never. I have to update that page though, to link to the files in http://tatoeba.org/app/webroot/files/downloads/.

blay_paul, April 18, 2010 at 6:05:52 PM UTC

That should do nicely. Thanks.

The other thing is that I'd like to do a complete check and revamp of the index field. To be certain of not losing any data I'd need you to lock the index field so I can download / fix up / upload without any changes happening on your side.

I'm still working on things so I probably won't be ready for a week or so.

JimBreen, April 19, 2010 at 7:48:06 AM UTC

Can you keep me in the loop? I change the odd index when making a correction to a Japanese sentence. Also, when I do the weekly download I run it through a utility that checks that the index and sentence agree. That way I can detect when others have changed a sentence. I usually have to update the index and occasionally add to the list of names to be ignored (e.g. ムーリエル and 赤ずきん this week.)
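
To give an idea of what such an agreement check might look like, here is a minimal sketch. It is not the actual utility, whose workings are not described here: `find_mismatches` is a hypothetical function, the index is assumed to be already split into (headword, form) pairs, and the only test is that each word's surface form (the {form} if present, otherwise the headword) occurs in the sentence in order.

```python
def find_mismatches(sentence, tokens):
    """Return the surface forms from the index that are not found, in order, in the sentence.

    tokens: list of (headword, form) pairs; form is None when the headword
    itself is the exact match (assumed input shape).
    """
    position = 0
    missing = []
    for headword, form in tokens:
        surface = form if form is not None else headword
        found = sentence.find(surface, position)
        if found == -1:
            missing.append(surface)
        else:
            # Continue searching after this word so that order is respected.
            position = found + len(surface)
    return missing

# Index for 僕は学校の成績が上がった。, split by hand into words for this example.
tokens = [("僕", None), ("は", None), ("学校", None), ("乃", "の"),
          ("成績", None), ("が", None), ("上がる", "上がった")]

print(find_mismatches("僕は学校の成績が上がった。", tokens))
print(find_mismatches("僕は学校の成績が下がった。", tokens))
```

A non-empty result would flag a sentence that someone has edited without updating its index. A list of names to be ignored could be handled by skipping those words before the check.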

In addition, I have a list of words from Collin McCulley which had mismatches between the index and dictionary. I have cleaned up most of them, but still have ~100. We need some way of tracking when they get out of kilter, mainly when dictionary entries need qualifying.

TRANG, April 19, 2010 at 7:14:59 PM UTC

@Paul, yes, I can easily block access to the indices for everyone.

JimBreen, April 19, 2010 at 7:38:17 AM UTC

I download the file Trang sets up once a week. I check it over then set it up in the WWWJDIC system. At that stage the "examples.gz" file, etc. is rebuilt.

JimBreen, April 19, 2010 at 7:59:36 AM UTC

I missed the bit about the ID numbers. I use the English sentence number because 90%+ of the corrections coming from WWWJDIC users are to the English sentence, so it makes sense for WWWJDIC to link there.

As discussed on another forum, I could put in both - e.g. #ID=375963_12345. I have a major change half done in WWWJDIC which is blocking other changes. Once it is clear (maybe a week or so) I can make that sentence number change. I'll enable WWWJDIC users to select whether they want to link to the Japanese or English.

blay_paul, April 18, 2010 at 9:22:46 PM UTC

Example export system.

Japanese sentences with multiple English translations are (sometimes?) being exported with both versions.

For example:

4924 73899 「これが探していたものだ」と彼は叫んだ。 "This is what I was looking for," he exclaimed. 此れ[01]{これ} が探す{探していた} 物|1(もの)[01]{もの} だと|1 彼|2(かれ)[01] は|1[01] 叫ぶ{叫んだ}
4924 1513 「これが探していたものだ」と彼は叫んだ。 "This is what I was looking for!" he exclaimed. 此れ[01]{これ} が探す{探していた} 物|1(もの)[01]{もの} だと|1 彼|2(かれ)[01] は|1[01] 叫ぶ{叫んだ}

On the sentence annotations page only 73899 is given as the 'meaning' field with the Index field. So either a) The meaning field in the sentence annotations page isn't used, or b) There can be two or more 'meaning' field values, but only one is shown on the sentence annotations page.
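
For anyone wanting to see how widespread this is, a quick way to list the affected sentences is to group the export by Japanese sentence ID. The sketch below assumes each row starts with the Japanese ID followed by the English ID, as in the two lines above; `multiple_translations` is a hypothetical helper name.

```python
from collections import defaultdict

def multiple_translations(rows):
    """Return {jpn_id: sorted English IDs} for Japanese sentences exported more than once."""
    english_by_japanese = defaultdict(set)
    for jpn_id, eng_id, *_rest in rows:
        english_by_japanese[jpn_id].add(eng_id)
    return {
        jpn_id: sorted(eng_ids)
        for jpn_id, eng_ids in english_by_japanese.items()
        if len(eng_ids) > 1
    }

# Only the ID columns are needed; any further columns in a row are ignored.
rows = [(4924, 73899), (4924, 1513), (4923, 1512)]
duplicates = multiple_translations(rows)
print(duplicates)
```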

JimBreen, April 19, 2010 at 7:52:34 AM UTC

I've noticed that, and I have assumed it wasn't used. When I notice cases such as that one, I change the English so they become identical, and hope it will lead to the removal of one of them.

blay_paul, April 19, 2010 at 9:13:38 AM UTC

That's OK for WWWJDIC, and for cases where one is mistaken or they are both very similar.

It's not good 'Tatoeba practice' though.

JimBreen, April 19, 2010 at 9:27:20 AM UTC

In the case you quoted I thought it *was* the Tatoeba practice. They only differ by an exclamation mark.

Where they differ in more substantial ways, e.g. choice of personal pronoun where the Japanese has none, I guess there is a case for both being kept, and that's a situation where an index is best tied to a sentence-pair rather than just to the Japanese.

Tying to sentence pairs has another problem. The number of sentences in German and French is getting to the stage where it would be nice to include them in WWWJDIC. I'd be looking to having "examples_de" and "examples_fr" extracts along with indices.

blay_paul, April 19, 2010 at 9:38:16 AM UTC

> In the case you quoted I thought it *was* the Tatoeba
> practice. They only differ by an exclamation mark.

The case I quoted, yes. There are, however, at least a handful where both alternatives are valid, significantly different, and illustrate something about the English language.

blay_paul, April 18, 2010 at 8:22:37 PM UTC

.csv format in downloads.

Just a note for those using the downloads. You use \ as the escape character.

This is a line from your csv file:

"4923";"1512";"「信用して」と彼は言った。";"\"Trust me,\" he said.";"信用為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

This is how it appears when loaded into Excel.

4923 1512 「信用して」と彼は言った。 \Trust me,\" he said." 信用為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

Excel uses double " marks when escaping quotes. The same line in csv for Excel would be...

"4923";"1512";"「信用して」と彼は言った。";"""Trust me,"" he said.";"信用為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

Which imports to Excel as follows:

4923 1512 「信用して」と彼は言った。 "Trust me," he said. 信用為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

I think the 'escaping with extra quote mark' may be the more standard version ...
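
Until the export changes, the file can be converted from one convention to the other with a short script. This is a sketch using Python's standard `csv` module; the function name `backslash_to_doubled_quotes` is made up, and the delimiter and quoting settings are inferred from the sample line above rather than from any documentation of the export.

```python
import csv
import io

def backslash_to_doubled_quotes(text):
    """Re-write semicolon-separated CSV from backslash-escaped quotes to doubled quotes."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=";", quotechar='"',
        escapechar="\\", doublequote=False,  # input style: \" inside a field
    )
    output = io.StringIO()
    writer = csv.writer(
        output,
        delimiter=";", quotechar='"',
        quoting=csv.QUOTE_ALL, doublequote=True,  # output style: "" inside a field
        lineterminator="\n",
    )
    for row in reader:
        writer.writerow(row)
    return output.getvalue()

line = r'"4923";"1512";"「信用して」と彼は言った。";"\"Trust me,\" he said.";"信用為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"'
converted = backslash_to_doubled_quotes(line)
print(converted)
```

Reading with `escapechar="\\"` and `doublequote=False` is also enough on its own if the data only needs to be loaded into Python rather than into Excel.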

kellenparker, April 18, 2010 at 3:29:03 PM UTC

Right. So. I'm not 13 years old. It was an honest mistake. Here's the problem:

I was wondering if Tatoeba had any sort of resistance to profanity. I thought something like "damnit" would be a common enough thing. So I MEANT to SEARCH for "damn". Turns out I added it as a sentence instead. Same for "fuck" because it took me two tries to realise I was using the wrong text box.

So those can be deleted outright. I didn't see any way to do it so I abandoned the sentences instead in case there is such a way and someone else wants to adopt them to get them deleted. 380292 and 380290.

Apologies.

TRANG, April 18, 2010 at 3:40:42 PM UTC

Hahaha, it's fine. It's also our mistake: it means we need to change the form to make it clearer that it adds a new sentence.

I will delete your entries. There's currently no way for users to delete sentences, only admins can. The only solution when you want to "delete" a sentence is to replace it by a sentence that you actually want to keep.

As for profanity, we don't have anything against it, but we'd rather avoid it until we set up a mechanism to filter out sentences that are "not safe" for kids.

kellenparker, April 18, 2010 at 3:44:33 PM UTC

Good to know. And actually it's still pretty much entirely my fault. I searched at the top but then I think I stopped paying attention so when it sent me to the page saying "Nope, but you can add a sentence:", I thought I was searching again. It's not the form's issue. It's my attention span's issue.

sysko, April 18, 2010 at 4:48:26 PM UTC

And as for profanities, we have some "colorful" sentences (spoiler: "search XXX in the search engine")

sysko, April 18, 2010 at 2:39:06 PM UTC

Stemming should be working again for most languages when using the search engine,
i.e. searching for "think" should also return "thinking", "thought", etc. The same goes for French / Spanish / Italian / Russian etc.

By the way, it will not work with Ukrainian, but I was wondering whether using the Russian stemmer would produce a "better than nothing" result? Demetrius, Dorenda?
Still looking for Arabic and Georgian stemmers.

Dorenda, April 18, 2010 at 3:02:30 PM UTC

Probably it will. But maybe there is a way to adapt the Russian stemmer into a Ukrainian one (or at least something more fit to Ukrainian)? I have no idea how those things work or how much work it would be, but if it's feasible, I could help with that.

sysko, April 18, 2010 at 3:19:34 PM UTC

How the stemmer works for Russian overall is explained here: http://snowball.tartarus.org/al...n/stemmer.html
I admit I haven't read it entirely, as I don't know any Russian (and moreover they provide something which works out of the box for this).

So I dunno how "easy" it is to adapt this to Ukrainian.

Dorenda, April 18, 2010 at 5:08:13 PM UTC

It looks doable. I'd just have to adapt it to the Ukrainian alphabet, change the endings into their Ukrainian counterparts and add/remove some endings that either of the two languages doesn't have.

So I'd have to just change that piece of script on the blue background, right?

sysko, April 18, 2010 at 10:55:00 PM UTC

yep this one http://snowball.tartarus.org/al...em_Unicode.sbl to be more precise :) thanks

Dorenda, April 24, 2010 at 3:55:48 PM UTC

Okay, I adapted it. The results won't always be right, though, cause sometimes it's just not possible to see from the form of a word what type of word it is and thus what belongs to the ending. For example, "koromyslo" is a noun, so only "o" should be removed, but the script will think it's a past tense verb and remove "lo". I tried to choose the least bad options...
Anyway, is there some way to test it? And where should I send it?

And one more question. How can I make the thing also remove the superlative prefix '{n}{a}{i'}' from the beginning of words?
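
The "koromyslo" ambiguity described in this post can be reproduced with a toy suffix stripper. This is plain Python, not Snowball and not the adapted script: the ending lists are a tiny, transliterated, hypothetical subset, and `toy_stem` only shows why a rule order that tries verb endings first will cut too much from some nouns.

```python
# Hypothetical, transliterated endings; a real stemmer has many more.
PAST_TENSE_ENDINGS = ("lo", "la", "ly")
NOUN_ENDINGS = ("o", "a")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching ending, trying past-tense verb endings before noun endings."""
    for ending in PAST_TENSE_ENDINGS + NOUN_ENDINGS:
        stem = word[: len(word) - len(ending)]
        if word.endswith(ending) and len(stem) >= min_stem_length:
            return stem
    return word

# The noun is treated like a past-tense verb, so "lo" is removed instead of just "o".
print(toy_stem("koromyslo"))
```

Swapping the order of the two lists would fix this noun but then under-strip genuine past-tense verbs, which is the "least bad option" trade-off mentioned above.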

sysko, April 24, 2010 at 4:43:31 PM UTC

Send the file to our email address, team [at] tatoeba [dot] org, and I will see how to integrate it.
To be honest I don't really know how it works (A), so at the least I will contact the guys from this project to see what we can do :)
But it's already great that you have adapted it to Ukrainian.
