Wall (5,665 threads)

blay_paul, April 27, 2010 at 8:25 PM

Quick fix idea.

This should be a relatively easy change, and it should make my (and Jim's) life easier.

1. Setting the 'meaning' field for a Japanese sentence automatically links that sentence to the English sentence identified.

2. On a standard sentence display of a Japanese sentence the link to the sentence identified in the meaning field looks different to the rest. (i.e. Red arrow instead of green, or something).

I also suggest that a meaning field entry of zero (0) could be used to identify Japanese sentences that are intentionally not to be used with WWWJDIC.

blay_paul, April 27, 2010 at 12:24 PM

Duplicate removal script.

I don't know how the script works exactly, but I think it may be missing a step.

Suppose we have

100000 Hello.
100001 こんにちは。
100002 Hi.

100001 is linked to 100000
100001 has the meaning field of 100000

Now, suppose someone decides that 'Hello' and 'Hi' are close enough to not need both.

100000 Hello.
100001 こんにちは。
100002 Hi. ---> Hello.

Then suppose the script removes 100000.

100001 こんにちは。
100002 Hello.

Is 100001 still linked to 100000? It should be linked to the duplicate 100002 instead.

Does 100001 still have the meaning field of 100000? It should have the meaning field of 100002 instead.

In other words, if Sentence A is removed as a duplicate of Sentence B, then all the links that pointed to Sentence A should now point to Sentence B instead.

sysko, April 27, 2010 at 1:08 PM

the remove-duplicates script does the following:

it identifies all the sentences which have both the same language and the same text,
then it keeps the oldest sentence that is owned by someone (or the oldest one if none of the duplicates belongs to anyone) and relinks everything that pointed to the duplicates to this one
(so comments / translations / lists etc.),
and finally it removes the duplicates and keeps only one.
so the script will not produce any broken references to a removed sentence
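
A minimal sketch of the steps described above, not the actual Tatoeba script. The data layout (a dict of sentences and a set of link pairs) and every name in it are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_duplicates(sentences, links):
    """Merge sentences that share language and text.

    sentences: dict id -> {"lang", "text", "owner", "created"} (hypothetical layout)
    links: set of (from_id, to_id) pairs (hypothetical layout)
    Returns (kept_sentences, relinked_links, replaced_by).
    """
    # Group sentence ids by (language, text).
    groups = defaultdict(list)
    for sid, s in sentences.items():
        groups[(s["lang"], s["text"])].append(sid)

    # For each group of duplicates, keep the oldest owned sentence,
    # or the oldest one overall if none of them is owned.
    replaced_by = {}
    for ids in groups.values():
        if len(ids) < 2:
            continue
        owned = [i for i in ids if sentences[i]["owner"] is not None]
        keeper = min(owned or ids, key=lambda i: sentences[i]["created"])
        for i in ids:
            if i != keeper:
                replaced_by[i] = keeper

    # Repoint every link at the surviving sentence; drop self-links.
    relinked = set()
    for a, b in links:
        a, b = replaced_by.get(a, a), replaced_by.get(b, b)
        if a != b:
            relinked.add((a, b))

    kept = {sid: s for sid, s in sentences.items() if sid not in replaced_by}
    return kept, relinked, replaced_by


sentences = {
    100000: {"lang": "eng", "text": "Hello.", "owner": None, "created": 1},
    100001: {"lang": "jpn", "text": "こんにちは。", "owner": "a", "created": 2},
    100002: {"lang": "eng", "text": "Hello.", "owner": "b", "created": 3},
}
kept, relinked, replaced_by = merge_duplicates(sentences, {(100001, 100000)})
```

The same `replaced_by` mapping could be applied to any other field that stores a sentence ID, such as the 'meaning' field, so that it follows the surviving sentence too.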

Dorenda, April 27, 2010 at 2:10 PM

> identify all the sentence which have both the same language and the same text

So it also merges duplicates that are not linked whatsoever?

sysko, April 27, 2010 at 2:15 PM

Yep, that way even if newcomers add
I love you
and translate it,

as "I love you" already exists, the script will delete the new "I love you" and link the translation to the old "I love you" (or also remove it, if the translation already exists too)

blay_paul, April 27, 2010 at 1:22 PM

> so the script will not produce any broken reference to
> a removed sentences

There are, however, some broken references being produced. It's not clear how though.

236727 あなたには姉妹がいますか。
was linked to 71123, which now no longer exists.
69566 Do you have any sisters?
does exist and was indirectly linked from 236727.

I don't know when 71123 was removed, why it was removed, or how it was removed, but something obviously went wrong somewhere. (It was one of the \N records last week - so it obviously isn't a recent deletion)

Hopefully these broken links are left over from earlier times and won't be reoccurring.
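
Leftover broken links of this kind can be listed with a simple check. The sketch below assumes the links are available as (from_id, to_id) pairs and the existing sentence IDs as a collection; both are assumptions, not the real database schema. The second link in the example is made up to show a valid case.

```python
def find_broken_links(sentence_ids, links):
    """Return the links whose source or target sentence no longer exists."""
    existing = set(sentence_ids)
    return sorted(
        (a, b) for a, b in links if a not in existing or b not in existing
    )


existing_ids = [236727, 69566, 1, 2]   # 71123 is missing, as in the post
links = [(236727, 71123), (1, 2)]      # (1, 2) is a made-up valid link
broken = find_broken_links(existing_ids, links)
```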

sysko, April 27, 2010 at 1:25 PM

ok, at least the remove-duplicates script will not produce any more broken links

saeb, April 26, 2010 at 1:25 AM

I just gotta say I love the modified japanese readings.
そこで 彼女[かのじょ] に 会お[あお] う と は 思い[おもい] も かけ なかっ た 。
I think I'm getting passive practice just being on tatoeba :)

saeb, April 26, 2010 at 7:09 PM

who shall I thank for this delicacy?

sysko, April 26, 2010 at 8:11 PM

Biptaste but he's not really often here ^^

saeb, April 26, 2010 at 8:43 PM

btw sysko, any ideas on how to compile eclectus in a windows (7) environment...the wiki doesn't help at all and I have no experience with python (but some with c,c++,java,vb)

cburgmer, April 27, 2010 at 9:35 PM

Hi saeb, as sysko said, the current dependency on KDE (more specifically pykde) makes it impossible to run on Windows. I have already ported the whole lot to Qt only (read: it runs on Windows :) but haven't committed anything to the repository yet. I'll be at it pretty soon, why don't you send me an email, I'll be happy to have a beta-tester.

I'm OK with ppl emailing me, in fact I hardly get any mails asking for assistance but I need to visit walls on other projects to help out :)

saeb, April 28, 2010 at 1:24 AM

thx a lot for the reply :)

beta-tester? count me in! I'll give it a go whenever it's up n ready ^^

TRANG, April 26, 2010 at 8:52 PM

Oh saeb, saeb. "Windows" is a word that should NEVER, EVER be said in front of sysko... But you didn't know so he will forgive you.

He won't help you to compile eclectus on Windows Seven. However, he gladly will help you install a Linux distribution and compile eclectus on it.

saeb, April 26, 2010 at 9:08 PM

linux...you guys are gonna make me change my major soon enough :PP...must..take a break..from tatoeba

TRANG, April 26, 2010 at 9:46 PM

You know, you don't need to major in Computer Science to use Linux ^^

Also, how could you ever take a break from Tatoeba knowing that Portuguese is ahead of Arabic? :O
This is not the way to go. You have to fight harder!

saeb, April 26, 2010 at 10:20 PM

5 hrs integrated exams are not cool at all I swear xD. Not even spanish will be safe once I get my break, mark my words :PP

sysko, April 26, 2010 at 8:48 PM

python is an interpreted language so you don't need to compile it,
but i think it will not work for the moment on OSes other than linux, as eclectus still has some dependencies on KDE :( . but then cburgmer knows better than me, since he's the author of eclectus :p

saeb, April 26, 2010 at 9:06 PM

>Oh saeb, saeb. "Windows" is a word that should NEVER, EVER be said in front of sysko
oh somebody shoot me (apologies to all the cute babies who died in this incident :P)

>However, he gladly will help you install a Linux distribution and compile eclectus on it.

is that right sysko? anytime when u're ready (I just hope this doesn't take the whole weekend :P)

TRANG, April 26, 2010 at 9:33 PM

> is that right sysko? anytime when u're ready (I just hope this doesn't take the whole weekend :P)

I forgot to mention, "in a very distant future" :P Not that I want to prevent you from using eclectus saeb, but we (including sysko) actually have a lot of work at the moment :)

As sysko said, you can always contact the author of eclectus if you really want to use it (he's in Tatoeba under the username of cburgmer).

Also, note that eclectus is not something you compile, since Python is an interpreted language (I hadn't paid attention that it was in Python).

PS: If you are not a major in Computer Science, and do not know what the difference is between compiled and interpreted, Google is your friend :)

=> http://www.google.com/search?so...d+and+compiled

saeb, April 26, 2010 at 10:10 PM

I know..."use common sense", "read the manual", "do your research", "less talk more work". got it :) *opens anatomy book*

[I know you guys are very busy :P...I just can't stop messing with you frenchmen...always so serious (oh I'm so getting slaps on the face in my inbox aren't I?)]

saeb, April 26, 2010 at 8:29 PM

would be really nice if you could pass it on ^^ , tell'm he got fans on tatoeba :D

blay_paul, April 26, 2010 at 2:23 PM

Improved Handling For Large Text Entries.
(Long term wish list item)

This would need ...

a) A data structure that contains multiple sentence entries.

b) A page display that shows the whole text (or a page of it) with the sentences together. Also needs ability to handle some formatting (paragraph breaks at the least).

c) Guidelines on entering, editing and translating multi-sentence blocks of text. (For example sentences should not be translated out of context; if possible the whole block should be translated not just bits of it; etc.)

blay_paul, April 25, 2010 at 8:23 PM

Work in progress: Update.

More to do than I anticipated - probably another 4 days or so to finish this round.

(NOTE to self) Consider adding 二三日 to WWWJDIC

Demetrius, April 17, 2010 at 11:48 AM

I feel like adding a bit of Belarusian sentences. Could you please add it?

I’ll add sentences in Official Belarusian, so I suggest marking Belarusian with the current flag of Belarus.

Those who use Classical Belarusian don’t like that flag anyway. :)

TRANG, April 25, 2010 at 12:38 PM

Belarusian added :)

Could you tell us more about the differences between "Official Belarusian" and "Classical Belarusian"? Is there (will there be) a need to support both someday? Should they be two distinct "languages"? Or is it more like the case of Chinese (traditional vs. simplified)? Is it possible to convert automatically between Official and Classical Belarusian?

Demetrius, April 26, 2010 at 9:37 AM

It is called ‘Classical Orthography’, but in fact it is a different language standard. The two forms of Belarusian stemmed from a Soviet language reform that has (arguably) brought Belarusian closer to Russian (on the other hand, academic Belarusian is, arguably, closer to Polish).

Official Belarusian is taught at schools, but those who use Belarusian every day often prefer the Classical one. Classical is rather widespread on the Internet. Laws accompanying a new (minor) reform of Official Belarusian in 2007 have also in fact banned the use of Classical Belarusian in the press.

Apart from merely orthographical differences (сьнег/śnieh vs снег/snieh; робісся/robiśsia vs робішся/robišsia), there are lexical and grammatical ones.

A big problem is the transcription of loanwords. They aren’t simply written differently, they are pronounced differently. In Official Belarusian they are (somewhat inconsistently) borrowed from Russian or using a similar transcription system, whilst in Classical they are borrowed from West-European languages directly: метр/mietr vs мэтр/metr for meter, опера/opiera vs опэра/opera, сымбаль/symbal vs сімвал/simvał, Атэны/Athens vs Афіны/Afiny.

There are also some grammatical forms acceptable in Classical Belarusian but considered dialectal in Official (synthetic future tense: рабіцьму/rabićmu vs. буду рабіць/budu rabić), and different tendencies in forming some forms (Gen. pl. of мова can be моў or моваў; Classical prefers the latter whilst Official the former).

There are also some words considered Russisms/Polonisms in one variant and widespread in the other (працэнт/pracent vs адсотак/adsotak; цячэнне/ciačeńnie vs плынь/plyń).

Automatic conversion is possible, but there is no open-source software available to do this. The only one I know is “Litara”, a plugin for MS Word 2000 (http://pravapis.tut.by/), which is closed-source and is unlikely to be ported to new versions of Word because of Microsoft’s new policy (now it’s necessary to obtain permission from MS).

Wikipedia has 2 versions (be and be-x-old).

TRANG, April 26, 2010 at 6:44 PM

Thanks for the explanation!

Dorenda, April 25, 2010 at 4:23 PM

As far as I know, it's mostly a matter of spelling, but I'll leave it to Demetrius to tell you more about it. :)

What I wanted to say: Belarusian is not displayed in the list of numbers of words in each language on the main page.

sysko, April 25, 2010 at 6:26 PM

now it should be displayed :)

Dorenda, April 25, 2010 at 6:59 PM

Yep. Yay, 2 sentences already! :p

sysko, April 18, 2010 at 12:22 AM

No problem, but it will be added next week (the change of server makes us a bit busy ^^)

Pharamp, April 18, 2010 at 10:19 AM

I saw also that Icelandic isn't listed :(
I don't think I'm really good at it for making ten sentences, but I can try :)

Swift, April 25, 2010 at 2:15 PM

I was actually contemplating requesting for Icelandic...

sysko, April 25, 2010 at 2:37 PM

and now your dream comes true :) Icelandic is officially supported by Tatoeba :)
have fun ;-)

Swift, April 25, 2010 at 2:40 PM

That was fast...!

Dorenda, April 17, 2010 at 11:07 PM

Good idea. I was hoping you were going to add some Belarusian. :)

blay_paul, April 22, 2010 at 12:55 PM

> http://tatoeba.org/app/webroot/files/downloads/
> [Updated] Once a week. On Saturdays around 9AM France time.

If you get this message, could you do one early? :-)

I've made quite a lot of changes on the Tatoeba side and I want to see if they came out OK.

Also could you put a block on the Japanese index field from when you update the download files next till you import an updated version I send you?

TRANG, April 24, 2010 at 11:05 PM

Actually the exports are going to be delayed to Sunday evening. Didn't have time to work on that today ^^'

JimBreen, April 26, 2010 at 1:21 PM

Is the 25 April "wwwjdic.csv" OK to use? I'll download and check it, but I won't install until you or Paul say it's clear.

TRANG, April 26, 2010 at 1:32 PM

Yes please check it. I did the export yesterday but I was too tired to notify you about it.

I changed the way the data is retrieved, according to the email I sent you a few weeks ago (i.e. I'm using the meaning_id now).

Also, the double quotes are now escaped with a double quote, like Paul explained here:
http://tatoeba.org/eng/wall/sho...33#message_533

If there is any problem, let me know.
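
For reference, the escaping convention mentioned above (a double quote inside a field is written as two double quotes) is what Python's `csv` module does by default. The sketch below is only an illustration of that convention, not the actual export code; the row content, the `;` delimiter setting and the choice to quote every field are assumptions.

```python
import csv
import io

# Write one row with ';' as the separator and every field quoted.
# doublequote=True is the csv module's default: an embedded " becomes "".
buf = io.StringIO()
writer = csv.writer(buf, delimiter=";", quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["74008", "329712", 'He said "hi"; then left.'])  # made-up text

line = buf.getvalue()
print(line, end="")
# "74008";"329712";"He said ""hi""; then left."
```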

JimBreen, April 26, 2010 at 1:53 PM

Well, my conversion blew up. It wasn't just the double quotes you changed - you dropped the quotes around the sentence numbers.

E.g. it used to be:

"74008";"329712";"この天気とは気長に付き合っていくしかない。";"You have.....

Now it is:

74008;329712;"この天気とは気長に付き合っていくしかない。";"You have....

Can you put them back? I was relying on the ";" as a field separator.

TRANG, April 26, 2010 at 6:37 PM

Alright, I re-exported the WWWJDIC file with the quotes around the ids.

If you prefer the fields to be separated with a TAB, we can also do that in the future.

JimBreen, April 27, 2010 at 12:51 AM

I'm happy with it the way it is.

JimBreen, April 27, 2010 at 1:25 AM

OK, some feedback on the latest:

(a) you also changed the way of signalling empty sentences from \N to N.

(b) in a bizarre twist, sentence 249593 has been delivered with the Dutch sentence instead of the Japanese one!

A: Op onze website, http://www.example.com, staat alle informatie die je nodig hebt. Our Web site, http://www.example.com will tell you all you need to know.#ID=249593
B: 私ども 乃{の} ウェブサイト~ は 貴方(あなた)[01]{あなた} に 必要[01]{必要な} 情報 を 全て 伝える{お伝え} 為る(する)[10]{します}

Looking at these in the database I can't see a reason why it happened.

(c) the 41 with missing sentences are interesting. Most (36) have Japanese and indices but no English. Presumably they had English once. Five have English and the Japanese indices, but no Japanese sentence. E.g.

"205059";"42301";"N;"That's all right.";"其れ[01]{それ} は|1 申し分 無い{ない}

205059 doesn't exist. 42301 is there, but it is linked to 205057. This seems to be a case of broken links, perhaps during deletion of duplicates.

Does anyone want to see the list of the 41?

(d) Is it OK to amend some indices? About 5 don't currently match the sentence text.

blay_paul, April 27, 2010 at 4:47 AM

Not Trang, and commenting at 3am - so take with large helping of salt.

(a) Not nice, but there should be very few empty sentences (or none). This ties in with another point, so wait till the end of the post ;-)

(b) Bizarre twist is bizarre. I know why it happened though. Look at the log on http://tatoeba.org/jpn/sentences/show/164914 Jeroen changed the Japanese sentence to a Dutch sentence (don't know why). This left the index data intact.

(c) Yeah, I've mentioned that deleting Japanese sentences should also delete associated index data.

> Does anyone want to see the list of the 41?

I do.

(d) As long as you post the Japanese sentence ID for those sentences here.

Now the point I promised earlier. I think the code / SQL used to generate wwwjdic.csv needs to be changed in the following way:

* Include all sentences marked Japanese that have associated index data.
* Link to the English sentence (singular) given by the 'meaning' field in the index data.

I think that would eliminate many of the empty sentences and multiple outputs of the current version. NOTE that I'm currently around 3/4 finished doing a first check-up of the jpn_index.csv data so I would recommend Trang wait until I've sent that back in before doing any re-export.
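
The two rules above could be sketched roughly as follows. This is only an illustration under assumed data structures (two plain dicts), not the real export SQL; skipping rows whose 'meaning' does not point at an existing English sentence is also an assumption, and the index strings are placeholders.

```python
def build_export(sentences, index_data):
    """Build export rows from index data.

    sentences: id -> (lang, text)                    (hypothetical layout)
    index_data: jpn_id -> (meaning_id, index_text)   (hypothetical layout)
    """
    rows = []
    for jpn_id, (meaning_id, index_text) in sorted(index_data.items()):
        jpn = sentences.get(jpn_id)
        # Rule 1: only sentences that are marked Japanese and have index data.
        if jpn is None or jpn[0] != "jpn":
            continue
        # Rule 2: exactly one English sentence, the one named by 'meaning'.
        eng = sentences.get(meaning_id)
        if eng is None or eng[0] != "eng":
            continue
        rows.append((jpn_id, meaning_id, jpn[1], eng[1], index_text))
    return rows


sentences = {
    1: ("jpn", "こんにちは。"),
    2: ("eng", "Hello."),
    3: ("nld", "Hallo."),   # a sentence whose language was changed away from Japanese
    4: ("eng", "Hi."),
}
index_data = {1: (2, "index-a"), 3: (4, "index-b")}  # placeholder index strings
rows = build_export(sentences, index_data)
```

With this input only sentence 1 is exported; sentence 3 still has index data but is no longer marked Japanese, so it is left out instead of being delivered with the wrong text.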

JimBreen, April 27, 2010 at 8:14 AM

(d) OK

85518 - dropped a trailing よ
151863 - 私達 -> 我々
152372 - 肉屋 -> 肉 や
165426 - added {町はずれ}
197934 - removed {はな} since you changed it to 鼻

TRANG, April 26, 2010 at 2:05 PM

Ah yes, that too. I took them out because sysko told me it was not CSV compliant to have quotes around numbers, which makes sense because quotes indicate "this is a string" (vs. "this is an int").

But of course, if it's easier for you to have the quotes, I can put them back (when I get back home).

JimBreen, April 26, 2010 at 2:14 PM

Yes, please.

It would make it a LOT simpler if there was a consistent inter-field sequence. Having ; alone gives me a problem as they occur often in the middle of English sentences. (I don't care whether a field is numeric or not - they are all just strings to me.)

sysko, April 26, 2010 at 2:17 PM

so TAB wouldn't be easier to parse?

JimBreen, April 27, 2010 at 12:51 AM

TAB is fine. In fact I convert ";" to TAB as a first step.

blay_paul, April 27, 2010 at 4:49 AM

TAB is fine with me too (not that that matters much ;-)

blay_paul, April 26, 2010 at 2:18 PM

The theory is that no ; inside text field delimiters is to be counted as a field separator in .csv format. Having said that I don't care at all whether the ID numbers are handled as 'field' or 'text' and I'm sure nobody else would have trouble converting field type if required.

I did find the \" notation fiddly though, so I'm glad that's gone.
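
To illustrate the point about separators inside text delimiters: a quote-aware CSV reader treats a `;` inside a quoted field as ordinary text, so it does not matter whether the ID fields are quoted or not. A small sketch with Python's `csv` module; the English text in the sample lines is made up.

```python
import csv


def parse_line(line):
    """Split one ';'-separated line, honouring double-quoted fields."""
    return next(csv.reader([line], delimiter=";"))


quoted_ids = '"74008";"329712";"この天気とは気長に付き合っていくしかない。";"Wait; I said ""no""."'
bare_ids = '74008;329712;"この天気とは気長に付き合っていくしかない。";"Wait; I said ""no""."'

fields = parse_line(quoted_ids)
same = parse_line(bare_ids)      # identical result without quotes around the ids

# Converting to TAB-separated output afterwards is a simple join,
# as long as no field contains a TAB itself.
tab_line = "\t".join(fields)
```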

blay_paul, April 26, 2010 at 1:31 PM

You can go ahead. It's the index (and 'meaning' field) that I'm working on so just bear in mind that any changes you make to the index data will be overwritten later (unless you let me know about them).

TRANG, April 22, 2010 at 11:03 PM

Okay, I've updated the jpn_indices.csv. If you needed also the wwwjdic.csv file to be updated, you'll have to wait until Saturday...

I forgot we still haven't finished configuring everything on the new server, I can't export into CSV from there yet. I'd have to import the database into my local version, then do the export into CSV, then re-export if I wanted to update the CSV (which I did for the jpn_indices).

Well, as a result, the usual 9AM Saturday exports might be delayed to the evening.

blay_paul, April 22, 2010 at 11:09 PM

No worries. Most of what I'm doing only needs the jpn_indices.csv itself, the rest can probably wait until Saturday. I'm off to sleep myself now, so I'll give you a progress report at the end of tomorrow or something. :-)

TRANG, April 22, 2010 at 6:58 PM

Yes, I'll do this some time before going to sleep (and I sleep early don't worry).

CK, April 24, 2010 at 4:10 AM, edited October 25, 2019 at 8:01 AM

[not needed anymore- removed by CK]

TRANG, April 24, 2010 at 9:48 AM

It's interesting you're suggesting this now, because I'll be meeting my team in a couple of hours and this is one thing we are going to discuss :)

My current idea, instead of using votes, is to use the adoption system. There aren't enough active members for a vote system to run smoothly, so it is not yet worth implementing.

However, we can already have a quick first stage of proof-reading using the features we already have, that is, by encouraging users to adopt "orphan sentences" (those that don't belong to anyone). While adopting, of course, you would correct mistakes if there are any. But you can simply adopt in a way to say "I've seen this sentence, it's correct".

To make the whole process quicker, I was thinking of using the lists. I could generate lists of 200 sentences and add orphan sentences to this list. These sentences would be automatically attributed to a certain user while being added to the list. This user then can simply read the sentences and remove them from the list as they are being checked/corrected.

This would require very little coding, so it would be a good solution until we get to integrate a "vote" system of any kind.
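
The list idea could look something like the sketch below. It assumes sentences are held as a dict of id -> owner (None for orphans) and that one reviewer receives all the lists; the function name and data layout are hypothetical, not existing Tatoeba code.

```python
def make_review_lists(owners, reviewer, size=200):
    """Split orphan sentences into lists of `size` and attribute them to `reviewer`.

    owners: dict sentence_id -> owner name, or None for an orphan (hypothetical layout).
    Returns a list of lists of sentence ids; `owners` is updated in place.
    """
    orphans = [sid for sid, owner in sorted(owners.items()) if owner is None]
    lists = []
    for start in range(0, len(orphans), size):
        batch = orphans[start:start + size]
        for sid in batch:
            owners[sid] = reviewer   # attributed while being added to the list
        lists.append(batch)
    return lists


owners = {i: None for i in range(1, 451)}                  # 450 orphans
owners.update({i: "someone" for i in range(451, 501)})     # 50 owned sentences
review_lists = make_review_lists(owners, "reviewer")
```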

blay_paul, April 25, 2010 at 3:46 PM

I, also, don't think there are enough members for a vote system to work. However I do think that there should be a way for new sentences and changed sentences to be 'validated'.

When changing (or adding) a sentence it is marked 'unvalidated' until someone (other than the person who changed it) confirms the change as valid. There should also be a way to view "Unvalidated sentences (in language X)".

It would also be nice if recently validated sentences (say from the last week) in a certain language could be viewed.

TRANG, April 24, 2010 at 9:49 AM

> I love what you are doing with this website

Thanks =) We love what we are doing as well ;)

blay_paul, April 23, 2010 at 9:29 PM

Work in progress

There are currently lots of index records that need altering because words have been added / removed / changed in Edict.

For example
* headwords that were unique may no longer be unique (so readings need to be added).
* Entries that were once two separate dictionary records may have been merged into one (so indexing needs to be changed to one keyword only as well)
* Adding, removing and merging words may leave the |1, |2, ... notations needing an update.
* There is also a great deal of general checking to do.

JimBreen, April 24, 2010 at 12:46 AM

The first two of these can be checked automatically, although it's not a small task. Re the "|1, |2, ... notations", are they still needed?

blay_paul, April 24, 2010 at 3:46 AM

> The first two of these can be checked automatically,
> although it's not a small task.

Probably best not done automatically. Checking like that always brings up a bunch of things to check/correct in Edict. (c.f. today's, and yesterday's submissions ;-)

It also familiarises me with the words/readings in use in the examples (all 28,000 odd of them :-P

> Re the "|1, |2, ... notations", are they still needed?
I still need them (or something equivalent). For sanity checking, as much as anything. Perfectionists (and people using Microsoft products) still need them. (No comments about the intersect being a null set, please)

blay_paul, April 23, 2010 at 9:25 PM

Unique identification of a JMDICT entry.

This is technical stuff, only really of interest for people who want to deal with the 'index' field of Japanese sentences.

The obvious way to identify a dictionary entry as used in WWWJDIC + Edict is by the entry number (duh!). However that's not how it's done in the index field. Why? Because 2147630 is not 'human friendly' for whoever is creating and editing the index fields. (i.e. you can't look at 2147630 and know what word it refers to)

You could identify by the headword of the dictionary entry - あっという間に will only match one record. However there are around 3,000 dictionary entries where that will not be enough. 前(まえ) is not the same entry as 前(ぜん). So, for ambiguous kanji headwords, you include the reading of the word as well.

You've now reached the basic method used by Jim when he links WWWJDIC to example records. However that is not enough to identify 100% of the entries in JMdict uniquely. There are kana headwords like 'は' that are present in more than one entry.

はあ; は (int) (1) yes; indeed; well; (2) ha!; (3) what?; huh?; (4) sigh

は (prt) (1) (pronounced わ in modern Japanese) topic marker particle; (2) indicates contrast with another option (stated or unstated); (3) adds emphasis; (P)

Note that the second [EX] link in the はあ; は entry is actually for the particle は. There are a few like this that can only be fixed by re-doing the indexing system that Jim uses (in a way that would be more complicated and slower than he wants to deal with).

Now we come to the notation used in Tatoeba (and that I used 'at home' when maintaining the Tanaka example collection). Every headword that is not unique has the notation |1, |2, |3, ... added.

So instead of 前(さき) 前(ぜん) 前(まえ)
you might have 前|1(さき) 前|2(ぜん) 前|3(まえ)

Some points to note:
* The numbers are assigned in order by JMDict entry ID. So 前|1(さき) (Entry 1387210) comes before 前|2(ぜん) (Entry 1392570) and 前|3(まえ) (Entry 1392580)
* Because I used Access (and Excel) some characters are treated as 'equivalent' by Microsoft that are not actually identical. So ヽ|1 ヾ|2 ゝ|3 ゞ|4 and 々|5 all need to be distinguished by the numerical notation.
* The order of headwords / readings is significant in JMDict - most common headwords are supposed to come first, least common last. When indexing the preference is to use the first headword / reading when possible.
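
The numbering rule in the first point can be sketched like this. The input format (a list of entry ID, headword, reading tuples) is an assumption, and the entry ID used for あっという間に is made up; the three 前 entry IDs are the ones quoted above.

```python
from collections import defaultdict


def number_headwords(entries):
    """Label each entry; non-unique headwords get |1, |2, ... in entry-ID order.

    entries: list of (entry_id, headword, reading or None)  (hypothetical layout)
    Returns dict entry_id -> label.
    """
    by_headword = defaultdict(list)
    for entry_id, headword, reading in entries:
        by_headword[headword].append((entry_id, reading))

    labels = {}
    for headword, items in by_headword.items():
        items.sort(key=lambda item: item[0])      # order by entry ID
        if len(items) == 1:
            labels[items[0][0]] = headword        # unique: no number needed
            continue
        for n, (entry_id, reading) in enumerate(items, start=1):
            suffix = f"({reading})" if reading else ""
            labels[entry_id] = f"{headword}|{n}{suffix}"
    return labels


entries = [
    (1392580, "前", "まえ"),
    (1387210, "前", "さき"),
    (1392570, "前", "ぜん"),
    (9999999, "あっという間に", None),   # made-up entry ID for a unique headword
]
labels = number_headwords(entries)
```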

NOTE that the point of the index data is to uniquely identify a dictionary entry, NOT to reflect 100% accurately the dictionary form and reading of the word being indexed.

e.g. If you used 挙がります in a sentence the index would include
上がる{挙がります}
it would not include
挙がる{挙がります}

Apart from anything else this ensures that there is only one [EX] link from dictionary entries.

Finally the square brackets note the sense of the word
上がる[02] = Second sense of the word 上がる
and the curly brackets show the /exact match/ for the indexed word in the sentence.

僕は学校の成績が上がった。
僕|2(ぼく)[01] は|1 学校 乃{の} 成績 が 上がる{上がった}
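
For anyone who wants to process the index field programmatically, here is a rough parser for the notation described in this post. It is a sketch based only on the pieces explained above (headword, |n, reading in round brackets, sense in square brackets, exact match in curly brackets); any other markers that may occur in real index data are not handled, and treating a token without curly brackets as appearing in the sentence in its headword form is an assumption.

```python
import re

# headword, then optional |n, (reading), [sense], {surface form}, in that order.
TOKEN = re.compile(
    r"(?P<headword>[^|(\[{]+)"
    r"(?:\|(?P<number>\d+))?"
    r"(?:\((?P<reading>[^)]+)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]+)\})?"
)


def parse_index(index_text):
    """Parse a whitespace-separated index string into a list of dicts."""
    tokens = []
    for raw in index_text.split():
        match = TOKEN.fullmatch(raw)
        if match is None:
            raise ValueError(f"unrecognised index token: {raw!r}")
        tokens.append(match.groupdict())
    return tokens


tokens = parse_index("僕|2(ぼく)[01] は|1 学校 乃{の} 成績 が 上がる{上がった}")

# Rebuild the sentence text (without the final 。) from the surface forms.
rebuilt = "".join(t["surface"] or t["headword"] for t in tokens)
```

Comparing `rebuilt` with the sentence text is one possible automatic check for index entries that no longer match their sentence.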

*Psst* Trang - maybe this should be on a page somewhere?