blay_paul, April 18, 2010 20:22

.csv format in downloads.

Just a note for those using the downloads. You use \ as the escape character.

This is a line from your csv file:

"4923";"1512";"「信用して」と彼は言った。";"\"Trust me,\" he said.";"信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

This is how it appears when loaded into Excel.

4923 1512 「信用して」と彼は言った。 \Trust me,\" he said." 信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

Excel uses double " marks when escaping quotes. The same line in csv for Excel would be...

"4923";"1512";"「信用して」と彼は言った。";"""Trust me,"" he said.";"信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}"

Which imports to Excel as follows:

4923 1512 「信用して」と彼は言った。 "Trust me," he said. 信用 為る|1(する){して} と|1 彼|2(かれ)[01] は|1 言う{言った}

I think the 'escaping with extra quote mark' may be the more standard version ...
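
For anyone who wants to convert the download before opening it in Excel, here is a minimal Python sketch. It assumes the file is semicolon-delimited with every field quoted, as in the sample line above; the function name convert_escapes is made up for illustration and is not part of any Tatoeba tool.

```python
import csv
import io


def convert_escapes(text: str) -> str:
    """Re-encode backslash-escaped CSV text using doubled quote marks.

    Assumption: fields are separated by ';' and quoted with '"',
    as in the sample line from the download.
    """
    # Read with backslash as the escape character and quote doubling off.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=";",
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    out = io.StringIO(newline="")
    # Write with every field quoted; embedded quotes are doubled by default.
    writer = csv.writer(
        out,
        delimiter=";",
        quotechar='"',
        quoting=csv.QUOTE_ALL,
        lineterminator="\n",
    )
    writer.writerows(reader)
    return out.getvalue()
```

Passing the contents of the downloaded file through this function should give lines in the doubled-quote style shown above, which Excel imports cleanly.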

kellenparker, April 18, 2010 15:29

Right. So. I'm not 13 years old. It was an honest mistake. Here's the problem:

I was wondering if Tatoeba had any sort of resistance to profanity. I thought something like "damnit" would be a common enough thing. So I MEANT to SEARCH for "damn". Turns out I added it as a sentence instead. Same for "fuck" because it took me two tries to realise I was using the wrong text box.

So those can be deleted outright. I didn't see any way to do it so I abandoned the sentences instead in case there is such a way and someone else wants to adopt them to get them deleted. 380292 and 380290.


TRANG, April 18, 2010 15:40

Hahaha, it's fine. It's also our mistake; it means we need to change the form to make it clearer that it adds a new sentence.

I will delete your entries. There's currently no way for users to delete sentences; only admins can. The only solution when you want to "delete" a sentence is to replace it with a sentence that you actually want to keep.

As for profanity, we don't have anything against it, but we'd rather avoid it until we set up a mechanism to filter out sentences that are "not safe" for kids.

kellenparker, April 18, 2010 15:44

Good to know. And actually it's still pretty much entirely my fault. I searched at the top but then I think I stopped paying attention so when it sent me to the page saying "Nope, but you can add a sentence:", I thought I was searching again. It's not the form's issue. It's my attention span's issue.

sysko, April 18, 2010 16:48

And as for profanity, we have some "colorful" sentences (spoiler: search for XXX in the search engine)

sysko, April 18, 2010 14:39

Stemming should be working again for most languages when using the search engine,
i.e. searching for "think" should also return "thinking", "thought", etc. The same goes for French, Spanish, Italian, Russian, etc.

By the way, it will not work with Ukrainian, but I was wondering whether using the Russian stemmer would produce a "better than nothing" result? Demetrius, Dorenda?
Still looking for Arabic and Georgian stemmers.

Dorenda, April 18, 2010 15:02

Probably it will. But maybe there is a way to adapt the Russian stemmer into a Ukrainian one (or at least something more fit to Ukrainian)? I have no idea how those things work or how much work it would be, but if it's feasible, I could help with that.

sysko, April 18, 2010 15:19

How the stemmer works for Russian is explained here. I admit I haven't read it entirely, as I don't know any Russian (and besides, they provide something that works out of the box for it).

So I dunno how "easy" it is to adapt this to Ukrainian.

Dorenda, April 18, 2010 17:08

It looks doable. I'd just have to adapt it to the Ukrainian alphabet, change the endings into their Ukrainian counterparts and add/remove some endings that either of the two languages doesn't have.

So I'd have to just change that piece of script on the blue background, right?

sysko, April 18, 2010 22:55

yep this one to be more precise :) thanks

Dorenda, April 24, 2010 15:55

Okay, I adapted it. The results won't always be right, though, because sometimes it's just not possible to tell from the form of a word what type of word it is, and thus which part belongs to the ending. For example, "koromyslo" is a noun, so only "o" should be removed, but the script will think it's a past tense verb and remove "lo". I tried to choose the least bad options...
Anyway, is there some way to test it? And where should I send it?

And one more question. How can I make the thing also remove the superlative prefix '{n}{a}{i'}' from the beginning of words?
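
To illustrate the ambiguity described above, here is a toy suffix stripper in Python. It is not the Snowball script and not the adapted Ukrainian stemmer: the two endings, the transliterated prefix "nai", the minimum stem length, and the name toy_stem are all assumptions chosen only for illustration. It shows why a rule that tries the longer ending first cuts "koromyslo" too short, and what stripping a fixed prefix before the endings could look like in principle.

```python
# Hypothetical, heavily simplified rules (transliterated), for illustration only.
SUFFIXES = ("lo", "o")  # tried in order: verb-like "lo" before noun-like "o"
PREFIX = "nai"          # superlative prefix, transliterated
MIN_STEM = 2            # assumed minimum number of letters left after stripping


def toy_stem(word: str) -> str:
    """Strip the prefix first, then the first matching suffix."""
    if word.startswith(PREFIX) and len(word) > len(PREFIX) + MIN_STEM:
        word = word[len(PREFIX):]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + MIN_STEM:
            return word[: -len(suffix)]
    return word


# The noun is treated like a past tense verb, so "lo" is removed, not just "o".
print(toy_stem("koromyslo"))
# The superlative prefix is removed before the endings are checked.
print(toy_stem("naikrashchyi"))
```

A rule based only on spelling cannot tell the noun from the verb here, which is why the results of such a stemmer are a trade-off rather than always right.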

sysko, April 24, 2010 16:43

Send the file to our email address, team [at] tatoeba [dot] org, and I will see how to integrate it.
To be honest, I don't really know how it works (A), but at least I will contact the guys from this project to see what we can do :).
It's already great that you have adapted it to Ukrainian.

saeb, April 17, 2010 20:48

Congrats on the new server! I can already feel the site is 100x faster. Oh, and I'm in love with the new inbox, great update! Are we cool or are we cool :)

Dorenda, April 17, 2010 23:05

You're cool. :)
It's so much faster, great! :D
(And I just loved that note we got while the site didn't work. :))

blay_paul, April 15, 2010 19:32

*psst* Trang (or sysko)

I need to replace
but the <space> doesn't seem to work for the 'replace with' field. (At least there aren't any spaces in the preview).

There are 87 instances that need to be replaced in the index field so I don't really want to do it manually.

TRANG, April 15, 2010 23:13

You have to use an actual space in the "Replace" field, not the <space> tag :)

The reason why you have to type <space> in the "Search" is because trailing spaces are not taken into account in the search, for some reason. But the "Replace" field accepts trailing spaces (normally...).

blay_paul, April 16, 2010 5:40

> You have to use an actual space in the "Replace" field, not the <space> tag :)

I tried it both ways - no spaces in the preview display.

I've found what the problem is, though. The preview button only works ONCE. If the old preview is still displayed then it doesn't do anything when you click the preview button with a different string.

TRANG, April 16, 2010 10:51

> The preview button only works ONCE.

Ah right, I forgot to warn you about this. The "preview" function may work more than once, but I have yet to figure out the conditions under which it works or doesn't work a second time. In your case, I'm guessing it didn't work because of the < and >...

brauliobezerra, April 15, 2010 15:46

About names, can we translate them if there's an obvious correspondence? I'm talking about names like Peter, Mary, etc.

MUIRIEL, April 15, 2010 17:23

Good question. Personally, I never translate them, because I think Ann should be called Ann, and not Anne, no matter if she is in France or in the UK at the moment ;).
But I often see translations of names on Tatoeba...

JimBreen, April 16, 2010 14:39

Hmmm. So my younger sister should change her name from Anne to Ann?
My parents got it wrong? 8-)
Actually Anne is about as common as Ann among English-speaking people. Canonical spellings are a thing of the past, and we've always had Graham and Graeme, Roger and Rodger, etc.

MUIRIEL, April 16, 2010 14:47

That's not what I meant.
I just meant that I would call your sister what your parents call her and not translate her name into my language (or into any other language).

blay_paul, April 15, 2010 17:27

With Japanese you should 'transliterate' to katakana so Paul becomes ポール (for instance). When going from Japanese to English there are a number of variations to consider.

TRANG, April 15, 2010 18:09

You can.

As far as I'm concerned, I have the same opinion as Muiriel. But we won't forbid translations of names. I don't see any good reason to forbid it anyway.

Dorenda, April 15, 2010 19:02

In general I agree that a person should be called by his/her own name, no matter where he/she is, but some languages have more of a tendency to translate names (as I read somewhere lately, when they speak about George Bush in Scottish Gaelic, they call him Seòras Bush, for example, while in Dutch we would (nowadays) just leave his name the way it is), so I think you should also consider how common it is for the language you're translating into to translate names or to use the foreign version.
And then there is the next problem... Suppose an English sentence about Peter has been translated into Russian by someone who decided to translate the name. So now we have a Peter and a Pyotr. If someone translated the Russian sentence into Ukrainian, it would look silly not to make it Petro, since that's how they do it: Ukrainians use different versions of their name depending on what language they are speaking. Now if I wanted to translate any of these sentences without translating names, I'd have to make three translations. Or I could just choose one of them and link my translation to all three other sentences, but it would be strange to have a Dutch sentence with Pyotr as a translation of an English sentence about Peter, for example. So I would choose a name that is common in Dutch: Peter, or maybe Pieter or even Petrus.

Long story, but what I wanted to say is: it all depends on the situation and the language you're translating into. :)

MUIRIEL, April 15, 2010 19:12

Same example for the French^^: they pronounce George Bush as if it were a French name. Too strange for me as a German - we would never call him Georg Busch :D.

saeb, April 15, 2010 19:29

Oh god, I would never translate Peter into Arabic. The Arabic version sounds awful :P

blay_paul, April 14, 2010 14:31

Sentence Annotation page

Could you put up a "Changes saved" message on the page after you click the 'save' button? Otherwise it's easy to forget whether you've saved the work you've done or not.

JimBreen, April 15, 2010 3:03

I second that request. Also a log of changes would be really good.

TRANG, April 15, 2010 9:28

> Could you put up a "Changes saved" message on the page after you click the 'save' button?

Yes, I'll take care of this after we have moved to our new server.

> Also a log of changes would be really good.

I'll try to do that by the end of the month.

blay_paul, April 10, 2010 12:15

MeCab dictionary usage.

I see that MeCab installs (by default) with IPADIC.

Looking at this page

it would seem that Unidic may give a superior result if it can be used. I plan to do a little experimentation to see if I can improve the parsing capabilities of MeCab from the default setup.

In this regard I would be grateful if someone could recommend a USER FRIENDLY free database that SUPPORTS JAPANESE CHARACTERS.

saeb, April 10, 2010 16:41

SQLite?

blay_paul, April 10, 2010 19:18

> SQLite?

Sounds familiar. Actually I installed that on my previous computer (although in the end different software suited me better for what I was working on then). I gave it a try again, but it's not user friendly enough for me (I'm from the graphical interface generation ;-)

MySQL Workbench + Server looks promising, I'm giving that a try now.

blay_paul, April 10, 2010 22:21

Disappointed in MySQL Workbench. It's /nearly/ there, but not quite. :-( I'm seriously considering buying Access 2007 now.

I could probably use Excel 2007 for some of it - but it really isn't a good idea to 'make pretend' that a spreadsheet is a database.

sysko, April 10, 2010 23:16

MySQL + phpMyAdmin?

blay_paul, April 10, 2010 23:46

I think I'll probably start off with Excel 2007 (because I'm very familiar with it) then gradually migrate the content to MySQL. MySQL isn't bad but there are too many gaps in the thin GUI veneer provided by Workbench. Like having to resort to command line SQL stuff to import data from text. I miss the Access wizards for that sort of thing.

saeb, April 10, 2010 16:47

google spreadsheet?

blay_paul, April 10, 2010 17:09

> google spreadsheet?

Sounds rather too 'spreadsheety'. ;-)
I've got Excel 2007 for that. (I'd use Access 2007, but I couldn't afford the professional version of Office)

Now if anyone feels like donating it ... :-)

saeb, April 10, 2010 17:36

Plus you can get others to work on it in real time.

blay_paul, April 10, 2010 18:03

It wouldn't work for what I want to do - the maximum number of rows* is too small.

* Technically maximum number of cells, as the rows allowed varies depending on how many columns are used.

saeb, April 10, 2010 17:22

it's way neater :), i heart it :P

blay_paul, April 10, 2010 16:39

Update. I tried out Unidic and it reads こういう風 correctly.

Also note that the page linked above shows that you can get auto-generated audio for Japanese sentences. Obviously a human voice would be best, but auto-generated would be a good start (seeing we have so many sentences to deal with). The Unidic voice example is very good.

JimBreen, April 15, 2010 3:13

Unidic has some copyright issues. Kokken have wrapped it up in some typically stupid requirements. NAIST (from whence ChaSen and MeCab come) have frozen IPADIC (which also has copyright issues) and concentrate on NAIST-JDIC which is much more kosher freeware.

Later this year I'll be starting work on building a super-large dictionary for MeCab/ChaSen for a project I'm involved in. I probably won't be able to make a public release of it, as I'll be using lexical material from commercial sources and I've signed all sorts of agreements. I'll explore whether I can get a copy to Tatoeba.

blay_paul, April 14, 2010 11:58

Missing sentence? (Possibly recently deleted)

Here, if the weather's good, you can get a lovely view.

JimBreen, April 15, 2010 3:01

Still there. Japanese is 77859 (owned by you) and the English is 325859. It can't be found by searching for the words, for some reason. Any idea why, Trang? I often can't get to Japanese sentences when using the text as a search key, and I have to go in via the number.

TRANG, April 15, 2010 9:33

Because it hasn't been indexed by the search engine. I haven't launched the indexing process for a while...

sysko, April 18, 2010 1:12

The index has been updated. We've switched from Lucene to Sphinx for the search engine, and we will soon try to make it update in real time :)

Dorenda, April 14, 2010 19:35

There are two sets of sentences, one saying that Latin is a highly inflected language, and the other saying that Latin is a dead language, and they're linked. I think the Polish and Ukrainian sentences that link them should be unlinked, so that they become two separate sets, but neither owner of these sentences is a trusted user. Can you do that, TRANG, or someone else?

TRANG, April 14, 2010 21:45

Okay, done. I could make zipangu a trusted user as well, but he hasn't been back in a while...