Topic #20153 - Tatoeba

dcsan, 14 August 2014 22:38:38 UTC

Is there a JSON or YAML version of the sentence database anywhere, or any tools to make one from the daily downloads? I run this site: http://jgram.org and wanted to update our examples.

Ideally something like:

cname: uniqueId
ex:
  en: sentence in english
  ja: same in japanese
tags: [list, of, tags]

or similar?

User55521, 14 August 2014 22:48:38 UTC, edited 14 August 2014 22:48:59 UTC

Have you seen the downloads section on this site: http://tatoeba.org/eng/downloads ?

The format is very different from what you describe, though.

dcsan, 14 August 2014 23:01:32 UTC

Yes, I have, but to match English and Japanese I'd need to write a script that pairs sentences using links.csv. I'm sure I could manage that, but I was wondering what already exists. The content would be much more accessible to people if that step weren't required.

I came across this: https://github.com/allan-simon/tatodb
but running my own graph DB written in C is a bit esoteric. A JSON version of the corpus that could go straight into MongoDB would be ideal.
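
A minimal sketch of the pairing step described above. It assumes the daily exports are tab-separated, with `sentences.csv` rows of the form id, language code, text and `links.csv` rows of the form id, translation id, and that each link is listed in both directions. The function names, the `eng`/`jpn` codes, and the output shape are illustrative, not an existing Tatoeba tool.

```python
import csv
import io
import json


def load_sentences(lines, langs):
    """Return {sentence_id: (lang, text)} for the wanted languages.

    Assumes tab-separated rows of the form: id, lang, text.
    """
    sentences = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 3 and row[1] in langs:
            sentences[int(row[0])] = (row[1], row[2])
    return sentences


def pair_sentences(sentences, link_lines, src="eng", dst="jpn"):
    """Yield one dict per direct src -> dst link.

    Assumes tab-separated rows of the form: id, translation_id.
    """
    for row in csv.reader(link_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        a, b = int(row[0]), int(row[1])
        if a in sentences and b in sentences:
            (lang_a, text_a), (lang_b, text_b) = sentences[a], sentences[b]
            if lang_a == src and lang_b == dst:
                yield {"src_id": a, "dst_id": b, src: text_a, dst: text_b}


# Tiny in-memory stand-ins for the real files (made-up IDs and sentences).
SENTENCES = (
    "1\teng\tHe eats an apple.\n"
    "2\tjpn\t彼はりんごを食べる。\n"
    "3\tfra\tIl mange une pomme.\n"
)
LINKS = "1\t2\n2\t1\n1\t3\n3\t1\n"

sentences = load_sentences(io.StringIO(SENTENCES), {"eng", "jpn"})
pairs = list(pair_sentences(sentences, io.StringIO(LINKS)))
for pair in pairs:
    # One JSON document per line.
    print(json.dumps(pair, ensure_ascii=False))
```

For the real files, the `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`. A sentence with several translations produces several documents, one per link.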

User55521, 15 August 2014 03:27:49 UTC, edited 15 August 2014 03:37:52 UTC

You want a table format, while the Tatoeba corpus is a graph. See the difference here: http://blog.tatoeba.org/2010/02...eba.html#rule2

A graph cannot be converted to a table without losing information. For example, the sentence "Ul alma aşıy" may be translated as "He eats an apple", "She eats an apple", or even "It eats an apple" (and also "He is eating an apple" or "She is eating an apple"). In your format, this cannot be stored:
{
    tat: "Ul alma aşıy.",
    yue: "佢食緊個蘋果。",
    eng: ???
}

Which variant should we choose? "He eats an apple", or "She eats an apple", or "He is eating an apple", or "She is eating an apple"?

The format you are asking for just doesn’t work well for Tatoeba.

(It can, however, work for a subset of Tatoeba data. If you just need Japanese examples, you can use only one translation, probably chosen at random, and ignore all the others.

Of course, different people need different subsets, so Tatoeba can't provide all the possible subsets. You need to generate your own subset yourself.)

Also, you probably don’t need TatoDB.
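
The "keep only one translation" idea above can be sketched as follows. This is an illustration only: the function name is made up, the input is assumed to be a list of (source, translation) text pairs already extracted from the corpus, and a fixed seed is used so that repeated runs pick the same subset.

```python
import random
from collections import defaultdict


def one_translation_each(pairs, seed=0):
    """Keep a single, randomly chosen translation per source sentence.

    `pairs` is an iterable of (source_text, translation_text) tuples.
    """
    by_source = defaultdict(list)
    for source, translation in pairs:
        by_source[source].append(translation)
    rng = random.Random(seed)  # fixed seed so repeated runs give the same subset
    return {source: rng.choice(options) for source, options in by_source.items()}


pairs = [
    ("Ul alma aşıy.", "He eats an apple."),
    ("Ul alma aşıy.", "She eats an apple."),
    ("Ul alma aşıy.", "It eats an apple."),
]
subset = one_translation_each(pairs)
print(subset)
```

The other translations are simply discarded, which is the information loss described above; that is acceptable only when a single example per sentence is all that is needed.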

dcsan, 17 August 2014 07:47:34 UTC

Thanks for the replies!

Is it a true graph DB, in that the links are not always both ways?

Otherwise, if it's just a question of multiple entries, you can easily do that with a JSON/YAML format or a document DB like Mongo. You just add an array for the entries:

{
    _id: 1234,
    tat: ["Ul alma aşıy."],
    eng: ["It eats an apple", "she eats an apple", "etc"]
}

Is there a script around to just build a simple table list with example pairs? For my purpose at JGram I just need one example (maybe the first is the most representative?)

User55521, 17 August 2014 18:44:58 UTC

No, the links are bidirectional. If X is a translation of Y, then Y is a translation of X.

I don't know of any ready-made scripts for creating sentence pairs, but I think they shouldn't be difficult to write...

Unfortunately, arrays won't work well because information is still lost:
{
    _id: 1234,
    tat: ["Ul alma aşıy."],
    yue: ["佢食緊個蘋果。"],
    cmn: ["她在吃個蘋果。", "他在吃個蘋果。"],
    eng: ["It eats an apple.", "She eats an apple.", "He eats an apple."]
}

In such a format, there is no way to know that Mandarin "她在吃個蘋果。" means "She eats an apple." (and not "He eats an apple.").
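
By contrast, here is a sketch of a document layout that keeps the graph: one document per sentence, each carrying the IDs of its direct translations. The IDs and the `lang`, `text`, and `links` field names are made up for illustration, and the link from "他在吃個蘋果。" to "He eats an apple." is assumed by analogy with the example above.

```python
# Hypothetical IDs; each document lists the IDs of its direct translations.
docs = [
    {"_id": 1, "lang": "tat", "text": "Ul alma aşıy.", "links": [2, 3, 4, 5, 6]},
    {"_id": 2, "lang": "cmn", "text": "她在吃個蘋果。", "links": [1, 4]},
    {"_id": 3, "lang": "cmn", "text": "他在吃個蘋果。", "links": [1, 5]},
    {"_id": 4, "lang": "eng", "text": "She eats an apple.", "links": [1, 2]},
    {"_id": 5, "lang": "eng", "text": "He eats an apple.", "links": [1, 3]},
    {"_id": 6, "lang": "eng", "text": "It eats an apple.", "links": [1]},
]
by_id = {doc["_id"]: doc for doc in docs}


def translations(doc_id, lang):
    """Return the texts of the direct translations of `doc_id` in `lang`."""
    linked = (by_id[i] for i in by_id[doc_id]["links"])
    return [doc["text"] for doc in linked if doc["lang"] == lang]


# The Mandarin "she" sentence resolves to exactly one English sentence,
# while the Tatar sentence still resolves to all three.
print(translations(2, "eng"))
print(translations(1, "eng"))
```

The cost is that a lookup needs a second step to follow the IDs, instead of reading everything from one document.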

dcsan, 18 August 2014 00:17:40 UTC

OK, got it. So each LINK is bidirectional, but the sentences aren't formed into clear-cut groups in which every sentence is a valid translation of every other.

Something like this:
https://www.dropbox.com/s/3ajeb...word-graph.png
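
The same point as a small sketch, using the apple sentences from earlier in the thread as toy data: two sentences can sit in the same connected group without being direct translations of each other. The helper names are illustrative only.

```python
from collections import deque

# Undirected links between sentences from the example earlier in the thread.
TAT = "Ul alma aşıy."
HE = "He eats an apple."
SHE = "She eats an apple."
LINKS = [(TAT, HE), (TAT, SHE)]


def build_graph(links):
    """Build an adjacency map, storing every link in both directions."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def reachable(graph, start):
    """Return every sentence reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen


graph = build_graph(LINKS)
directly_linked = SHE in graph[HE]          # a direct translation?
same_group = SHE in reachable(graph, HE)    # connected through other sentences?
print(directly_linked, same_group)
```

Grouping by reachability would therefore merge sentences that do not mean the same thing, which is why only direct links can be read as translations.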

Well, that's certainly an interesting format, and it solves a lot of the conceptual issues I was having with my Japanese grammar database too, in that some phrases in English relate to Japanese but only in certain contexts, i.e. not a 1:1 translation that's always valid.

When I get some simple scripts built out, I'll share them back.

Thanks!
