Is there a JSON or YAML version of the sentence database anywhere, or any tools to make one from the daily downloads? I run this site: http://jgram.org and wanted to update our examples.
Ideally something like:
cname: uniqueId
ex:
  en: sentence in english
  ja: same in japanese
tags: [list, of, tags]
or similar?
Have you seen the downloads section on this site: http://tatoeba.org/eng/downloads ?
The format is very different from what you describe, though.
Yes, I have, but to match English and Japanese I'd need to write a script to pair sentences together using links.csv. I'm sure I could manage that, but I was just wondering what already exists. It would seem to make the content much more accessible for people if that step wasn't required.
I came across this: https://github.com/allan-simon/tatodb
but running my own graph DB written in C is a bit esoteric. A JSON version of the corpus that could go straight into MongoDB would be ideal.
You want a table format, while the Tatoeba corpus is a graph. See the difference here: http://blog.tatoeba.org/2010/02...eba.html#rule2
A graph cannot be converted to a table without losing information. For example, the sentence "Ul alma aşıy" may be translated as "He eats an apple" and "She eats an apple" or even "It eats an apple" (and also "He is eating an apple", "She is eating an apple"). In your format, it cannot be stored:
{
tat: "Ul alma aşıy.",
yue: "佢食緊個蘋果。",
eng: ???
}
Which variant should we choose? "He eats an apple", or "She eats an apple", or "He is eating an apple", or "She is eating an apple"?
The format you are asking for just doesn’t work well for Tatoeba.
(It can, however, work for a subset of Tatoeba data. If you just need Japanese examples, you can keep only one translation, probably chosen at random, and ignore all the others.
Of course, different people need different subsets, so Tatoeba can’t provide all the possible subsets. You need to generate your own subset yourself.)
Also, you probably don’t need TatoDB.
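A subset like the one described above could be generated with a short script. The following is only a sketch, not an official Tatoeba tool. It assumes that the sentences download is tab-separated with columns id, language code, text, and that links.csv is tab-separated with columns sentence id, translation id. The function names and the output keys (`id`, `jpn`, `eng`) are made up for illustration. It keeps the first linked English sentence it encounters for each Japanese sentence, which is an arbitrary choice.

```python
import csv
import io
import json


def load_sentences(lines, langs):
    """Return {sentence_id: (lang, text)} for the requested languages.

    `lines` is any iterable of tab-separated lines (assumed layout:
    id <TAB> language code <TAB> text).
    """
    sentences = {}
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed or empty lines
        if row[1] in langs:
            sentences[int(row[0])] = (row[1], row[2])
    return sentences


def pair_sentences(sentences, link_lines, src="jpn", dst="eng"):
    """Keep one `dst` translation per `src` sentence (the first link seen)."""
    chosen = {}
    for row in csv.reader(link_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        a, b = int(row[0]), int(row[1])
        if a in sentences and b in sentences:
            if sentences[a][0] == src and sentences[b][0] == dst:
                chosen.setdefault(a, b)  # later links for `a` are ignored
    return [
        {"id": a, src: sentences[a][1], dst: sentences[b][1]}
        for a, b in chosen.items()
    ]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real download files.
    sample_sentences = io.StringIO(
        "1\tjpn\t猫です。\n2\teng\tIt is a cat.\n3\teng\tIt's a cat.\n"
    )
    sample_links = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n")
    loaded = load_sentences(sample_sentences, {"jpn", "eng"})
    for pair in pair_sentences(loaded, sample_links):
        print(json.dumps(pair, ensure_ascii=False))
```

With the real downloads, the two `io.StringIO` objects would be replaced by the opened files (with `encoding="utf-8"`). Every translation that is not picked is dropped, so this is exactly the lossy subset discussed above.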
Thanks for the replies!
Is it a true graph DB, in that the links don't always go both ways?
Otherwise, if it's just a question of multiple entries, you can easily do that with a JSON/YAML format or a document DB like Mongo. You just add an array for the entries:
{
_id: 1234,
tat: ["Ul alma aşıy."],
eng: ["It eats an apple", "she eats an apple", "etc"]
}
Is there a script around to just build a simple table list with example pairs? For my purpose at JGram I just need one example (maybe the first is the most representative?)
No, the links are bidirectional. If X is a translation of Y, then Y is a translation of X.
I don't know of any ready-made scripts for creating sentence pairs, but I think they shouldn't be difficult to write...
Unfortunately, arrays won't work well because information is still lost:
{
_id: 1234,
tat: ["Ul alma aşıy."],
yue: ["佢食緊個蘋果。"],
cmn: ["她在吃個蘋果。", "他在吃個蘋果。"],
eng: ["It eats an apple.", "She eats an apple.", "He eats an apple."]
}
In such a format, there is no way to know that Mandarin "她在吃個蘋果。" means "She eats an apple." (and not "He eats an apple.").
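One way a document store could keep this information is to make each sentence its own document and list the ids of its direct translations, instead of grouping sentences by meaning. This is only a sketch under assumptions: the field names (`_id`, `lang`, `text`, `translations`) are hypothetical, and the input is assumed to be tab-separated (sentences: id, language code, text; links: sentence id, translation id).

```python
import csv
import io


def build_documents(sentence_lines, link_lines):
    """Return {id: document}, one document per sentence.

    Each document records only its *direct* links, so the graph
    structure is kept instead of being flattened into one group.
    """
    docs = {}
    for row in csv.reader(sentence_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip malformed or empty lines
        sid = int(row[0])
        docs[sid] = {"_id": sid, "lang": row[1], "text": row[2], "translations": []}
    for row in csv.reader(link_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        a, b = int(row[0]), int(row[1])
        # Ignore links to unknown sentences and avoid duplicate entries.
        if a in docs and b in docs and b not in docs[a]["translations"]:
            docs[a]["translations"].append(b)
    return docs


if __name__ == "__main__":
    sentences = io.StringIO(
        "1\ttat\tUl alma aşıy.\n"
        "2\tcmn\t她在吃個蘋果。\n"
        "3\tcmn\t他在吃個蘋果。\n"
        "4\teng\tShe eats an apple.\n"
        "5\teng\tHe eats an apple.\n"
    )
    # Each link is listed in both directions, since links are bidirectional.
    links = io.StringIO(
        "1\t2\n2\t1\n1\t3\n3\t1\n1\t4\n4\t1\n1\t5\n5\t1\n2\t4\n4\t2\n3\t5\n5\t3\n"
    )
    for doc in build_documents(sentences, links).values():
        print(doc)
```

In this layout the document for "她在吃個蘋果。" links to "She eats an apple." but not to "He eats an apple.", which is the distinction the array-per-language format loses.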
OK, got it. So each LINK is bidirectional, but the sentences aren't formed into clear-cut groups where all meanings for all sentences pair up.
Something like this:
https://www.dropbox.com/s/3ajeb...word-graph.png
Well, that's certainly an interesting format, and it solves a lot of the conceptual issues I was having with my Japanese grammar database too, in that some phrases in English relate to Japanese, but only in certain contexts, i.e. not a 1:1 translation that's always valid.
When I get some simple scripts built, I'll share them back.
Thanks!