Tatoeba
dcsan, August 14, 2014 at 10:38:38 PM UTC

Is there a JSON or YAML version of the sentence database anywhere, or any tools to make one from the daily downloads? I run this site: http://jgram.org and wanted to update our examples.

Ideally something like:

cname: uniqueId
ex:
  en: sentence in english
  ja: same in japanese
tags: [list, of, tags]

or similar?

User55521, August 14, 2014 at 10:48:38 PM UTC, edited August 14, 2014 at 10:48:59 PM UTC

Have you seen the downloads section on this site: http://tatoeba.org/eng/downloads ?

The format is very different from what you describe, though.

dcsan, August 14, 2014 at 11:01:32 PM UTC

Yes, I have, but to match English and Japanese I'd need to write a script to pair sentences together using the links.csv. I'm sure I could manage that, but I was just wondering what there was already. It would seem to make the content much more accessible for people if that step wasn't required.

I came across this: https://github.com/allan-simon/tatodb
but running my own graph DB written in C is a bit esoteric. A JSON version of the corpus that could go straight into MongoDB would be ideal.
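
For reference, the pairing step described above can be sketched in a few lines of Python. This is only a sketch under assumptions: it takes the downloads to be tab-separated, with sentence rows of the form id, language code, text, and link rows of the form sentence id, translation id. The function name `pair_sentences` and the sample rows are made up for illustration.

```python
import csv
import io
import json


def pair_sentences(sentences_tsv, links_tsv, src="jpn", dst="eng"):
    """Yield one {src: text, dst: text} dict per link from src to dst.

    Assumed layout (not confirmed in this thread):
      sentences: id <TAB> lang <TAB> text
      links:     sentence_id <TAB> translation_id
    """
    sentences = {}
    for row in csv.reader(sentences_tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 3:
            sentences[row[0]] = (row[1], row[2])

    for row in csv.reader(links_tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue
        a, b = sentences.get(row[0]), sentences.get(row[1])
        # Keep only links that go from a src sentence to a dst sentence.
        if a and b and a[0] == src and b[0] == dst:
            yield {src: a[1], dst: b[1]}


# Made-up sample rows standing in for the real download files.
SAMPLE_SENTENCES = (
    "1\tjpn\t彼はりんごを食べる。\n"
    "2\teng\tHe eats an apple.\n"
    "3\teng\tHe is eating an apple.\n"
)
SAMPLE_LINKS = "1\t2\n2\t1\n1\t3\n3\t1\n"

pairs = list(pair_sentences(io.StringIO(SAMPLE_SENTENCES), io.StringIO(SAMPLE_LINKS)))
for doc in pairs:
    # One JSON document per line, a shape that document stores can usually import.
    print(json.dumps(doc, ensure_ascii=False))
```

Each link becomes its own document here, so a sentence with several translations appears several times rather than being forced into a single slot.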

User55521, August 15, 2014 at 3:27:49 AM UTC, edited August 15, 2014 at 3:37:52 AM UTC

You want a table format, while the Tatoeba corpus is a graph. See the difference here: http://blog.tatoeba.org/2010/02...eba.html#rule2

A graph cannot be converted to a table without losing information. For example, the sentence "Ul alma aşıy" may be translated as "He eats an apple" and "She eats an apple" or even "It eats an apple" (and also "He is eating an apple", "She is eating an apple"). In your format, it cannot be stored:
{
tat: "Ul alma aşıy.",
yue: "佢食緊個蘋果。",
eng: ???
}

Which variant should we choose? "He eats an apple", or "She eats an apple", or "He is eating an apple", or "She is eating an apple"?

The format you are asking for just doesn’t work well for Tatoeba.

(It can, however, work for a subset of Tatoeba data. If you just need Japanese examples, you can use only one translation, probably chosen at random, and ignore all the others.

Of course, different people need different subsets, so Tatoeba can't provide all the possible subsets. You need to generate your own subset yourself.)
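
As a rough illustration of the "one translation per sentence" subset described above, here is a minimal Python sketch. It assumes the translations have already been collected into a dict mapping each source sentence to the list of its direct translations; the function name `pick_one_translation` and the `seed` parameter are hypothetical.

```python
import random


def pick_one_translation(translations, seed=0):
    """Map each source sentence to a single translation chosen at random.

    `translations` maps a source sentence to the list of its direct
    translations. A fixed seed keeps the choice reproducible between runs.
    Sentences with no translations are skipped.
    """
    rng = random.Random(seed)
    return {
        src: rng.choice(options)
        for src, options in translations.items()
        if options
    }


# Example data taken from the thread: one sentence, several English translations.
translations = {
    "Ul alma aşıy.": [
        "He eats an apple.",
        "She eats an apple.",
        "He is eating an apple.",
        "She is eating an apple.",
    ],
}

subset = pick_one_translation(translations)
print(subset)
```

The other translations are simply dropped, which is exactly the information loss this approach accepts.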

Also, you probably don’t need TatoDB.

dcsan, August 17, 2014 at 7:47:34 AM UTC

Thanks for the replies!

Is it a true graph DB in that the links are not always both ways?

Otherwise, if it's just a question of multiple entries, you can easily do that with a JSON/YAML format, or a document DB like Mongo; you just add an array for the entries:

{
_id: 1234,
tat: ["Ul alma aşıy."],
eng: ["It eats an apple", "she eats an apple", "etc"]
}

Is there a script around to just build a simple table list with example pairs? For my purpose at JGram I just need one example (maybe the first is the most representative?)

User55521, August 17, 2014 at 6:44:58 PM UTC

No, the links are bidirectional. If X is a translation of Y, then Y is a translation of X.

I don't know of any ready-made scripts for creating sentence pairs, but I think they shouldn't be difficult to write...


Unfortunately, arrays won't work well because information is still lost:
{
_id: 1234,
tat: ["Ul alma aşıy."],
yue: ["佢食緊個蘋果。"],
cmn: ["她在吃個蘋果。", "他在吃個蘋果。"],
eng: ["It eats an apple.", "She eats an apple.", "He eats an apple."]
}

In such a format, there is no way to know that Mandarin "她在吃個蘋果。" means "She eats an apple." (and not "He eats an apple.").
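
The loss can be made concrete with a small Python sketch. The links below are hypothetical, modelled on the example in this post rather than taken from the real corpus, and the helper names `build_graph` and `flatten` are made up. Links are stored in both directions, and flattening a whole group into per-language arrays drops the edges.

```python
from collections import defaultdict

# Hypothetical direct-translation links, following the example above.
# Each node is written as "lang:text".
LINKS = [
    ("tat:Ul alma aşıy.", "eng:She eats an apple."),
    ("tat:Ul alma aşıy.", "eng:He eats an apple."),
    ("tat:Ul alma aşıy.", "cmn:她在吃個蘋果。"),
    ("tat:Ul alma aşıy.", "cmn:他在吃個蘋果。"),
    ("cmn:她在吃個蘋果。", "eng:She eats an apple."),
    ("cmn:他在吃個蘋果。", "eng:He eats an apple."),
]


def build_graph(links):
    """Store every link in both directions (links are bidirectional)."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def flatten(graph, start):
    """Collect everything reachable from `start` into per-language arrays."""
    seen, stack = {start}, [start]
    while stack:
        for neighbour in graph[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    doc = defaultdict(list)
    for node in sorted(seen):
        lang, text = node.split(":", 1)
        doc[lang].append(text)
    return dict(doc)


graph = build_graph(LINKS)
doc = flatten(graph, "tat:Ul alma aşıy.")

# The flattened document lists both Mandarin and both English sentences,
# but no longer records which Mandarin sentence pairs with which English one.
print(doc)

# The graph still knows: this link exists...
print("eng:She eats an apple." in graph["cmn:她在吃個蘋果。"])
# ...and this one does not.
print("eng:He eats an apple." in graph["cmn:她在吃個蘋果。"])
```

Keeping the edge list alongside any flattened view is one way to avoid throwing that pairing away.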

dcsan, August 18, 2014 at 12:17:40 AM UTC

OK, got it. So each LINK is bidirectional, but the sentences aren't formed into clear-cut groups where every sentence in a group is a translation of every other.

something like this:
https://www.dropbox.com/s/3ajeb...word-graph.png

Well, that's certainly an interesting format, and it solves a lot of the conceptual issues I was having with my Japanese grammar database too, in that some phrases in English relate to Japanese, but only in certain contexts, i.e. not a 1:1 translation that's always valid.

When I get some simple scripts built out I'll share them back.

thanks!