Latest files
| File | Description |
|---|---|
| Download: sentences; Date: Jan 11th, 2009; Size: 21.4 MB | Fields: "id"; "lang"; "text". Contains all the sentences. Each sentence is associated with a unique id and a language code (ISO 639-3). |
| Download: links; Date: Jan 11th, 2009; Size: 7.9 MB | Fields: "sentence_id"; "translation_id". Contains the links between the sentences. "1";"77" means that sentence nº77 is the translation of sentence nº1. The reciprocal link is also present. In other words, you will also have a line that says "77";"1". |
| Download: romaji; Date: Nov 6th, 2009; Size: 9 MB | Fields: "sentence_id"; "text". Contains the romaji for Japanese sentences. Note that the romaji has been automatically generated and is not always reliable. |
| Download: jpn_indices; Date: Jan 11th, 2009; Size: 17.7 MB | Fields: "sentence_id"; "meaning_id"; "text". Contains the equivalent of the "B lines" in the file of the Tanaka Corpus distributed by Jim Breen (cf. Current format, on this page). Each entry is associated with a pair of Japanese/English sentences. sentence_id refers to the id of the Japanese sentence. meaning_id refers to the id of the English sentence. |
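
To illustrate how the sentences and links files fit together, here is a minimal Python sketch that looks up the translations of a sentence. The rows are made-up samples in the shapes described above, and the helper names (`build_translation_index`, `translations_of`) are hypothetical, not part of any Tatoeba tool.

```python
from collections import defaultdict

# Made-up sample rows shaped like the "sentences" file: (id, lang, text).
sentences = [
    ("1", "eng", "I am happy."),
    ("77", "fra", "Je suis content."),
]

# Made-up sample rows shaped like the "links" file: (sentence_id, translation_id).
# Both directions are listed, as described for the real file.
links = [("1", "77"), ("77", "1")]


def build_translation_index(link_rows):
    """Map each sentence id to the set of ids of its translations."""
    index = defaultdict(set)
    for sentence_id, translation_id in link_rows:
        index[sentence_id].add(translation_id)
    return index


def translations_of(sentence_id, sentences_by_id, index):
    """Return (lang, text) for every known translation of sentence_id."""
    return [
        sentences_by_id[tid]
        for tid in sorted(index.get(sentence_id, ()))  # sorted for stable output
        if tid in sentences_by_id
    ]


sentences_by_id = {sid: (lang, text) for sid, lang, text in sentences}
index = build_translation_index(links)
print(translations_of("1", sentences_by_id, index))
```

Because the reciprocal link is present in the file, a single pass over the links is enough to look up translations in either direction.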
General information about the files
The data is provided in CSV files, encoded in UTF-8 without BOM. Fields are separated by semicolons and enclosed in double quotes.
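
As a rough sketch, files in this format can be read with Python's standard `csv` module. The sample content below is made up, and two things are assumptions rather than facts from this page: that the files have no header row, and that quotes inside a field are escaped by doubling them (the `csv` default).

```python
import csv
import io


def read_rows(stream):
    """Yield each record of a semicolon-separated, double-quoted file as a list."""
    reader = csv.reader(
        stream,
        delimiter=";",
        quotechar='"',
        skipinitialspace=True,  # tolerate a space after the semicolon
    )
    for row in reader:
        if row:  # skip blank lines
            yield row


# In-memory stand-in for a downloaded file (made-up content).
# For a real file, something like:
#     open(path, encoding="utf-8", newline="")
# where `path` is wherever you saved the download.
sample = io.StringIO('"1";"eng";"I am happy."\n"77";"fra";"Je suis content."\n')
rows = list(read_rows(sample))
print(rows)
```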
Most of the Japanese and English sentences, and therefore most of the sentences in Tatoeba, come from the Tanaka Corpus, which is in the public domain. Note that this corpus will now be maintained from Tatoeba, so you will find the most up-to-date data here.
Some of the sentences are annotated with brackets and Trang was too lazy to strip them off. In case you wonder, they were used to indicate the correspondence of words between a sentence and its translations. For instance, "I am {happy}{1}" and "Je suis {content}{1}". The brackets here indicate that "happy" and "content" mean the same thing.
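
If you want to remove these annotations yourself, a regular expression is probably enough. The sketch below assumes every annotation has exactly the form `{word}{number}` as in the example above, with no nested brackets; the function names are hypothetical.

```python
import re

# Matches "{word}{number}", capturing the word and the number.
ANNOTATION = re.compile(r"\{([^{}]*)\}\{(\d+)\}")


def strip_annotations(text):
    """Replace each "{word}{n}" with just "word"."""
    return ANNOTATION.sub(r"\1", text)


def aligned_words(text):
    """Return a dict mapping each annotation number to its word."""
    return {number: word for word, number in ANNOTATION.findall(text)}


print(strip_annotations("I am {happy}{1}"))
print(aligned_words("Je suis {content}{1}"))
```

Sentences without annotations pass through `strip_annotations` unchanged, so it can be applied to the whole file.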