Warning

The data you will find here will NOT be useful unless you are coding a language tool or doing some work on data processing.

If you want data that you can use as a humble language learner, you can check out the lists section where you can build your own lists of sentences or view others' lists and print them.

Creative commons

These files are released under CC-BY.

For those who wonder why we're not releasing the data into the public domain, some explanations are available here.

Questions?

If you have questions or requests, feel free to contact us. In general, we answer quickly.

Latest files

File Description
Download: sentences
Date: Jan 11th, 2009
Size: 21.4 MB

Fields: "id"; "lang"; "text"

Contains all the sentences. Each sentence is associated with a unique id and a language code (ISO 639-3).

Download: links
Date: Jan 11th, 2009
Size: 7.9 MB

Fields: "sentence_id"; "translation_id"

Contains the links between the sentences. "1";"77" means that sentence nº77 is the translation of sentence nº1. The reciprocal link is also present. In other words, you will also have a line that says "77";"1".
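As a rough illustration of how the links could be used, the sketch below groups translation ids by sentence id. The function name load_links and the in-memory sample are made up for this example; the sample holds only the two lines quoted above.

```python
import csv
import io
from collections import defaultdict


def load_links(stream):
    """Map each sentence id to the set of ids of its translations."""
    translations = defaultdict(set)
    reader = csv.reader(stream, delimiter=";", quotechar='"')
    for sentence_id, translation_id in reader:
        translations[int(sentence_id)].add(int(translation_id))
    return translations


# In-memory stand-in for the links file, using the pair quoted above.
sample = io.StringIO('"1";"77"\n"77";"1"\n')
links = load_links(sample)
```

Because the reciprocal link is stored too, looking up either id gives the other one without any extra work.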

Download: romaji
Date: Nov 6th, 2009
Size: 9 MB

Fields: "sentence_id"; "text"

Contains the romaji for Japanese sentences. Note that the romaji has been automatically generated and is not always reliable.

Download: jpn_indices
Date: Jan 11th, 2009
Size: 17.7 MB

Fields: "sentence_id"; "meaning_id"; "text"

Contains the equivalent of the "B lines" in the file of the Tanaka Corpus distributed by Jim Breen (cf. Current format, on this page). Each entry is associated with a pair of Japanese/English sentences. sentence_id refers to the id of the Japanese sentence. meaning_id refers to the id of the English sentence.

General information about the files

The data is provided in CSV files, encoded in UTF-8 without BOM. Fields are terminated by a semi-colon and enclosed by double quotes.
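A minimal sketch of reading this layout with Python's standard csv module follows. The function name read_records and the sample rows are invented for illustration. It assumes that no space follows the semicolon (as in "1";"77"), and it does nothing special about double quotes embedded in a sentence, since this page does not say how those are escaped.

```python
import csv
import io


def read_records(stream):
    """Yield each line of a semicolon-separated, double-quoted file as a list of strings."""
    yield from csv.reader(stream, delimiter=";", quotechar='"')


# Made-up rows in the layout of the sentences file ("id"; "lang"; "text").
# The first text contains a semicolon to show that quoted fields stay whole.
sample = io.StringIO('"1";"eng";"Hello; world."\n"2";"fra";"Bonjour."\n')
records = list(read_records(sample))
```

For a real download, the stream would come from something like open(path, encoding="utf-8", newline=""), where path is wherever you saved the file.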

Most of the Japanese and English sentences are from the Tanaka Corpus, which is in the public domain. In other words, most of the sentences in Tatoeba are from there. Note that this corpus will now be maintained from Tatoeba, so you will find the most up-to-date data here.

Some of the sentences are annotated with curly brackets and Trang was too lazy to strip them off. In case you wonder, they were used to indicate the correspondence of words between a sentence and its translations. For instance, I am {happy}{1} and Je suis {content}{1}. The brackets here indicate that happy and content mean the same thing.
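If you would rather strip the annotations yourself, a regular expression along these lines may be enough. It is only a sketch: the name strip_annotations is made up, and it assumes every annotation has exactly the {word}{number} shape shown in the example above.

```python
import re

# Matches {word}{number} and captures the word.
# Assumption: the second pair of brackets always holds digits only.
ANNOTATION = re.compile(r"\{([^{}]*)\}\{\d+\}")


def strip_annotations(text):
    """Replace each {word}{number} annotation with the bare word."""
    return ANNOTATION.sub(r"\1", text)


english = strip_annotations("I am {happy}{1}")
french = strip_annotations("Je suis {content}{1}")
```

Sentences without annotations pass through unchanged, so the function can be applied to every row.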