Warning

The data on this page will NOT be useful unless you are building a language tool or doing data-processing work.

If you are a language learner looking for data you can use directly, check out the lists section, where you can build your own lists of sentences, or view and print lists made by others.

Creative commons

These files are released under CC-BY.

For those who wonder why we're not leaving the data in the public domain, some explanation here.

Questions?

If you have questions or requests, feel free to contact us. We usually answer quickly.

Downloads

Sentences

Download
1. http://tatoeba.org/files/downloads/sentences.csv
2. http://tatoeba.org/files/downloads/sentences_detailed.csv
Fields and structure
1. id [tab] lang [tab] text
2. id [tab] lang [tab] text [tab] username [tab] date_added [tab] date_last_modified
Description
Contains all the sentences. Each sentence is associated with a unique id and a language code (ISO 639-3).
We provide two files. The first file (sentences.csv) contains only the minimum fields. The second file (sentences_detailed.csv) contains more information, for those who would like to filter the sentences by, for instance, the contributor who owns the sentence or the date when it was added.
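As an illustration of the field layout above, here is a minimal Python sketch for reading either sentences file. It assumes the files are UTF-8 encoded and have no header row; neither is stated on this page. The names `Sentence` and `read_sentences` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Sentence(NamedTuple):
    id: int
    lang: str  # ISO 639-3 code
    text: str


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield one Sentence per tab-separated line.

    Only the first three fields are used, so the same function accepts
    lines from sentences.csv or sentences_detailed.csv.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines (assumption: ids are plain digits).
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        yield Sentence(int(fields[0]), fields[1], fields[2])
```

A possible use, with a local copy of the file: open it with `open("sentences.csv", encoding="utf-8")` and keep only the sentences whose `lang` matches the code you need, for example `[s for s in read_sentences(f) if s.lang == "eng"]`. Reading line by line avoids loading the whole file into memory before filtering.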

Links

Download
http://tatoeba.org/files/downloads/links.csv
Fields and structure
sentence_id [tab] translation_id
Description
Contains the links between sentences. 1 [tab] 77 means that sentence #77 is a translation of sentence #1. The reciprocal link is also present: the file also contains a line that says 77 [tab] 1.
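Because each link is stored in both directions, a lookup table of translations can be built in a single pass without adding reverse entries by hand. The following Python sketch shows one way to do it; `read_links` is a hypothetical name, and the absence of a header row is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def read_links(lines: Iterable[str]) -> Dict[int, Set[int]]:
    """Map each sentence id to the set of ids of its translations."""
    translations: Dict[int, Set[int]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines.
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        sentence_id, translation_id = int(fields[0]), int(fields[1])
        # The reciprocal line exists in the file, so no reverse insert here.
        translations[sentence_id].add(translation_id)
    return dict(translations)
```

With the result stored in `links`, `links.get(1, set())` gives the ids of all direct translations of sentence #1, or an empty set if it has none.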

Tags

Download
http://tatoeba.org/files/downloads/tags.csv
Fields and structure
sentence_id [tab] tag_name
Description
Contains the list of tags associated with each sentence. 381279 [tab] proverb means that sentence #381279 has been tagged with "proverb".
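A common need is the inverse view: all the sentences that carry a given tag. Here is a minimal Python sketch, assuming no header row; `sentences_by_tag` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def sentences_by_tag(lines: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each tag name to the set of sentence ids tagged with it."""
    by_tag: Dict[str, Set[int]] = defaultdict(set)
    for line in lines:
        # Split on the first tab only, so the tag name is kept whole
        # even if it contains spaces or further tabs.
        sentence_id, sep, tag_name = line.rstrip("\r\n").partition("\t")
        if not sep or not sentence_id.isdigit():
            continue  # blank or malformed line
        by_tag[tag_name].add(int(sentence_id))
    return dict(by_tag)
```

With the example line from the description, the result would contain 381279 in the set stored under "proverb".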

Lists

Download
http://tatoeba.org/files/downloads/user_lists.csv
Fields and structure
id [tab] username [tab] date_created [tab] date_modified [tab] list_name
Description
Contains the lists created by users, one list per line.

Sentences in lists

Download
http://tatoeba.org/files/downloads/sentences_in_lists.csv
Fields and structure
list_id [tab] sentence_id
Description
Indicates which sentences are in each list. 13 [tab] 381279 means that sentence #381279 is part of the list with id 13.
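The list_id field here refers to the id field of user_lists.csv, so the two files can be joined to find the members of a list by name. This Python sketch shows one way to load both. The function names are hypothetical, and the date fields are ignored because their format is not described on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_list_names(lines: Iterable[str]) -> Dict[int, str]:
    """Map list id to list name, from lines of user_lists.csv."""
    names: Dict[int, str] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # blank or malformed line
        # list_name is the last field; rejoin in case it contains tabs.
        names[int(fields[0])] = "\t".join(fields[4:])
    return names


def read_list_members(lines: Iterable[str]) -> Dict[int, List[int]]:
    """Map list id to sentence ids, from lines of sentences_in_lists.csv."""
    members: Dict[int, List[int]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue
        members[int(fields[0])].append(int(fields[1]))
    return dict(members)
```

Given `names` and `members` from these two functions, `members.get(list_id, [])` returns the sentences of a list, and `names[list_id]` gives its title.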

Japanese indices

Download
http://tatoeba.org/files/downloads/jpn_indices.csv
Fields and structure
sentence_id [tab] meaning_id [tab] text
Description
Contains the equivalent of the "B lines" in the file of the Tanaka Corpus distributed by Jim Breen. See this page to learn the format. Each entry is associated with a pair of Japanese/English sentences. sentence_id refers to the id of the Japanese sentence. meaning_id refers to the id of the English sentence.
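To make the roles of sentence_id and meaning_id concrete, here is a Python sketch that reads the index entries and pairs each Japanese sentence with its English counterpart. It does not parse the text field itself. `IndexEntry`, `read_jpn_indices`, and `sentence_pairs` are hypothetical names, and `texts` is assumed to be a dictionary from sentence id to sentence text built from sentences.csv.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class IndexEntry(NamedTuple):
    sentence_id: int  # id of the Japanese sentence
    meaning_id: int   # id of the English sentence
    text: str         # raw index text, left unparsed


def read_jpn_indices(lines: Iterable[str]) -> Iterator[IndexEntry]:
    """Yield one IndexEntry per tab-separated line of jpn_indices.csv."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t", 2)
        if len(fields) != 3:
            continue  # blank or malformed line
        try:
            yield IndexEntry(int(fields[0]), int(fields[1]), fields[2])
        except ValueError:
            continue  # non-numeric id


def sentence_pairs(
    entries: Iterable[IndexEntry], texts: Dict[int, str]
) -> Iterator[Tuple[str, str]]:
    """Yield (japanese_text, english_text) for entries whose ids are known."""
    for entry in entries:
        if entry.sentence_id in texts and entry.meaning_id in texts:
            yield texts[entry.sentence_id], texts[entry.meaning_id]
```

Entries whose ids are missing from `texts` are skipped, which keeps the sketch usable with a filtered subset of sentences.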

General information about the files

The files provided here are updated every Saturday at 9 AM, France time.

Most of the Japanese and English sentences are from the Tanaka Corpus, which is in the public domain.