Wall (threads)
Tips
Before asking a question, please search for your question in the FAQ.
We want to maintain a pleasant atmosphere for civil discussions. Please read our rules against bad behavior.

I think we should add the Piedmontese language (Piemontèis/lenga piemontèisa) to the language menu. Piemontèis, although not recognized as official by the Italian government, is an actual language still spoken by a lot of people in the Piedmont region in northwestern Italy. It has a grammar and vocabulary of its own, with substantial differences from the Italian ones.
Please add the Piemontèis language and I'll contribute many sentences and translations directly from native speakers.
Thank you

But please note that the FAQ is out of date: you should now contact sysko, not Trang (the email is the same though, team@tatoeba.org).

Thank you!
I've sent the e-mail.
I can't quite understand the third step, though... do you know where I should go to add these 5 new sentences?
Thanks

You have to add sentences here, on Tatoeba. :)
You can add new sentences here:
http://tatoeba.org/eng/sentences/add
(Choose any language right now; it will be changed later)
Or you can just add translations of other sentences.

Thank you again.
Now, if you're interested, you can check the new Piemontèis list :)

It has an ISO code (pms), so it can be added.
Please refer to the FAQ:
http://tatoeba.org/eng/faq#new-language

I want to contribute English-to-Tamil translations.
But I do not see Tamil in the language menu. Tamil is an official language in at least 3 countries (India, Singapore and Sri Lanka) and an unofficial minority language in several others, including Malaysia, Mauritius and South Africa.
It would be very useful to many Tamil speakers if Tamil could be included in the options.
Thank you,
Rajesh

The usual practice is to contribute a number of sentences in the language, then write to Sysko proposing the addition of the new language, along with the desired flag and the ISO code of the language; if I'm not wrong, there's nothing else.

This means: at the very beginning, simply use "unknown language" instead of Tamil.

Are there any statistics available about the number of connections between languages? Something simple, like the sentence statistics:
http://tatoeba.org/eng/stats/sentences_by_language
but counting not only number of sentences, but also the number of the direct translations/links from this language to others?

Hello! Here is the number of links from and to each language. The entire list is 4458 lines long, so I won't put it all here. Does anyone have any ideas about where best to upload it? I can also share the Ruby file I generated it with if anyone wants it.
Here are the top 10:
eng jpn 173372
eng fra 93088
eng spa 85177
eng tur 73488
eng deu 63229
fra epo 60453
deu epo 56084
eng epo 51520
deu fra 48503
eng ita 42851
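
For anyone who wants to reproduce counts like these, here is a minimal Python sketch (not the Ruby file mentioned above, which was not posted). It assumes the sentence export is tab-separated with id, language and text columns, and the link export is tab-separated with a sentence id and a translation id; check the Download page for the actual layout. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Read tab-separated rows without treating quote characters specially."""
    return csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def count_language_links(sentence_rows, link_rows):
    """Count direct links for each ordered (source language, target language) pair."""
    # Assumed column order: sentence id, language code, text.
    lang_of = {row[0]: row[1] for row in sentence_rows}
    counts = Counter()
    for src, dst in link_rows:
        # Skip links that point at sentences missing from the sentence export.
        if src in lang_of and dst in lang_of:
            counts[(lang_of[src], lang_of[dst])] += 1
    return counts


# Tiny made-up sample standing in for the real exports.
sentences = io.StringIO("1\teng\tHello.\n2\tjpn\tこんにちは。\n3\tfra\tBonjour.\n")
links = io.StringIO("1\t2\n2\t1\n1\t3\n3\t1\n")

link_counts = count_language_links(read_tsv(sentences), read_tsv(links))
for (src, dst), n in link_counts.most_common():
    print(src, dst, n)
```

With the real files, the two `io.StringIO` objects would be replaced by `open(..., encoding="utf-8", newline="")` on the downloaded exports.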

The whole list can be seen here: https://docs.google.com/spreads...U1xTWVZc3JMQ3c

Wow! These are fascinating statistics, thanks a lot! Google Docs is an ideal place to share a document.

No problem! :)

Is there a way to link sentences that are completely separate from one another?

Yes, Pfirsichbäumchen, that is possible with the Visual Linker program:
http://userscripts.org/scripts/show/118538
(You need to install Greasemonkey first.)

Thank you very much! :o)

Is there any way to download the audio files as a set? I can't see a link from the Download page. If not, is there any objection to scraping the audio from the website using a tool? If that's OK, is there a speed I should limit myself to? I don't want to hammer the server.

Note to myself: create an archive of all the audio files (and once that's done, it would be nice if someone could create a torrent of them).

Translation of Tatoeba itself...
I was wondering if the Tatoeba interface in different languages is still being updated through Launchpad. I believe I have very good reasons to ask:
In the case of Turkish:
- With all respect to previous contributors to the translation effort, there are basic phrases which I assume were misunderstood and thus inappropriately translated.
A very simple example would be the words "from" and "to" above the combo boxes in the search area. Most probably a lack of context led the translator to interpret those words as if they referred to people (as in an email).
- Some parts are still not translated.
So my point is, I made a few additions and alterations on Launchpad just to check whether it still works, but it doesn't seem to. Thank you.

I think that Launchpad is still working, but it says in the FAQ that "You should know that the whole process is not automated yet." So Sysko still has to manually update the translations.
There was also some vandalism of the translations on Launchpad. I don't know if that's still causing problems.

The vandalism problems have been fixed, so now it's only because I haven't found the time to put the updated version of the interface texts on Launchpad. (Actually, once again, that's one of the processes that was handled by Trang, and I need to find out again how it works, though I'm pretty sure it's explained somewhere in my mail.)
At the end of the month I'll be on vacation for 4 months, so from the beginning of July I should be much more available.

By the way, I seem to have missed it, but what happened to Trang? :-?

I knew she was quite busy with her job, but I don't know exactly.

I'd better make the changes I intend to do before then. :) Thank you both.

I'm done! (I know you were not waiting merely for me to update it, but just to let you know...)

Do you think it is too soon for me to become a corpus administrator?
I am going through wrong flags and tags, and I find that I lack the permissions to work properly. I am also creating extra work for the admins who are willing to help me.
It would be easier for me and for everyone else if I had extra permissions.
What do you think?

Hello everybody. I am new and I am very excited.

Welcome to Tatoeba!
Do you already know how things work around here? Don’t hesitate to ask!

Licensing question.
Having just read a thread from a month ago, I just realized that it is possible I am in violation of the Tatoeba license, unwittingly. I am the author of a free software (GPL) language study tool which uses the Tatoeba corpus for example sentences. I give credit to the Tatoeba project in the COPYING section of the code.
However, based on reading a thread from a month ago, it is clear that some users would like *personal* recognition for their work. I'm quite happy to oblige, but I have no idea how these users want recognition.
Is it sufficient to put the sentence owner's user name next to the sentence when it is displayed? Note: This is not a web app so making a url to the sentence would be unhelpful.
This really needs to be made a lot more obvious and explicit. Since individual sentences are likely not able to be copyrighted, I had assumed that the copyright was on the corpus as a collection. However, people who make large contributions are probably entitled to make copyright claims, so this is potentially a major problem.
I have a request: Please make the license explicit with respect to how recognition should be given to contributors, especially given the fact that users are pseudonymous. Without this information, it really does make the data useless.
I'd love to write a long rant about what a difficult subject licensing is, especially when you consider the subtle differences in copyright law from country to country. Things that seem perfectly obvious from one perspective are far from obvious from another. But I'll save the rant for another day...

Alternatively, you can extract the list of all authors from the sentences_detailed.csv[1] file. So probably you can simply put the list of all users in the About box somewhere... It's gonna be a long list, though.
> Since individual sentences are likely not able to be
> copyrighted, I had assumed that the copyright was
> on the corpus as a collection.
It’s quite a difficult question. Sysko said that a French court once said that sentences *can* be copyrighted if they contain artistic value. That’s why it's forbidden to copy others’ sentences to Tatoeba without explicit permission: to be on the safe side.
[1] http://tatoeba.org/files/downlo...s_detailed.csv
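
A rough Python sketch of extracting that author list follows. It assumes sentences_detailed.csv is tab-separated with the username in the fourth column and that sentences without an owner carry a `\N` placeholder; both are assumptions to verify against the real file. The function name `list_authors` and the sample rows are invented for illustration.

```python
import csv
import io


def list_authors(rows, username_column=3):
    """Return the distinct usernames found in the given rows, sorted case-insensitively."""
    authors = set()
    for row in rows:
        if len(row) <= username_column:
            continue  # malformed or short line: nothing to credit
        name = row[username_column].strip()
        # Assumption: ownerless sentences are marked with a literal \N.
        if name and name != r"\N":
            authors.add(name)
    return sorted(authors, key=str.lower)


# Made-up sample in the assumed layout: id, lang, text, username, added, modified.
sample = io.StringIO(
    "1\teng\tHello.\tBob\t2012-01-01 00:00:00\t2012-01-01 00:00:00\n"
    "2\tjpn\tこんにちは。\talice\t2012-01-02 00:00:00\t2012-01-02 00:00:00\n"
    "3\tfra\tBonjour.\t\\N\t2012-01-03 00:00:00\t2012-01-03 00:00:00\n"
    "4\teng\tGoodbye.\tBob\t2012-01-04 00:00:00\t2012-01-04 00:00:00\n"
)

authors = list_authors(csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE))
print("\n".join(authors))
```

The resulting list could then be shipped as a plain text file alongside the application, or shown in an About box.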

It's definitely tricky. I think the policy is a sound one, although the original Tanaka corpus contained quite a few non-original English sentences. To me the issue seems slightly similar to the licensing problem that OpenStreetMap had. In that case, the data itself definitely cannot be copyrighted, and there was some concern over the applicability of CC to a collection of data. In the end, they went with a share-alike database license.
I seriously wring my hands every time I have to deal with licenses since the issue is so difficult.

I think we should develop a type of license through which it would be possible to unconditionally use the content if the usage is clever and useful and impossible to do so if the usage is not.
I would never contemplate suing anybody who advocates education and learning. But I would definitely sue anybody who's trying to make money from, or mediocre use of, my hard work. I'd do it as a matter of principle.

The easiest way to do this is simply not to grant any license at all until you review what the person wants to do with it. This is actually the more "normal" route for licensing.
There are advantages and disadvantages to the approach, though. Clearly the big advantage is choosing who can and can't use the data. The disadvantages are a little more subtle.
The less restrictive you make a license, the easier it is for people to contribute to the project. For example, it is important for me that my project allows people to modify it for their own purposes. That means that every part needs to have a license not just for me, but for all of my users too. If Tatoeba had a more restrictive license, I would have absolutely no interest in using it.
With the less restrictive license, I use it in my project and I have a lot of incentive to see that it continues to be a success. For example, I will almost certainly contribute Japanese audio in the near future.
My own humble project, although I have worked hard on it, is very small. But I use resources that would be impossible to gather without the permissive licenses that have become popular. I use EDICT, KANJIDIC, CC-CEDICT, a kanji stroke order font, a Japanese de-inflection table, JLPT vocabulary lists, Japanese example sentences showing grammar, etc. That doesn't even mention the software resources like ruby, rspec, GTK+, etc, etc, etc. Even though I've put in a couple of thousand hours into my code, the resources that I depend on are easily in the millions of hours and touched by thousands of people.
This is the true power of "free culture". It's a balancing act. We don't want people just sponging off our work, but easy and open access allows people to do surprising and wonderful things. Licensing is the tightrope you need to walk to get everything working. From my own perspective, Tatoeba has got it right, even if some people profit without contributing anything.

I think that is the point of the CC-BY license: as soon as anybody takes money for the work, he won’t get far, because everybody will see that the sentences are freely available here.
However, if someone has a clever idea for making money with the sentences (as long as it's not by keeping people from knowing that they are freely available), I think that it’s fair to do so. E.g. if someone converted the sentences into a mobile app and charged for that.

There is a ‘Terms of Use’ section on the site:
http://tatoeba.org/eng/terms_of_use#eng-version
> Attribution: To re-distribute a text page in any form, provide credit to the authors either by including
> a) a hyperlink (where possible) or URL to the
> page or pages you are re-using,
> b) a hyperlink (where possible) or URL to an
> alternative, stable online copy which is freely
> accessible, which conforms with the license, and
> which provides credit to the authors in a manner
> equivalent to the credit given on this website, or
> c) a list of all authors. (Any list of authors may
> be filtered to exclude very small or irrelevant
> contributions.)
Therefore, I believe (though I may be mistaken) that adding a hyperlink will suffice, even though it is an offline application.
By the way, where can we see your application? I'd like to have a look at it. :)

Thanks for the feedback. I'm currently using option b. But to be explicit, I'm only linking to the Tatoeba website, not to each individual sentence.
What I will probably end up doing is removing Tatoeba completely and letting the user download it instead. This would allow them to update it more freely anyway.
My project is called JLDrill: http://jldrill.rubyforge.org/ It was originally a tool for Japanese study, but I've slightly enhanced it to allow for Chinese study. Basically it's a spaced repetition software program similar to Anki. However, it features a somewhat improved (IMHO) spaced repetition algorithm, an integrated Edict (or CEDict) dictionary, popup Chinese character/word search capability (similar to the Rikaichan plugin for Firefox), and of course Tatoeba corpus lookup.
It's written in Ruby using GTK+ for a GUI. It's a bit of a pain to install on Windows, but possible. Mac is rather more difficult because you have to build all the dependencies (I don't have a Mac unfortunately). Linux/BSD is probably the easiest platform. The current version has a few known bugs, but unfortunately I started ripping out the internals and it's taking me a while to put it back together again...

Hello Tatoeba community. This project is a great tool for multiple purposes, unquestionably the most common of which is being a reliable source for those in need of example sentences for their foreign language studies.
I'm not very familiar with the project and how things work, but as my short experience has revealed, there are a few points that need to be made clear. You can consider this partly a suggestion and partly a question.
Like any other user, I guess, I use the search area above to find sentences related to my keywords. Those keywords are sometimes (for some languages) very, very tricky. Let's say I search for a verb to find sentences that contain this verb. Yet the search results contain a lot more than that: nouns which look like this verb but have nothing to do with it, another verb which looks like my verb when conjugated for a particular tense or person, etc.
At this point, the question is: is it really a good idea to keep a sentence as simple text in the database? Wouldn't it be better to create metadata (more developed than tags) for each sentence? By doing so we could use this database more efficiently and define the relations between languages more clearly. It would also help us to maintain more developed search facilities.
Just to be clearer, I'm giving a very simple example. When I add a sentence in German:
"Meine Geschichte ist zu grotesk, um eine Lüge zu sein."
I should add extra information about the verb, for instance: the verb is "ist" and the infinitive form is "sein".
So when a user wants to see a sentence which contains the verb "sein", s/he can find it. Otherwise s/he will search for "sein" and many irrelevant sentences will come up, like:
"Ich liebe seine Schwester". As one can see, this "sein-e" is not what we are looking for.
Long story short, my badly organized example above is just a little demonstration of how storing sentences with added metadata would help users find what they are looking for.
I'm not quite sure if this is the right place for such a suggestion/question, but here it is.
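
To make the difference concrete, here is a small Python sketch contrasting plain substring search with a lookup on per-sentence metadata. Nothing here reflects how Tatoeba actually stores sentences: the `Sentence` class, the `(lemma, part of speech)` pairs and both search functions are hypothetical, and the annotations are entered by hand for the two example sentences.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    lang: str
    # Hypothetical metadata: (lemma, part of speech) pairs added by a contributor.
    lemmas: set = field(default_factory=set)


def search_text(sentences, query):
    """Plain substring search, roughly what a keyword search on raw text does."""
    return [s for s in sentences if query.lower() in s.text.lower()]


def search_lemma(sentences, lemma, part_of_speech):
    """Metadata search: match only sentences annotated with this lemma and category."""
    return [s for s in sentences if (lemma, part_of_speech) in s.lemmas]


corpus = [
    Sentence(
        "Meine Geschichte ist zu grotesk, um eine Lüge zu sein.",
        "deu",
        {("sein", "verb")},
    ),
    Sentence(
        "Ich liebe seine Schwester.",
        "deu",
        {("lieben", "verb"), ("sein", "possessive")},
    ),
]

text_hits = search_text(corpus, "sein")            # matches both sentences
verb_hits = search_lemma(corpus, "sein", "verb")   # matches only the first one

print(len(text_hits), len(verb_hits))
```

The substring search returns the "seine Schwester" sentence as well, while the metadata lookup returns only the sentence in which "sein" is actually the verb.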

If you search the wall's history you'll get more information on what has been discussed already on the metadata subject.
You can also get info here: http://blog.sysko.fr/post/13
Welcome to Tatoeba!

Glad to be here.
Is there a search facility for wall posts? You are not expecting me to read every single post in 190+ pages, are you? :))

Actually, yes: in the future we will have something dedicated to metadata, and one of these metadata fields can be about this (the new architecture I'm building will not restrict what kind of key=>value metadata we can provide).
In parallel with Tatoeba, a friend and I have already started to think about and code some software to decompose sentences into their grammatical structure and the semantic fields of their words.
After that, we can imagine that with a mixed system of machine learning and user proofreading we could easily increase the rate at which we add this kind of metadata by a lot.

Thank you! This is just good news. The news may not be new though, for other users :D as I just found out. As a matter of fact I took sacredceltic's advice and made a google search on tatoeba wall for metadata. I was very surprised to see years ago several other people had been talking about it, using very similar words to mine. Like this one by Zifre: http://tatoeba.org/tur/wall/sho...7#message_4637
Now I can sleep in peace :))

You can search Google for "Tatoeba metadata".
Wall posts are indexed by search engines.

I agree it's a very good idea, but not everybody would be able to "categorize" words. Believe it or not, many people don't know what a verb is...
It would be GREAT to have this resource! It wouldn't be easy to categorize more than 1.5 million sentences, but a very interesting challenge.

@alexmarcelo, it's true that not everyone can do this. But as you said, there is a huge number of sentences. This is why such an initiative should be taken: without accurate and clear meta-information, such a huge resource could turn into a useless bunch of sentences. Then what is the difference between Tatoeba and a result I get from a web search engine? (Sorry if I sound critical; that is not what I'm trying to do. I just want to stress the necessity of this practice, as you already agreed.)
@xekri, I share your view that different languages have different structures and therefore different categorizations. I am not familiar with Lojban, but if you say that such a mechanism would be useful for Lojban too, it means that a common denominator needs to be found: a common solution which can apply to every language.

The Japanese indices contain exactly the information that you want, but only for Japanese. Also, unless it has been added recently, there is no general UI for updating the Japanese indices. I would dearly love this to become a general feature of Tatoeba. There are definitely a lot of issues that have to be thought out, but IMHO this is a must-have feature.