Wall messages by Demetrius

Demetrius, 2010-09-17 15:52:06 UTC

> that's cheating
Of course it is. ;)

But what I meant is that dictionaries often provide example sentences; it depends on the dictionary. And technically half of the Tatoeba sentences could easily end up in a dictionary. That's not a reason to delete them.

Demetrius, 2010-09-17 14:12:42 UTC

> rather than word I will say "one sementical unit"
What is a semantic unit? A seme?
Then "Buy" has 2 of these.

> that if an entry in tatoeba can also be
> found in a dictionnary, (delta the flexion)
In WWWJDIC you can find most phrases from Tatoeba.
WWWJDIC is a dictionary.

It means:
All Japanese phrases should be deleted.

IMHO everything needs moderation. Boracasli lacks it, but forbidding all the 1-word sentences isn't any better.

Demetrius, 2010-09-16 09:37:33 UTC

But "Buy" is an example of the imperative mood.

In many languages it would have a T-V distinction...

Demetrius, 2010-09-16 08:59:39 UTC

Actually, I don't understand why "Buy" is less important than "Cat is not human". >_<

Demetrius, 2010-09-16 08:15:41 UTC

Can you give clearer guidelines?

IMHO sentences shouldn't be deleted simply because you suspect they were taken from a dictionary.


Also consider polysynthetic languages, where a great many very useful phrases can be said in one word. For example, in a Chukchi phrasebook I’ve found the following single-word sentences:
«Титэтгивик?» means «How much?»
«Тантыԓянвыԓьын?» means «Is the road good?»

Do you think they are also out of the scope of this project?

Demetrius, 2010-09-15 11:39:51 UTC

But it is useful for natural language processing:
a) automatic translators,
b) sentence classification

Tatoeba is a text corpus. Programmers can write an algorithm, but they need a text corpus to make it work.



For example, in Tatoeba bad sentences have tags "rude", "offensive", "XXX".

Using a simple algorithm[1] and Tatoeba sentences, anyone can write a program that looks at any sentence in the same language and says: "It's rude" or "It's not rude". It can then be used, for example, to hide some text from children.

Or, for example, it's possible to create a program that detects a language using Tatoeba data.

Or check whether the text is optimistic or pessimistic.

Or even to create automatic translators. (But for these, a lot of text is necessary. For many languages we have too few sentences for this... now :))

And many other things... Practically all programs working with language need a text corpus!


Tatoeba is not the only corpus; there are many of them. But Tatoeba is better because:
* It's free,
* It's multilingual (corpora usually cover only 1 or 2 languages, not more).


[1] For example, you can use a naive Bayesian classifier for this.
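To make footnote [1] a bit more concrete, here is a rough sketch of a naive Bayes classifier that learns "rude" vs. "not rude" from tagged sentences. The training sentences below are made up for illustration (they are not real Tatoeba data), and the names `NaiveBayes`, `tokenize` and `TRAINING` are just my own choices. A real program would load the tagged sentences from a Tatoeba export and use a tokenizer suited to the language.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is good enough for a toy demo.
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over words, with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training sentences
        self.vocab = set()           # every word seen during training

    def train(self, sentences):
        for text, label in sentences:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior: how common this label is among training sentences.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                # Log likelihood of the word given the label, smoothed so
                # that unseen words do not zero out the whole product.
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data standing in for sentences tagged "rude" on Tatoeba.
TRAINING = [
    ("you are a stupid idiot", "rude"),
    ("shut up you fool", "rude"),
    ("what a stupid fool", "rude"),
    ("thank you very much", "not rude"),
    ("the cat is on the table", "not rude"),
    ("have a nice day", "not rude"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("you stupid fool"))
print(clf.classify("thank you for the nice day"))
```

With so few sentences the result is only a toy, which is exactly why a big tagged corpus matters: the more tagged sentences there are, the better the word statistics get.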

Demetrius, 2010-09-15 11:01:46 UTC

=))

We don't. :) On the wall, there may be discussion. But if it's a sentence, it's 100% OK.

We need different sentences! ^^

And we do have patriotic sentences. :)

See:
http://tatoeba.org/eng/sentences/show/467460
http://tatoeba.org/eng/sentences/show/485186

Demetrius, 2010-09-15 09:50:26 UTC

Cool, thank you. ^^

Demetrius, 2010-09-15 09:48:33 UTC

زنده باد زبان فارسی
:)

Can you add this as a sentence please? :)

Demetrius, 2010-09-14 17:37:17 UTC

Cool!

Demetrius, 2010-09-14 14:17:02 UTC

Now I don’t know if I know what I’ve said.

ö (It’s my new way of writing :o)

Demetrius, 2010-09-13 00:20:43 UTC

What the?..

What is it supposed to mean?

Demetrius, 2010-09-13 00:19:28 UTC

0

Demetrius, 2010-09-12 21:55:23 UTC

IMHO their license is too restrictive.

Demetrius, 2010-09-12 21:53:05 UTC

IMO, this kind of metadata is not fit for tags.

Demetrius, 2010-09-12 21:47:36 UTC

Thank you for the link! ^^

Demetrius, 2010-09-12 21:44:27 UTC

+1

Demetrius, 2010-09-12 21:05:01 UTC

Wiktionary is hard to edit for an average user.

It’s hard to find a balance between a dictionary that is computer-parsable and one that is easy to edit for an average human being. Wiktionary is *much* *more* *complicated* than Tatoeba.

Also, although it’s exportable and parseable, I haven’t seen any program that presents the exported data in the form of a bilingual dictionary.

All in all, I believe the Wiki engine is not fit for creating dictionaries.



I think we’ll run into the problem of a dictionary later:
1. Now we have some tags [verb_of_motion, Genitive] that would fit better as tags for words, not for sentences. => We need tags for words.
2. We can’t tag all the words, or force users to do it, since it’s too much work. => We need a morphology analyser.
3. Morphology data about the words needs a dictionary. Wiktionary is hard to edit for an average user and rarely exported. => We need something more lightweight.

So I believe one day something like Tatoeba dictionary will emerge.

Also, there is a problem: which language edition of Wiktionary should we choose? The explanations are different, but the translations in all Wiktionaries in fact duplicate each other.

Demetrius, 2010-09-12 20:54:37 UTC

By the way, there is a secret copy of Tatoeba. ;) It is blue.

Demetrius, 2010-09-12 11:01:30 UTC

Well, it depends on the language.

Arabic script for Uyghur shows all the vowels.