
Should we keep proverbs offending other nations in the database? And how should these be tagged?
We already have some Ukrainian proverbs about Russians in the database (something like “You can ward off the devil by crossing yourself, but you can’t ward off a Moskal”). ^^

> Good point. But transliteration can't handle
> letter pair --> single letter correspondence?
Of course it can. I mean that usually nj = њ, but in the word injekcija there is a morpheme boundary between n and j, so it should be kept as нј.
> Unless there are instances of
> "nj" that are нј and not њ,
> but this is never the case.
Инјекција!
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
In fact, you don't need a dictionary at all for Latin>Arabic. Only for zh/j and ' if people omit these.
For all the other directions you do need a dictionary.
> The Latin "ng" could be transliterated as either.
No, as far as I know. Latin requires breaking these with ': ng and n'g.
> I think there are other cases,
> as well (personally, I don't
> much like the Latin Uighur...)
But it's indeed very easy to process.
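The nj point above can be sketched in a few lines of Python. This is only an illustration, not the transliterator Tatoeba uses: the function name `to_cyrillic` and the one-word `EXCEPTIONS` list are hypothetical, and the sketch works on lowercase words only. Letter pairs are tried before single letters, and the exception list is consulted first for words where the pair spans a morpheme boundary.

```python
# Letter pairs that normally map to one Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Single-letter correspondences (lowercase only, for brevity).
SINGLES = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}

# Hypothetical exception list: words where "nj" is n + j, not a digraph.
EXCEPTIONS = {"injekcija": "инјекција"}


def to_cyrillic(word):
    """Transliterate one lowercase Serbian Latin word to Cyrillic."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    out = []
    i = 0
    while i < len(lower):
        pair = lower[i:i + 2]
        if pair in DIGRAPHS:
            # Pair found: emit one Cyrillic letter, consume two Latin ones.
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Unknown characters are passed through unchanged.
            out.append(SINGLES.get(lower[i], lower[i]))
            i += 1
    return "".join(out)
```

So the pair-to-letter mapping itself is trivial; the only part that needs real data is the exception list, and it would have to cover inflected forms as well.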

OK, that is a good idea. I'll send my (imperfect) Uzbek transliterator on Monday.

> So far, I've not figured out how to remove a tag from a sentence
The algorithm is:
a) add the same tag to any other sentence (temporarily)
b) copy the delete link and replace the sentence number
c) delete your temporary tag
This works only because of a bug. Use it only when you’re absolutely sure it’s applicable.

Some more thoughts about tags.
BTW, IMHO tags should be as general as possible: not “2nd Person Formal”, but “2nd Person”, “Formal”. This makes them more useful for automated processing.

@FeuDRenais
Also, there was a proposal to write all tags lowercase:
http://tatoeba.org/eng/wall/sho...8#message_1588

It's easy, but I'm too lazy to do this. ^^ I've started working on Uzbek, maybe I'll finish it someday...
Also, sysko has to do transliteration caching. It will allow making transliteration more time-intensive (dictionary searches...).
> one-to-one letter correspondence between
> the alternative alphabets exists (e.g. Serbian).
But injekcija = инјекција, not ињекција. So you need a dictionary when transcribing from Latin...
> Uighur
Latin > Arabic is the easiest (unless people omit ' and don't differentiate j/zh ^^).
Since we have no Latin Uyghur sentences, there is no rush.
Others require a LARGE dictionary of proper names, since Arabic has no capital letters. Cyrillic requires a dictionary of Russian loanwords.
> and - I think, but Demetrius could confirm - Tatar
No, this one is tricky in both directions. The hardest part is q and ğ. Usually к = k and г = g with front vowels, к = q and г = ğ with back ones.
But Arabic words break vowel harmony:
нигъмәт — niğmət ‘dish’ (ğ is marked as гъ), сәгать — səğət ‘clock’ (ğ is marked by changing vowel letter, the vowel quality is marked by the soft sign), сәгатем — səğətem ‘my clock’ (ğ is marked by a vowel letter w/out a soft sign)
Russian loanwords break vowel harmony in another way: they force K and G even near back vowels.
Also, there are W and V:
В = V (вагон — vagon ‘carriage’), W (авыл — awıl ‘village’)
У = U (су — su ‘water’), W (тау = taw ‘mountain’)
Ү = Ü (күрү — kürü ‘see’), W (Мəскəү — Məskəw ‘Moscow’)

Please excuse my suspiciousness. ^^

BTW, he didn't add any genuinely Bulgarian sentences. ^_^

Gruzilkin (http://tatoeba.org/eng/sentence...ser/Gruzilkin) has added some Bulgarian sentences marked as Russian. It's either an auto-detection failure or intentional (he brought the number of Bulgarian sentences to 500 this way).
Can we reassign the language of these in a batch?

Some preventive measures?
timsa (http://tatoeba.org/eng/user/profile/timsa) has added lots of sentences that are either not translations (but may look like them to a learner), or too impolite/vulgar, or too colloquial...
I think we need some rules regarding user behaviour.

Please do the same for non-breaking space [ ]. I sometimes use it to prevent dashes (—) from being moved to the next line; it should be treated in the same way as the ordinary space for search purposes.
There are lots of other spaces, but I’m not sure anyone has ever used these on Tatoeba: en quad [ ], em quad [ ], en space [ ], em space [ ], 3-per-em space [ ], 4-per-em space [ ], 6-per-em space [ ], figure space [ ], medium mathematical space [ ], punctuation space [ ], thin space [ ], hair space [ ], zero-width space []
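A minimal sketch of how the search side could fold these characters into the ordinary space before indexing or matching. The function name `normalize_spaces` is hypothetical, and dropping the zero-width space instead of turning it into a visible space is my assumption, not something the search is known to do.

```python
# Code points of the spaces listed above, in the same order:
# no-break space, en quad, em quad, en space, em space, 3-per-em,
# 4-per-em, 6-per-em, figure space, medium mathematical space,
# punctuation space, thin space, hair space.
SPACE_CHARS = (
    "\u00a0\u2000\u2001\u2002\u2003\u2004\u2005\u2006"
    "\u2007\u205f\u2008\u2009\u200a"
)

# Map every listed space to the ordinary space.
_TABLE = {ord(ch): " " for ch in SPACE_CHARS}
# Assumption: the zero-width space (U+200B) is removed, since it has no width.
_TABLE[0x200B] = None


def normalize_spaces(text):
    """Return text with the special spaces replaced by an ordinary space."""
    return text.translate(_TABLE)
```

Applied to both the stored sentence and the query, this would make a sentence typed with a no-break space before a dash match a query typed with a plain space.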

Actually, I’m not sure it’s the best way of sorting sentences. Shorter sentences tend to be stranger, as there is little context.

This seems to happen with all the batch imports. The Ukrainian proverbs from Shtoota are contributed by sysko and owned by me (422069). I don't see any problem with this.
After all, such things come from collections not necessarily compiled by the people who suggested importing them, and IMO there’s nothing wrong with admins being higher in the contributors table. They contribute in other ways too.

I suggest not putting the tag on other people's sentences, or at least writing something about it in the comments. I was very surprised to find my Russian sentence tagged 'needs native check' recently (http://tatoeba.org/sentences/show/451147).

I don’t think the number of corrections can be equated with unreliability.
In fact, I think what we need is a rating system, which would allow users to vote for the translations they believe to be good.
By the way...
> check if it matches the former version regardless
> of punctuation and spaces (easy!)
Punctuation rules and spaces are also very important. A comma in the wrong place can change the meaning of a sentence more than a misspelled word.

Yet another feature request. Can we add several tags at a time? I.e. «familiar;said to female». It should be relatively easy to implement, but it can save some time and semicolons are very unlikely to appear in tags anyway.
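To show how little is involved, here is a sketch of the splitting step in Python. The function name `split_tags` is made up for illustration, and the actual tag form handler may well be written differently.

```python
def split_tags(raw):
    """Split a semicolon-separated tag string into a list of tags.

    Surrounding whitespace is trimmed, empty pieces are skipped and
    repeated tags are kept only once, in their original order.
    """
    tags = []
    for part in raw.split(";"):
        tag = part.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags
```

Each resulting tag would then go through the same code path that a single tag goes through today.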

BTW, if I'm not mistaken, Mnemosyne expects CSV fields to be Text<tab>Translation, which is different from what Tatoeba has.

a) This is counter-intuitive for many languages: ‘cmn’ for Chinese MaNdarin, ‘epo’ for EsPerantO, ‘kat’ for Kartuli, ‘nob’ for Norwegian Bokmål, ‘nno’ for Norwegian Nynorsk, ‘ron’ for Romanian
b) The icons will be harder to distinguish (as they’re just black-on-white text, unlike flags)
c) Latin script is not neutral. ;) At least not more neutral than flags.

I like this initiative, but its aims seem to be different from what we need.
It’s just one symbol that is intended to mean “Select language” in different colours and sizes.