Wall (6,960 threads)
[@sysko and anyone interested]
Uzbek script-switching code is here:
http://uyghur.webatu.com/uzb/uzbek_script.zip
The Cyrillic>Latin should work fine, Latin>Cyrillic may have problems with Russian loanwords. Also, it may be slow, since it’s a simple PHP str_replace w/ 2 arrays.
A simple page for anyone to toy with the transliteration:
http://uyghur.webatu.com/uzb/
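For anyone who doesn't want to unpack the zip, the "str_replace w/ 2 arrays" idea can be sketched roughly like this. It is a Python sketch, not the PHP code from the archive; the names `CYR_TO_LAT`, `LAT_TO_CYR` and `replace_all` are made up for illustration, and the table is only a small, incomplete sample of the Uzbek alphabet.

```python
# Illustrative subset of Uzbek Cyrillic -> Latin pairs (NOT the full alphabet).
CYR_TO_LAT = [
    ("ш", "sh"), ("ч", "ch"),
    ("а", "a"), ("б", "b"), ("и", "i"), ("к", "k"), ("м", "m"),
    ("о", "o"), ("р", "r"), ("с", "s"), ("т", "t"), ("ҳ", "h"),
]

# Reverse table. Digraphs are sorted first so that "sh" is not read as s + h.
LAT_TO_CYR = sorted(
    ((lat, cyr) for cyr, lat in CYR_TO_LAT),
    key=lambda pair: -len(pair[0]),
)


def replace_all(text, pairs):
    """Apply every (source, target) replacement in order, one pass per pair."""
    for source, target in pairs:
        text = text.replace(source, target)
    return text


print(replace_all("китоб", CYR_TO_LAT))
print(replace_all("shahar", LAT_TO_CYR))
```

Each pair costs a full pass over the sentence, which is probably where the slowness comes from. The reverse direction also shows why loanwords are a problem: a plain table can't tell when a letter sequence should be treated differently.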
There are about 26 sentences with the 'Bosnian' tag. Is that enough for a flag?
Yes. Especially because there are going to be lots, lots more...
Are there any Bosnian and Croatian sentences NOT matching the Serbian ones?.. :o
Nope :-)
I add all three versions together when I add Bosnian/Croatian.
We really need the flags before the duplicate-script removal starts eliminating Bosnian sentences as duplicates of the Croatian... ;-)
(just kidding, I differentiated all the flags)
(Croatian as well, please)
Quick request.
Could you add a link from the sentence annotations page of an example to the example itself?
e.g.
http://tatoeba.org/eng/sentence...ns/show/118697
should have a link to
http://tatoeba.org/eng/sentences/show/118697
It seems silly that this hasn't been brought up before, but why not add a note about the need for periods and capitalization? I think a lot of new users are just not aware of this point until they get commented on.
We'll be processing sentences to automatically add missing periods, capital letters and fix other typography issues. Sacredceltic had actually brought this up already.
But it's true we can add a note about punctuation and capital letters. I'll append it to the red warning message below the translation form.
BTW, we should be careful when adding sentences with periods.
Sometimes there are non-standard 1337-like ‘punctuation’ like "!!1" in the end.
I’m not sure whether we should keep such sentences though...
I think non-standard punctuation is appropriate when the rest of the example is of the same style. (This applies to the ♪'s used to end some Japanese sentences).
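To make the discussion concrete, here is a rough Python sketch of what such automatic processing could look like while leaving deliberately styled endings alone. It is only an assumption about how one might do it, not the script that will actually be run; `tidy`, `TERMINAL` and `NONSTANDARD_END` are hypothetical names.

```python
import re

# Characters treated as an acceptable sentence ending (includes the Japanese "♪").
TERMINAL = ".!?…。！？♪"

# 1337-style endings such as "!!1" or "!!11!": leave these sentences untouched.
NONSTANDARD_END = re.compile(r"[!?]+1+[!?1]*$")


def tidy(sentence):
    """Add a missing final period and initial capital, unless the ending is 'styled'."""
    s = sentence.strip()
    if not s or NONSTANDARD_END.search(s):
        return s
    if s[-1] not in TERMINAL:
        s += "."
    if s[0].islower():
        s = s[0].upper() + s[1:]
    return s
```

Closing quotes, brackets and scripts without capital letters would all need more care than this.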
I've got ~3k sentences with the period at the beginning -_-' (help, plz?)
Considering how long we've known each other, I think I can do it for you :p
Can you give me the id of a sentence with the period in the correct position and one with it in the wrong position? In the database I'm a bit confused about how Arabic script is displayed.
thank you so much sysko
period at the beginning:
http://tatoeba.org/eng/sentences/show/370561
period at the end:
http://tatoeba.org/eng/sentences/show/370569
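A fix along these lines could be sketched as follows, assuming the stray period really is the first character of the stored string (which is how right-to-left text is normally stored). `move_leading_period` is a hypothetical helper, not an existing Tatoeba script.

```python
def move_leading_period(sentence):
    """Move a single leading '.' to the end of the sentence."""
    s = sentence.strip()
    # Skip leading ellipses ("...") and sentences that already end with a period.
    if s.startswith(".") and not s.startswith("..") and not s.endswith("."):
        return s[1:].lstrip() + "."
    return s
```

It would be worth running it on a handful of sentences and checking the display before touching all ~3k.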
It sounds like a good idea to me.
Thursday WWWJDIC examples status update.
30 records deleted.
13 records added.
I've been doing a bit of thinking about tags. So far, I've not figured out how to remove a tag from a sentence and browsing the tags page,
http://tatoeba.org/eng/tags/view_all
there seems to have been quite a proliferation since they were introduced. Now that we have a bit of experience with these, it might be a good idea to figure out where we want to take them and how to structure them for that purpose.
I grabbed the list of tags and sorted them out a bit at
http://martin.swift.is/tatoeba/tags.html
A few thoughts:
1) There seem to be a number of empty and duplicate tags. Can Trang or Sysko easily delete these?
2) The @-rule seems to be a good one but some of the tags should be merged (sentences with the native-check tag should probably be orphaned so that a sufficiently proficient speaker can adopt them).
3) The biggest group of tags (the way I sorted them) is the quotations. Seeing how a sentence can just as well be about a person as a quote by that person, I think the "by-" and "from-" prefixes are useful for both clarification and sorting.
4) Tag language and script. I'm as much of an l10n/i18n freak as the next person, but seeing how that will be sorted out in the near(?) future (there was a post about this at some point recently, wasn't there?) I think we should translate/transliterate the non-English ones and stick with that until we can translate tags. We could possibly include the extended Latin character set but even that isn't necessary for the current purpose.
5) Given the sheer number of tags, it might be nice to have these split up along some lines. How that is done is going to be pretty subjective, but I'm sure we can hash something out. Is there a way to add a field on the tags in the database to categorise them?
6) We might want to consider tossing out some of the tags. 'Fruit' may be fine, but 'apples' just seems a bit useless given the search feature.
Finally, I think the tags can be a fantastic tool, but so far they seem a bit disorganised and unwieldy to be used on a mass scale. Some statistics on how many sentences are tagged and how many sentences each tag has would be very interesting to gauge how people are using them.
> (sentences with the native-check tag should probably be orphaned so that a sufficiently proficient speaker can adopt them)
I disagree on this point. There's a reason why it's "Needs Native Check" (rather than "Needs Native Parent"). The creator of the phrase should keep the phrase, and there's no more work in checking than in parenting (in fact, parenting would involve the check anyway).
What's the point in parenting a sentence if you're not confident that it's correct?
There *is* slightly more work (and arguably less incentive) in leaving a note on someone's sentence and for them to respond, than for a "checker" to simply adopt the sentence and fix it themselves.
(Oh, and I don't think the tags system is tried and trusted enough to warrant an appeal to tradition.)
Leaving a comment lets a non-native owning it learn the correct variant or be confident in their translations.
There are multiple points in doing so:
1) The learning principle. A native speaker correcting it serves as a lesson. One side educates, and the other rectifies a certain mistake and probably won't repeat it. If you go for pure efficiency, then you annihilate this aspect, which, IMO, would be a pity. You'd probably want an environment where people can grow. That involves making mistakes and learning.
2) People grow attached to sentences. Absurd, but we do. They're often a reflection of a person's wit, identity, or interests. And if you're only, say, 95% sure of the translation and would like a check, seeing it orphaned and then with another person's name on it is... well, strange.
3) The native check is mostly to make sure that the grammar and the naturalness of the sentence don't have any issues, while the originator of the sentence has a responsibility towards all the translations that it's linked to, and how the sentence fits in on the whole. In this sense, it *is* more work. The person adopting the sentence has to be able to guarantee that all the translations remain valid when he fixes it.
@FeuDRenais
Also, there was a proposal to write all tags lowercase:
http://tatoeba.org/eng/wall/sho...8#message_1588
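Whether or not tags end up lowercase, finding the ones that differ only by case is easy to script. A minimal Python sketch (the function name `case_duplicates` and the sample tags in the note are made up):

```python
from collections import defaultdict


def case_duplicates(tag_names):
    """Group tag names that are identical once case and outer spaces are ignored."""
    groups = defaultdict(set)
    for name in tag_names:
        groups[name.strip().casefold()].add(name)
    # Keep only the groups that really contain more than one spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

For example, `case_duplicates(["Fruit", "fruit", "Proverb"])` reports "Fruit" and "fruit" as one group, which would be a candidate for merging.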
> So far, I've not figured out how to remove a tag from a
> sentence
If you are not a moderator, you can only remove tags that you have added. You can't remove other people's tags.
> Now that we have a bit of experience with these, it might
> be a good idea to figure out where we want to take them
> and how to structure them for that purpose.
Thanks for taking the time ^^ CK had also tried to sort the tags: http://a4esl.com/temporary/tatoeba/
(cf. "Some of the Tags That I've Found")
> 1) There seem to be a number of empty and duplicate tags.
> Can Trang or Sysko easily delete these?
Yes.
> 4) Tag language and script. [...] I think we should
> translate/transliterate the non-English ones and stick
> with that until we can translate tags.
It's not always easy to translate, and sometimes a non-English tag may be more appropriate/accurate. For instance if you tag a French sentence with "preterite", it can be ambiguous whether "preterite" refers to the "passé simple" or to the "imparfait". So if you're going to put a tag that refers to a tense, it's probably better to use the name in the language of the sentence. That's one case I can think of but there are probably other cases...
I still agree though, that we should use English tags by default, until we can translate tags.
> 5) Is there a way to add a field on the tags in the
> database to categorise them?
At the moment, no.
> 6) We might want to consider tossing out some of the
> tags. 'Fruit' may be fine, but 'apples' just seems a bit
> useless given the search feature.
I agree.
> Finally, I think the tags can be a fantastic tool, but so
> far they seem a bit disorganised and unwieldy to be used
> on a mass scale.
I agree. We do have more plans for tags, but they're not a priority yet...
> Some statistics on how many sentences are tagged and how
> many sentences each tag has would be very interesting to
> gauge how people are using them.
Actually if you want to do some stats, you can use this file (it's not part of the official downloads files yet):
http://tatoeba.org/files/downloads/tags.csv
The fields are: sentence_id, tag_name
It's exported every week.
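For the statistics asked about, something like the following Python sketch should be enough. Only the two fields (sentence_id, tag_name) come from the description above; the tab delimiter is an assumption (change it if the export uses commas), and `tag_stats` is a made-up name.

```python
import csv
from collections import Counter


def tag_stats(lines, delimiter="\t"):
    """Return (number of distinct tagged sentences, Counter of sentences per tag)."""
    per_tag = Counter()
    tagged_sentences = set()
    # QUOTE_NONE: treat quote characters inside tag names as ordinary text.
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sentence_id, tag_name = row[0], row[1]
        per_tag[tag_name] += 1
        tagged_sentences.add(sentence_id)
    return len(tagged_sentences), per_tag
```

Usage would be along the lines of `with open("tags.csv", encoding="utf-8", newline="") as f: total, per_tag = tag_stats(f)`, after which `per_tag.most_common(20)` lists the most used tags.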
> If you are not a moderator, you can only remove tags that you have added. You can't remove other people's tags.
Fair enough. :-)
> Thanks for taking the time ^^ CK had also tried to sort the tags: http://a4esl.com/temporary/tatoeba/
> (cf. "Some of the Tags That I've Found")
Interesting.
>> 1) There seem to be a number of empty and duplicate tags.
>> Can Trang or Sysko easily delete these?
>
> Yes.
OK, then let's start by deleting the ones in:
<http://martin.swift.is/tatoeba/tags.html#empty>
and then look at merging duplicates.
Tatoeba, by the way, doesn't seem to show any difference between an empty and non-existent tag.
>> 4) Tag language and script. [...] I think we should
>> translate/transliterate the non-English ones and stick
>> with that until we can translate tags.
>
> It's not always easy to translate, and sometimes a non-English tag may be more appropriate/accurate.
I reckon the best solution is to use whatever term would be used on the English translation once we're able to translate tags.
>> 6) We might want to consider tossing out some of the
>> tags. 'Fruit' may be fine, but 'apples' just seems a bit
>> useless given the search feature.
>
> I agree.
OK, we can have a look at this once we've gotten rid of some of the empty and duplicate categories.
>> Some statistics on how many sentences are tagged and how
>> many sentences each tag has would be very interesting to
>> gauge how people are using them.
>
> Actually if you want to do some stats, you can use this file (it's not part of the official downloads files yet):
> http://tatoeba.org/files/downloads/tags.csv
Great! Thanks.
I've deleted the empty tags this morning (GMT+1).
If I have time, I will see about distinguishing a non-existent tag from an empty one (in fact there is a difference: a non-existent tag shows an empty page, while an empty tag at least shows the name of the tag in the title :p )
Next release (tomorrow) will show tags sorted by number of tags, and autocompletion when adding a tag; this should make them a bit more useful and avoid most mistyped / duplicate tags.
> I've deleted the empty tags this morning (GMT+1).
Great! Thanks.
> Next release (tomorrow) will show tags sorted by number of tags
I'm actually not convinced that this is terribly useful. My impression is that one would use the tags to find a particular topic (in which case, alphabetical ordering would make the most sense) rather than just any popular one (as one might when visiting a blog; which is why things like tag-clouds are appropriate there).
Seeing how the great number of tags would benefit from categorisation, it might actually be helpful to have a semi-automated list of tags... Tatoeba is currently written in PHP, right?
Seeing how many sentences are filed under a tag would be useful, though.
> and autocompletion when adding a tag
This will be a fantastic addition!
Yep, it's just that I added the count while adding the autocompletion, so that tag names with more tagged sentences appear first in the suggestions. For the moment my personal life is a bit busy; categorisation would certainly be great, but I haven't found the time to do it yet ^^
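The ranking described here (prefix match, most used tags first) can be sketched in a few lines of Python. This is only an illustration of the idea, not the code running on the site; `suggest` and the sample counts in the note are made up.

```python
def suggest(prefix, tag_counts, limit=5):
    """Return up to `limit` tag names starting with `prefix`, most used first."""
    wanted = prefix.casefold()
    matches = [
        (name, count)
        for name, count in tag_counts.items()
        if name.casefold().startswith(wanted)
    ]
    # Highest count first; ties are broken alphabetically.
    matches.sort(key=lambda item: (-item[1], item[0].casefold()))
    return [name for name, _ in matches[:limit]]
```

With counts like `{"fruit": 12, "French": 40, "from-Hamlet": 3}`, typing "fr" would offer "French" before "fruit" before "from-Hamlet". The counts are used only to order the suggestions, so the tags page itself could still be listed alphabetically.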
Some more thoughts about tags.
BTW, IMHO tags should be as general as possible: not “2nd Person Formal”, but “2nd Person”, “Formal”. This makes them more useful for automated processing.
+1
> So far, I've not figured out how to remove a tag from a sentence
First step - become a moderator.
> So far, I've not figured out how to remove a tag from a sentence
The algorithm is:
a) add the same tag to any other sentence (temporarily)
b) copy the delete link and replace the sentence number
c) delete your temporary tag
This is a bug. Use only when you’re absolutely sure it’s applicable.
Am I doing something wrong or is there no such function?
I want to search for English sentences that are NOT translated into Russian (so I could add translations), and I don't see how to do it.
Click on the English flag on the top right corner of the main page. This will give you ALL the English sentences available. Then on the right side you'll see options to narrow down the list. You want to choose the one "Show Sentences Not Directly Translated Into...", and then "Russian". Voilà.
As a side note, you should probably select "Show Translations in...", then "All Languages". Sometimes, there's already a good English-to-Russian translation, but it just hasn't been linked. For efficiency reasons, it might be better to just request that someone with the power to link link the two.
Question Regarding Transliterations:
How much work, and what exact steps, are needed to set up a transliteration system for a specific language? I ask because we have a number of languages now that can be written in multiple alphabets. For many of these, the task seems to be a very simple one, as a one-to-one letter correspondence between the alternative alphabets exists (e.g. Serbian). For some, it is only possible to do it in one way but not the other (unless a dictionary is available), but again, the task should not be a difficult one as the majority of the current entries are inputted on the good side of the one-way (e.g. Uighur, Uzbek, and - I think, but Demetrius could confirm - Tatar).
So, what would one have to do to realize this?
It's easy, but I'm too lazy to do this. ^^ I've started working on Uzbek, maybe I'll finish it someday...
Also, sysko has to do transliteration caching. It will allow making transliteration more time-intensive (dictionary searches...).
> one-to-one letter correspondence between
> the alternative alphabets exists (e.g. Serbian).
But injekcija = инјекција, not ињекција. So you need a dictionary when transcribing from Latin...
> Uighur
Latin > Arabic is the easiest (unless people omit ' and don't differentiate j/zh ^^).
Since we have no Latin Uyghur sentences, there is no rush.
Others require a LARGE dictionary of proper names, since Arabic has no capital letters. Cyrillic requires a dictionary of Russian loanwords.
> and - I think, but Demetrius could confirm - Tatar
No, this one is tricky in both directions. The hardest part is q and ğ. Usually к = k and г = g w/ front vowels, к = q and г = ğ with back ones.
But Arabic words break vowel harmony:
нигъмәт — niğmət ‘dish’ (ğ is marked as гъ), сәгать — səğət ‘clock’ (ğ is marked by changing vowel letter, the vowel quality is marked by the soft sign), сәгатем — səğətem ‘my clock’ (ğ is marked by a vowel letter w/out a soft sign)
Russian loanwords break vowel harmony in another way: they force K and G even near back vowels.
Also, there are W and V:
В = V (вагон — vagon ‘carriage’), W (авыл — awıl ‘village’)
У = U (су — su ‘water’), W (тау = taw ‘mountain’)
Ү = Ü (күрү — kürü ‘see’), W (Мəскəү — Məskəw ‘Moscow’)
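To show how far the plain front/back rule for к and г gets, here is a rough Python sketch. It covers only the "usual" rule plus the explicit гъ / къ marking mentioned above; the vowel sets and the name `velar_to_latin` are assumptions for illustration, and Russian loanwords, which need a dictionary, are not handled.

```python
# Assumed vowel classes for Tatar Cyrillic (illustration only).
BACK_VOWELS = set("аоуы")
FRONT_VOWELS = set("әөүеиэ")


def velar_to_latin(word, index):
    """Pick k/q or g/ğ for the letter к or г at word[index]."""
    word = word.lower()
    letter = word[index]
    if letter not in ("к", "г"):
        raise ValueError("expected к or г at the given index")
    back = "q" if letter == "к" else "ğ"
    front = "k" if letter == "к" else "g"
    # An explicit hard sign right after the consonant marks the back variant.
    if word[index + 1:index + 2] == "ъ":
        return back
    # Otherwise look at the following letters first, then backwards.
    for ch in word[index + 1:] + word[:index][::-1]:
        if ch in BACK_VOWELS:
            return back
        if ch in FRONT_VOWELS:
            return front
    return front
```

For native words like кар or гөл this gives q and g as expected. A loanword such as вагон comes out with ğ instead of g, which is exactly the kind of case that needs a word list.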
> But injekcija = инјекција, not ињекција. So you need a dictionary when transcribing from Latin...
Good point. But transliteration can't handle letter pair --> single letter correspondence? That would be much easier than a dictionary. Unless there are instances of "nj" that are нј and not њ, but this is never the case. Anyway, there are nearly no Latin Serbian submissions up to now, and so a one-way from the Cyrillic (if that's what it comes down to), is perfectly okay, IMO.
> Uighur
I disagree here. You'd need a big dictionary for Latin to Arabic. One particular example is n+g and ng (نگ and ڭ). The Latin "ng" could be transliterated as either. I think there are other cases, as well (personally, I don't much like the Latin Uighur...)
Arabic to Latin WITHOUT proper names is perfectly all right, in my opinion. It's not perfect, but it would still make a world of difference and people would usually be able to figure out what should be capitalized anyway.
> Good point. But transliteration can't handle
> letter pair --> single letter correspondence?
Of course it can. I mean that usually nj = њ, but in the word injekcija it's a morpheme boundary and it should be retained нј.
> Unless there are instances of
> "nj" that are нј and not њ,
> but this is never the case.
Инјекција!
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
In fact, you don't need a dictionary at all for Latin>Arabic. Only for zh/j and ' if people omit these.
For all the other directions you need a dictionary.
> The Latin "ng" could be transliterated as either.
No, as far as I know. Latin requires breaking these with ': ng and n'g.
> I think there are other cases,
> as well (personally, I don't
> much like the Latin Uighur...)
But it's indeed very easy to process.
> Инјекција!
I should have said "this is never the case, with the exception of a few foreign words". A small dictionary could be made for those, but the loss isn't great if the transliteration doesn't handle them properly.
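The "digraphs plus a small exception dictionary" approach could look roughly like this in Python. It is a sketch under simplifying assumptions: lowercase input only, whole-word exceptions only (so inflected forms of an exception would still need their own entries), and all names (`DIGRAPHS`, `SINGLE`, `EXCEPTIONS`, `to_cyrillic`) are made up.

```python
import re

# Letter pairs that normally map to one Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}

# Words where "nj" spans a morpheme boundary and must stay as two letters.
EXCEPTIONS = {"injekcija": "инјекција"}


def word_to_cyrillic(word):
    """Transliterate one lowercase word, checking the exception list first."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            out.append(SINGLE.get(word[i], word[i]))
            i += 1
    return "".join(out)


def to_cyrillic(text):
    """Transliterate every word in the text, leaving punctuation as it is."""
    return re.sub(r"\w+", lambda m: word_to_cyrillic(m.group(0)), text)
```

With this, "konj" gets the single letter њ while "injekcija" keeps нј. Anything missing from the exception list silently gets the digraph reading, which matches the "loss isn't great" trade-off above.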
> No, as far as I know. Latin requires breaking these with ': ng and n'g.
In that case, it's fine.
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
I meant Arabic to Latin.
Here is what I think I can do:
have an automatic transliteration tool for each language;
each time a sentence is added or updated, or if its transliteration is missing, we call the tool and store the result in a dedicated table of the database;
if the transliteration already exists, we just retrieve it from the database;
and maybe add a special page for trusted users, giving them the possibility to edit the stored transliteration.
This way:
your dictionary will not need to handle eeeeevery case. As soon as it handles the most common cases we can deploy it, even if some particular sentences will need a manual edit. This way we can complete the dictionary step by step and make the feature available sooner.
And when a transliteration is edited it will be flagged, so that if for some reason we update the tool and regenerate the transliterations, we will not erase the manually edited ones (as we can suppose they're right).
I think this is the best thing we can do, as anyway for a lot of languages, making a 100% correct transliteration tool is a dream (even with long-running tools such as MeCab and Adso we reach, I think, 90%).
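The plan above can be sketched with a small Python example, using an in-memory SQLite table as a stand-in for the real database. The table layout, column names and function names are all assumptions for illustration; Tatoeba's actual schema and PHP code will differ.

```python
import sqlite3


def open_store():
    """Create the (hypothetical) table holding cached transliterations."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transliterations ("
        " sentence_id INTEGER PRIMARY KEY,"
        " text TEXT NOT NULL,"
        " manually_edited INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def get_transliteration(db, sentence_id, sentence, tool):
    """Return the stored transliteration, generating and storing it if missing."""
    row = db.execute(
        "SELECT text FROM transliterations WHERE sentence_id = ?", (sentence_id,)
    ).fetchone()
    if row:
        return row[0]
    text = tool(sentence)
    db.execute(
        "INSERT INTO transliterations (sentence_id, text) VALUES (?, ?)",
        (sentence_id, text),
    )
    return text


def edit_transliteration(db, sentence_id, text):
    """Store a trusted user's correction and flag it as manually edited."""
    db.execute(
        "INSERT OR REPLACE INTO transliterations"
        " (sentence_id, text, manually_edited) VALUES (?, ?, 1)",
        (sentence_id, text),
    )


def regenerate(db, sentences, tool):
    """Re-run the tool on all sentences, skipping manually edited entries."""
    for sentence_id, sentence in sentences.items():
        row = db.execute(
            "SELECT manually_edited FROM transliterations WHERE sentence_id = ?",
            (sentence_id,),
        ).fetchone()
        if row and row[0]:
            continue  # a human fixed this one; assume it is right
        db.execute(
            "INSERT OR REPLACE INTO transliterations"
            " (sentence_id, text, manually_edited) VALUES (?, ?, 0)",
            (sentence_id, tool(sentence)),
        )
```

Here `tool` is any function from sentence text to transliteration, so a slow dictionary-based tool runs once per sentence rather than on every page view.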
OK, that is a good idea. I'll send my (imperfect) Uzbek transliterator on Monday.
For some languages, Demetrius has started to work on them; you can check with him what he has already done, what is hard to do, etc.
For the other ones, just give me a letter-to-letter / word-to-word transliteration file (like origin[tab]transliterate) and I think it will not take me long to integrate it into Tatoeba.
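For anyone preparing such a file, this Python sketch shows one way an origin[tab]transliterate table might be loaded and applied. It is an assumption about how the file could be consumed, not the actual integration; `load_table` and `transliterate` are hypothetical names.

```python
import re


def load_table(lines):
    """Parse 'origin<TAB>transliteration' lines into a dictionary."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        origin, translit = line.split("\t", 1)
        if origin:
            table[origin] = translit
    return table


def transliterate(text, table):
    """Replace every table entry found in the text, longest entries first."""
    if not table:
        return text
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: table[match.group(0)], text)
```

Trying longer entries first lets whole-word entries override the letter-by-letter ones, so a single file can hold both the alphabet and the exceptions.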
I've been seeing if I'll just grow accustomed to the new contribute setup but so far I haven't.
The random sentences had the drawback that occasionally you hit sentences you'd seen before. Getting the latest sentences has the advantage that you more easily spot multiple sentences that can be linked to your translation, but increases the times one bumps into sentences one's already seen (and not translated for whatever reason -- the exclude direct translations feature is great, by the way).
Another advantage of getting sentences in the order they were contributed in is that you translate whole batches of similar sentences, making search results more well rounded even if they cover fewer topics.
How about flipping the order on the contribute page around (or adding the option), starting with the oldest sentences? That way one can work one's way through the batch, filtering out already translated sentences and starting after the ones one has chosen not to translate.
As a bonus, these are generally orphaned sentences and often need more attention than the ones active contributors are adding.
You can get orphan sentences by setting "translated into none" :)
For the "several random sentences" page, I will bring it back soon. For the moment this page is slow as hell and was really badly designed internally, but I've figured out how to make it fast, so it will soon be as before.
The page which shows all the sentences in a language shows the newest first because this favours collaboration: when I add a sentence, it gets more chance of being translated than an old one because it will appear on the first page for a while (depending on the language's activity; not sure "a while" will be very long for Russian or Esperanto :p). But you can get to the oldest ones by clicking on "last" :p (more seriously, yep, we can add a reverse-order button)
Not quite sure what you mean by "translated into none". By "orphaned" I mean the sentences that aren't owned by anyone.
Going to the last page is certainly a workaround but showing the oldest first would make browsing and finding where one left off much easier (as the reference point wouldn't be moving constantly).
The collaboration argument is a very good one, but it would still be nice to be able to choose to work on the back-log. :-)
Many thanks!
Would anyone be terribly upset if we got rid of one of these?
http://tatoeba.org/eng/sentences/show/222359
http://tatoeba.org/eng/sentences/show/222367
Not in this case - because the only difference is whether one word is in hiragana or in kanji.