Wall (6,005 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
I've been doing a bit of thinking about tags. So far, I've not figured out how to remove a tag from a sentence and browsing the tags page,
there seems to have been quite a proliferation since they were introduced. Now that we have a bit of experience with these, it might be a good idea to figure out where we want to take them and how to structure them for that purpose.
I grabbed the list of tags and sorted them out a bit at
A few thoughts:
1) There seem to be a number of empty and duplicate tags. Can Trang or Sysko easily delete these?
2) The @-rule seems to be a good one but some of the tags should be merged (sentences with the native-check tag should probably be orphaned so that a sufficiently proficient speaker can adopt them).
3) The biggest group of tags (the way I sorted them) is the quotations. Seeing how a sentence can just as well be about a person as a quote by that person, I think the "by-" and "from-" prefixes are useful for both clarification and sorting.
4) Tag language and script. I'm as much of an l10n/i18n freak as the next person, but seeing how that will be sorted out in the near(?) future (there was a post about this at some point recently, wasn't there?) I think we should translate/transliterate the non-English ones and stick with that until we can translate tags. We could possibly include the extended Latin character set but even that isn't necessary for the current purpose.
5) Given the sheer number of tags, it might be nice to have these split up along some lines. How that is done is going to be pretty subjective, but I'm sure we can hash something out. Is there a way to add a field on the tags in the database to categorise them?
6) We might want to consider tossing out some of the tags. 'Fruit' may be fine, but 'apples' just seems a bit useless given the search feature.
Finally, I think the tags can be a fantastic tool, but so far they seem a bit disorganised and unwieldy to be used on a mass scale. Some statistics on how many sentences are tagged and how many sentences each tag has would be very interesting to gauge how people are using them.
> (sentences with the native-check tag should probably be orphaned so that a sufficiently proficient speaker can adopt them)
I disagree on this point. There's a reason why it's "Needs Native Check" (rather than "Needs Native Parent"). The creator of the phrase should keep the phrase, and there's no more work in checking than in parenting (in fact, parenting would involve the check anyway).
What's the point in parenting a sentence if you're not confident that it's correct?
There *is* slightly more work (and arguably less incentive) in leaving a note on someone's sentence and for them to respond, than for a "checker" to simply adopt the sentence and fix it themselves.
(Oh, and I don't think the tags system is tried and trusted enough to warrant an appeal to tradition.)
Leaving a comment lets a non-native owning it learn the correct variant or be confident in their translations.
There are multiple points in doing so:
1) The learning principle. A native speaker correcting it serves as a lesson. One side educates, and the other rectifies a certain mistake and probably won't repeat it. If you go for pure efficiency, then you annihilate this aspect, which, IMO, would be a pity. You'd probably want an environment where people can grow. That involves making mistakes and learning.
2) People grow attached to sentences. Absurd, but we do. They're often a reflection of a person's wit, identity, or interests. And if you're only, say, 95% sure of the translation and would like a check, seeing it orphaned and then with another person's name on it is... well, strange.
3) The native check is mostly to make sure that the grammar and the naturalness of the sentence don't have any issues, while the originator of the sentence has a responsibility towards all the translations that it's linked to, and how the sentence fits in on the whole. In this sense, it *is* more work. The person adopting the sentence has to be able to guarantee that all the translations remain valid when he fixes it.
Also, there was a proposal to write all tags lowercase:
> So far, I've not figured out how to remove a tag from a
If you are not a moderator, you can only remove tags that you have added. You can't remove other people's tags.
> Now that we have a bit of experience with these, it might
> be a good idea to figure out where we want to take them
> and how to structure them for that purpose.
Thanks for taking the time ^^ CK had also tried to sort the tags: http://a4esl.com/temporary/tatoeba/
(cf. "Some of the Tags That I've Found")
> 1) There seem to be a number of empty and duplicate tags.
> Can Trang or Sysko easily delete these?
> 4) Tag language and script. [...] I think we should
> translate/transliterate the non-English ones and stick
> with that until we can translate tags.
It's not always easy to translate, and sometimes a non-English tag may be more appropriate/accurate. For instance if you tag a French sentence with "preterite", it can be ambiguous whether "preterite" refers to the "passé simple" or to the "imparfait". So if you're going to put a tag that refers to a tense, it's probably better to use the name in the language of the sentence. That's one case I can think of but there are probably other cases...
I still agree though, that we should use English tags by default, until we can translate tags.
> 5) Is there a way to add a field on the tags in the
> database to categorise them?
At the moment, no.
> 6) We might want to consider tossing out some of the
> tags. 'Fruit' may be fine, but 'apples' just seems a bit
> useless given the search feature.
> Finally, I think the tags can be a fantastic tool, but so
> far they seem a bit disorganised and unwieldy to be used
> on a mass scale.
I agree. We do have more plans for tags, but they're not a priority yet...
> Some statistics on how many sentences are tagged and how
> many sentences each tag has would be very interesting to
> gauge how people are using them.
Actually if you want to do some stats, you can use this file (it's not part of the official downloads files yet):
The fields are: sentence_id, tag_name
It's exported every week.
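For anyone who wants to try the statistics mentioned above, here is a minimal sketch in Python. It assumes the export is a plain text file with one `sentence_id<TAB>tag_name` pair per line; the tab delimiter and the function name `tag_stats` are assumptions, only the two field names come from the post above.

```python
import csv
import io
from collections import Counter

def tag_stats(lines):
    """Return (sentences per tag, number of distinct tagged sentences).

    `lines` is any iterable of text lines in the assumed
    sentence_id<TAB>tag_name layout.
    """
    per_tag = Counter()
    tagged = set()
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or malformed lines
        sentence_id, tag_name = row[0].strip(), row[1].strip()
        per_tag[tag_name] += 1
        tagged.add(sentence_id)
    return per_tag, len(tagged)

# Tiny made-up sample standing in for the weekly export.
sample = io.StringIO("1\tfruit\n2\tfruit\n2\tproverb\n")
per_tag, tagged_total = tag_stats(sample)
for name, count in per_tag.most_common():
    print(name, count)
print("tagged sentences:", tagged_total)
```

With the real file you would pass an open file object instead of the `io.StringIO` sample.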
> If you are not a moderator, you can only remove tags that you have added. You can't remove other people's tags.
Fair enough. :-)
> Thanks for taking the time ^^ CK had also tried to sort the tags: http://a4esl.com/temporary/tatoeba/
> (cf. "Some of the Tags That I've Found")
>> 1) There seem to be a number of empty and duplicate tags.
>> Can Trang or Sysko easily delete these?
OK, then let's start by deleting the ones in:
and then look at merging duplicates.
Tatoeba, by the way, doesn't seem to show any difference between an empty and non-existent tag.
>> 4) Tag language and script. [...] I think we should
>> translate/transliterate the non-English ones and stick
>> with that until we can translate tags.
> It's not always easy to translate, and sometimes a non-English tag may be more appropriate/accurate.
I reckon the best solution is to use whatever term would be used on the English translation once we're able to translate tags.
>> 6) We might want to consider tossing out some of the
>> tags. 'Fruit' may be fine, but 'apples' just seems a bit
>> useless given the search feature.
> I agree.
OK, we can have a look at this once we've gotten rid of some of the empty and duplicate categories.
>> Some statistics on how many sentences are tagged and how
>> many sentences each tag has would be very interesting to
>> gauge how people are using them.
> Actually if you want to do some stats, you can use this file (it's not part of the official downloads files yet):
I deleted the empty tags this morning (GMT +1h).
If I have time, I will look at distinguishing a non-existent tag from an empty one (in fact there is a difference: a non-existent tag shows an empty page, and an empty tag at least shows the name of the tag in the title :p )
The next release (tomorrow) will show tags sorted by number of tags, and autocompletion when adding a tag; this should make them a bit more useful and avoid most mistyped / duplicate tags.
> I deleted the empty tags this morning (GMT +1h).
> The next release (tomorrow) will show tags sorted by number of tags
I'm actually not convinced that this is terribly useful. My impression is that one would use the tags to find a particular topic (in which case, alphabetical ordering would make the most sense) rather than just any popular one (as one might when visiting a blog; which is why things like tag-clouds are appropriate there).
Seeing how the great number of tags would benefit from categorisation, it might actually be helpful to have a semi-automated list of tags... Tatoeba is currently written in PHP, right?
Seeing how many sentences are filed under a tag would be useful, though.
> and autocompletion when adding a tag
This will be a fantastic addition!
Yep, it's just that I added the count while adding the autocompletion, so that tag names with more uses appear first in the suggestions. For the moment my personal life is a bit busy; categorisation would certainly be great, but I haven't found the time to do it yet ^^
Some more thoughts about tags.
BTW, IMHO tags should be as general as possible: not “2nd Person Formal”, but “2nd Person”, “Formal”. This makes them more useful for automated processing.
> So far, I've not figured out how to remove a tag from a sentence
First step - become a moderator.
> So far, I've not figured out how to remove a tag from a sentence
The algorithm is:
a) add the same tag to any other sentence (temporarily)
b) copy the delete link and replace the sentence number
c) delete your temporary tag
This is a bug. Use only when you’re absolutely sure it’s applicable.
Am I doing something wrong or is there no such function?
I want to search for English sentences that are NOT translated into Russian (so I could add translations), and I don't see how to do it.
Click on the English flag on the top right corner of the main page. This will give you ALL the English sentences available. Then on the right side you'll see options to narrow down the list. You want to choose the one "Show Sentences Not Directly Translated Into...", and then "Russian". Voilà.
As a side note, you should probably select "Show Translations in...", then "All Languages". Sometimes, there's already a good English-to-Russian translation, but it just hasn't been linked. For efficiency reasons, it might be better to just request that someone with the power to link link the two.
Question Regarding Transliterations:
How much work, and what exact steps, are needed to set up a transliteration system for a specific language? I ask because we have a number of languages now that can be written in multiple alphabets. For many of these, the task seems to be a very simple one, as a one-to-one letter correspondence between the alternative alphabets exists (e.g. Serbian). For some, it is only possible to do it in one way but not the other (unless a dictionary is available), but again, the task should not be a difficult one as the majority of the current entries are inputted on the good side of the one-way (e.g. Uighur, Uzbek, and - I think, but Demetrius could confirm - Tatar).
So, what would one have to do to realize this?
It's easy, but I'm too lazy to do this. ^^ I've started working on Uzbek, maybe I'll finish it someday...
Also, sysko has to do transliteration caching. It will allow making transliteration more time-intensive (dictionary searches...).
> one-to-one letter correspondence between
> the alternative alphabets exists (e.g. Serbian).
But injekcija = инјекција, not ињекција. So you need a dictionary when transcribing from Latin...
Latin > Arabic is the easiest (unless people omit ' and don't differentiate j/zh ^^).
Since we have no Latin Uyghur sentences, there is no rush.
Others require a LARGE dictionary of proper names, since Arabic has no capital letters. Cyrillic requires a dictionary of Russian loanwords.
> and - I think, but Demetrius could confirm - Tatar
No, this one is tricky in both directions. The hardest part is q and ğ. Usually к = k and г = g w/ front vowels, к = q and г = ğ with back ones.
But Arabic words break vowel harmony:
нигъмәт — niğmət ‘dish’ (ğ is marked as гъ), сәгать — səğət ‘clock’ (ğ is marked by changing vowel letter, the vowel quality is marked by the soft sign), сәгатем — səğətem ‘my clock’ (ğ is marked by a vowel letter w/out a soft sign)
Russian loanwords break vowel harmony in another way: they force K and G even near back vowels.
Also, there are W and V:
В = V (вагон — vagon ‘carriage’), W (авыл — awıl ‘village’)
У = U (су — su ‘water’), W (тау = taw ‘mountain’)
Ү = Ü (күрү — kürü ‘see’), W (Мəскəү — Məskəw ‘Moscow’)
> But injekcija = инјекција, not ињекција. So you need a dictionary when transcribing from Latin...
Good point. But transliteration can't handle letter pair --> single letter correspondence? That would be much easier than a dictionary. Unless there are instances of "nj" that are нј and not њ, but this is never the case. Anyway, there are nearly no Latin Serbian submissions up to now, and so a one-way from the Cyrillic (if that's what it comes down to), is perfectly okay, IMO.
I disagree here. You'd need a big dictionary for Latin to Arabic. One particular example is n+g and ng (نگ and ڭ). The Latin "ng" could be transliterated as either. I think there are other cases, as well (personally, I don't much like the Latin Uighur...)
Arabic to Latin WITHOUT proper names is perfectly all right, in my opinion. It's not perfect, but it would still make a world of difference and people would usually be able to figure out what should be capitalized anyway.
> Good point. But transliteration can't handle
> letter pair --> single letter correspondence?
Of course it can. I mean that usually nj = њ, but in the word injekcija there's a morpheme boundary, so it should be retained as нј.
> Unless there are instances of
> "nj" that are нј and not њ,
> but this is never the case.
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
In fact, you don't need a dictionary at all for Latin>Arabic. Only for zh/j and ' if people omit these.
For all the other directions you need a dictionary.
> The Latin "ng" could be transliterated as either.
No, as far as I know. Latin requires breaking these with ': ng and n'g.
> I think there are other cases,
> as well (personally, I don't
> much like the Latin Uighur...)
But it's indeed very easy to process.
I should have said "this is never the case, with the exception of a few foreign words". A small dictionary could be made for those, but the loss isn't great if the transliteration doesn't handle them properly.
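To make the Serbian case concrete, here is a rough Python sketch of the approach discussed above: letter pairs map to single Cyrillic letters, and a small exception dictionary covers the few foreign words where "nj" must stay нј. The names `to_cyrillic` and `EXCEPTIONS` are made up for illustration, the exception list contains only the one word mentioned in this thread, and the sketch handles lowercase single words only.

```python
# Letter pairs that normally collapse into one Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Single letters. "\u0458" is Cyrillic ј, written as an escape
# because it looks identical to Latin j.
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д",
    "đ": "ђ", "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и",
    "j": "\u0458", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "š": "ш", "t": "т", "u": "у",
    "v": "в", "z": "з", "ž": "ж",
}

# Hypothetical exception list: words where "nj" spans a morpheme
# boundary and must not become њ.
EXCEPTIONS = {"injekcija": "ин\u0458екци\u0458а"}

def to_cyrillic(word):
    """Transliterate one lowercase Latin-script Serbian word."""
    lower = word.lower()
    if lower in EXCEPTIONS:
        return EXCEPTIONS[lower]
    out = []
    i = 0
    while i < len(lower):
        pair = lower[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Unknown characters (punctuation, digits) pass through.
            out.append(SINGLE.get(lower[i], lower[i]))
            i += 1
    return "".join(out)

print(to_cyrillic("konj"))       # digraph rule applies
print(to_cyrillic("injekcija"))  # exception dictionary applies
```

Growing the exception list word by word is cheap, and a word missing from it only yields њ where нј was wanted.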
> No, as far as I know. Latin requires breaking these with ': ng and n'g.
In that case, it's fine.
> I disagree here. You'd need a big dictionary
> for Latin to Arabic.
I meant Arabic to Latin.
I think what I can do is this:
have an automatic transliteration tool for each language;
each time a new sentence is added or updated, or if the transliteration is missing, we call the tool and store the result in a specific table of the database;
if it already exists, we just retrieve it from the database;
and maybe add a special page for trusted users, to give them the possibility to edit the stored transliteration.
This way:
your dictionary will not need to handle eeeeevery case. As soon as it handles the most common cases we can deploy it, even if some particular sentences will need a manual edit. This way we can complete the dictionary step by step and make the feature available sooner.
And when a transliteration is edited it will be flagged, so that if for some reason we update the tool and regenerate the transliterations, we will not erase the manually edited ones (as we can suppose they're right).
I think this is the best thing we can do, as for a lot of languages making a 100% correct transliteration tool is a dream anyway (even with long-running tools such as mecab and adso we reach, I think, 90%).
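The caching idea above could be sketched roughly like this in Python. It is only an illustration, not Tatoeba's code: the class and method names are invented, an in-memory dict stands in for the database table, and re-running the tool when a sentence's text changes is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class StoredTransliteration:
    text: str
    manually_edited: bool = False  # set when a trusted user fixes it

class TransliterationCache:
    def __init__(self, transliterate):
        self._transliterate = transliterate  # callable: str -> str
        self._rows = {}  # sentence_id -> StoredTransliteration

    def get(self, sentence_id, sentence_text):
        """Return the stored transliteration, generating it if missing."""
        row = self._rows.get(sentence_id)
        if row is None:
            row = StoredTransliteration(self._transliterate(sentence_text))
            self._rows[sentence_id] = row
        return row.text

    def edit(self, sentence_id, corrected_text):
        """Store a manual correction and flag it."""
        self._rows[sentence_id] = StoredTransliteration(
            corrected_text, manually_edited=True
        )

    def regenerate(self, sentences):
        """Re-run the tool on {sentence_id: text}, keeping manual edits."""
        for sentence_id, text in sentences.items():
            row = self._rows.get(sentence_id)
            if row is not None and row.manually_edited:
                continue  # assume the human correction is right
            self._rows[sentence_id] = StoredTransliteration(
                self._transliterate(text)
            )

# str.upper stands in for a real transliteration tool.
cache = TransliterationCache(str.upper)
print(cache.get(1, "abc"))
cache.edit(1, "Abc")
cache.regenerate({1: "abc", 2: "de"})
print(cache.get(1, "abc"), cache.get(2, "de"))
```

The flag is what lets an imperfect tool ship early: improving the tool later and regenerating never overwrites a correction made by hand.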
OK, that is a good idea. I'll send my (imperfect) Uzbek transliterator on Monday.
For some, Demetrius has started working on them; you can check with him what he has already done, what is hard to do, etc.
For the others, just give me a letter-to-letter / word-to-word transliteration file (like origin[tab]transliterate) and I think it will not take me long to integrate it into Tatoeba.
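As an illustration of the origin[tab]transliterate file mentioned above, here is a small Python sketch that loads such a file and applies it, trying whole words first and falling back to letter-by-letter replacement. The function names and the word-then-letter order are assumptions; the post only specifies the tab-separated layout.

```python
def load_mapping(lines):
    """Read origin<TAB>transliteration pairs into a dict."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip blank or malformed lines
        origin, target = line.split("\t", 1)
        mapping[origin] = target
    return mapping

def apply_mapping(text, mapping):
    """Transliterate space-separated text with the loaded mapping."""
    out = []
    for word in text.split(" "):
        if word in mapping:
            # Word-to-word entry takes priority.
            out.append(mapping[word])
        else:
            # Otherwise go letter by letter; unknown letters pass through.
            out.append("".join(mapping.get(ch, ch) for ch in word))
    return " ".join(out)

# Made-up entries, only to show the mechanics.
mapping = load_mapping(["a\t1\n", "b\t2\n", "ab\tX\n"])
print(apply_mapping("ab ba", mapping))
```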
I've been seeing if I'll just grow accustomed to the new contribute setup but so far I haven't.
The random sentences had the drawback that occasionally you hit sentences you'd seen before. Getting the latest sentences has the advantage that you more easily spot multiple sentences that can be linked to your translation, but increases the times one bumps into sentences one's already seen (and not translated for whatever reason -- the exclude direct translations feature is great, by the way).
Another advantage of getting sentences in the order they were contributed in is that you translate whole batches of similar sentences, making search results more well rounded even if they cover fewer topics.
How about flipping the order on the contribute page around (or adding the option), starting with the oldest sentences? That way one can work one's way through the batch, filtering out already translated sentences and starting after the ones one has chosen not to translate.
As a bonus, these are generally orphaned sentences and often need more attention than the ones active contributors are adding.
You can get orphan sentences by setting "translated into none" :)
As for the "several random sentences" page, I will bring it back soon. For the moment this page is slow as hell and was really badly designed internally, but I've figured out how to make it fast, so it will soon be as before.
The page which shows all the sentences in a language shows the newest first, because this favours collaboration: when I add a sentence, it gets more chance of being translated than an old one because it appears on the first page for a while (depending on the language's activity; not sure "a while" will be very long for Russian or Esperanto :p). But you can get to the oldest ones by clicking on "last" :p (more seriously, yep, we can add a reverse-order button)
Not quite sure what you mean by "translated into none". By "orphaned" I mean the sentences that aren't owned by anyone.
Going to the last page is certainly a workaround, but showing the oldest first would make browsing and finding where one left off much easier (as the reference point wouldn't be moving constantly).
The collaboration argument is a very good one, but it would still be nice to be able to choose to work on the back-log. :-)
Not in this case - because the only difference is whether one word is in hiragana or in kanji.
Duplicate removal script not reassigning 'meaning' fields?
It looks like when sentences like
are removed by the duplicate removal script the Japanese sentence it is attached to
does not automatically have the link to the old sentence location removed. It also doesn't automatically have the meaning field changed to point to the new sentence
Is this something recent, or could it have happened during an older run of the duplicate removal script?
It must have happened since last Saturday (because I check every Saturday and Thursday).
I would check for links that go to sentences that don't exist.
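Such a check could look roughly like the Python sketch below. It assumes you have the set of existing sentence ids and a list of (sentence_id, linked_sentence_id) pairs, for example from an export; the function name and the data layout are assumptions, not the actual script.

```python
def dangling_links(sentence_ids, links):
    """Return the links where either end is not an existing sentence."""
    existing = set(sentence_ids)
    return [
        (source, target)
        for source, target in links
        if source not in existing or target not in existing
    ]

# Made-up ids: sentences 4 and 5 no longer exist.
print(dangling_links([1, 2, 3], [(1, 2), (2, 4), (5, 1)]))
```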
ok I will check what goes wrong in the script
Some preventive measures?
timsa (http://tatoeba.org/eng/user/profile/timsa) has added lots of sentences that are either not translations (but may look like them to a learner), or too impolite/vulgar, or too colloquial...
I think we need some rules regarding the user behaviour.
I brought up a similar point before, but this is different. When a user productively contributes bad sentences, it is a major problem (it's not always easy to detect, and "volunteer" contributors don't have time to police each other), and so I agree with Demetrius that some sort of system needs to be in place.
I would agree with what Trang has said before, in that banning users isn't the solution (not unless Tatoeba makes strict application procedures, which would probably do more harm than good).
The only "smart", "automatic" way for these kinds of problems to be handled, IMO, is to set up some sort of quota system, where users can contribute more sentences as their "trustworthiness" increases (based on feedback from other users). I have also proposed this before as an automatic means of regulating sentence quality. But it's probably a pipe dream, as it may be hard to code, and probably couldn't be implemented for a long, long while...
> The only "smart", "automatic" way for these kinds of
> problems to be handled, IMO, is to set up some sort of
> quota system, where users can contribute more sentences
> as their "trustworthiness" increases (based on feedback
> from other users).
I'm not keen on that approach. There are a number of
technical and policy problems. On the technical side,
there are users who are the only posters in a certain
language (like Hindi). Nobody would be able to tell
whether they were trustworthy or not and there would be
no feedback to give them a larger quota! I also don't
want to discourage new users from being enthusiastic when
they start out.
You're right. Those are also the reasons why I'm a bit conflicted about it. My only comments:
1) On keen users
Yes, I think a lot of us have been there. I would have hated it if people told me I could, say, only contribute 10 sentences/day when I started. At the same time, when you have bad keen users, you do run into the double-edged problem. Perhaps moderators could play a key role here, and manually raise the quota for users who request it (under the condition that it be lowered back if the trust is betrayed).
2) On "exotic" languages
Similar comment. Perhaps a mod could raise the quota and let the user contribute, then lower it if doubt arises for whatever reason. Also the same double-edged problem, too. What if a user contributes completely bogus sentences in some exotic language? Well, that's less likely, but still...
I think we really just need a few more moderators - particularly moderators that can 'cover' a language that isn't well covered now. I can only really do a proper job of moderation in English and Japanese, for example.
Just out of curiosity, do you have a particular moderator:user ratio in mind?
I'll just say that given the amount of moderator stuff that goes on now, for the amount of moderators and users we have now, then I think we need about 50% more moderators.
My personal opinion is the same as blay_paul's: we need to find more people to be moderators and we need to provide them with a set of pages/tools to ease their task. Maybe we could also add something which may be seen as controversial: the possibility for a moderator (maybe requiring two moderators) to remove a user's ability to add sentences for a short period of time, if someone goes crazy and begins to flood.
This way it's not banning, but it will somewhat force the person to get in contact with us and to solve the problem diplomatically.
As for exotic languages, I think the problem is still the same, though it's even harder to find moderators. But I think (maybe I'm just dreaming) that the more exotic a language is, the more likely it is that people contribute with a good reason in mind, not for flooding or the like; maybe with errors etc., but at least not intentionally.
When it comes to regulating behaviors, the community plays a very important role.
You can find most of the "rules" on how to behave properly in Tatoeba in the articles here:
If you have read the 3 articles, you know pretty much what is a good and a bad behavior, what is acceptable or not. When people don't follow the guidelines, all you can do is be patient and try to teach them.
There will always be people who don't understand the system, or who have a different point of view of the project. They will not behave the way we may expect them to, they will not be using Tatoeba as we may expect them to. It's impossible to prevent people from breaking the system (whether it's intentionally or not).
What we can (and will) do though is try to design a system that can detect unwanted behavior as early as possible. Then as sysko said, we just need to contact the person. Most people are cooperative and re-adjust their contributions when we simply and kindly ask them to. Those who don't cooperate are usually people who add a few sentences and simply never come back, and in such cases, we have moderators. Although, as blay_paul said, we need more moderators to cover all the important languages... If you think you can take this responsibility for the Russian part, let me know :)
Gruzilkin (http://tatoeba.org/eng/sentence...ser/Gruzilkin) has added some Bulgarian sentences marked as Russian. It's either an auto-detection failure or an intended action (he brought the number of Bulgarian sentences to 500 this way).
Can we reassign the language of these in a batch?
Everything I added was in Russian, I think auto-detection doesn't work well with Russian/Bulgarian
Please excuse my suspicion. ^^
BTW, he didn't add any genuinely Bulgarian sentences. ^_^
I can reassign them in a batch if you tell me the criteria. Are all his contributions marked as Russian in fact Bulgarian?
[not needed anymore- removed by CK]
I'm guessing those two sets would probably merge through the german sentence...
As they're not strictly equal they will not be merged automatically. But yep, a moderator can link them :)
Finding new sentences:
When there is a new vocabulary I want to learn, I look up example sentences in Tatoeba. I translate the sentences into my native language to get used to its collocations and nuances, and then I add some sentences to my SRS system (anki in my case).
But when I do not find a sentence here, I have great difficulty finding meaningful sentences on the web.
Even if I use the option "in the text of the page", I mostly get meaningless results such as headlines or single words, but not nice sentences. Does anyone have any ideas about how to find sentences in a more efficient way?
The above problem led me to the idea that users should be able to ask for sentences containing certain words, because native speakers might come up with sentences more easily.
Finding good example sentences in text is notoriously difficult.
A corpus-query system I use, called the Sketch Engine, has an option called GDEX ("good examples") which can automatically promote sentences to the top of the result list depending on how much they look like good examples. It uses a couple of heuristic measures such as sentence length, the presence or absence of punctuation, the presence or absence of anaphoric pronouns, and so on. It is far from perfect but it is a good start.
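To give a feel for this kind of heuristic ranking, here is a toy Python sketch. It is not the Sketch Engine's GDEX formula: the three checks, the word-count limits, the pronoun list and the equal weights are all assumptions picked to mirror the measures named above, and it only makes sense for English.

```python
import re

# Pronouns that usually need earlier context to be understood.
ANAPHORIC = {"he", "she", "it", "they", "this", "that", "these", "those"}

def example_score(sentence, min_words=5, max_words=20):
    """Score a sentence from 0.0 to 3.0; higher looks more example-like."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    score = 0.0
    if min_words <= len(words) <= max_words:
        score += 1.0  # neither a fragment nor a rambling sentence
    if sentence.rstrip().endswith((".", "!", "?")):
        score += 1.0  # looks like a complete sentence
    if not ANAPHORIC.intersection(words):
        score += 1.0  # understandable without earlier context
    return score

def rank_examples(sentences):
    """Sort candidate sentences, best-looking examples first."""
    return sorted(sentences, key=example_score, reverse=True)

candidates = ["It broke", "The cat sat quietly on the warm windowsill."]
for sentence in rank_examples(candidates):
    print(example_score(sentence), sentence)
```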
This sounds nice. Actually I used a similar system for looking up English sentences, namely the BYU system for browsing the British National Corpus, freely available for students:
But does such a system exist for Japanese corpora?
I mean, besides the search facility, the corpora are the other essential ingredient. That's why I tried to use Google, cause they do have the data.
I signed up for the 30 day free trial on Sketch Engine, but I couldn't find the GDEX option. I think it may not exist for the Japanese corpus. It also looks like the Japanese data is not very well filtered to eliminate spam sites and such. It's basically all taken off the web so it isn't much better than Google in that respect. If you search on a spam-worthy word like ヘルス you'll see how bad it can get (NOTE! Search result may not be work friendly.)
The GDEX option is well hidden in the Sketch Engine. When you're looking at a concordance, click "View Options" at the top left and then you'll see a checkbox titled "Sort good dictionary examples". This has to be checked for GDEX to be used.
As far as I know this option is available for every corpus in every language but it may not work so well on languages other than English because -- no surprise -- it was originally built for English.
Found it. It does actually work quite well.
For sure, it's something we have wanted for a long time. We will make it "good" and smart in the next BIG release (not the next one, the next BIG one),
but maybe I can try to make something quick and dirty.
I will see, talk with Trang, and give you an answer.
But in the end it will be done.
> Does anyone have any ideas about how to find sentences
> in a more efficient way?
This is something I do a lot. Unfortunately I don't have
any magic solutions. Here's one approach that might work
for you, though.
1. Start with a word (e.g. 潔癖)
2. Look it up in the sort of dictionary that gives example sentences.
(Don't trust dictionary example sentences completely!)
3. Use the context given in the example sentence to narrow
down the Google search (e.g. search on "潔癖" + "手を洗う").
This will usually get you much better results than just
searching on the word alone.
Here are a few I found ...