Wall (6,960 threads)
Goodbye to [M], [F]
Now that the tag system has arrived this is a good opportunity to get rid of the [F] and [M] tags. Please leave the others in place for now. I will email lists of sentence IDs for sentences you can apply 'female' and 'male' tags to.
Any progress on the male / female tag import?
There's no rush, but it would be nice if we could have it sorted out by the Saturday update for Jim.
Importing the tags should be done by Saturday.
Removing the [M] and [F] can be done by Saturday but I prefer not to rush on this.
Exporting the tags can be a bit tricky but I'll send an email to Jim to tell him what I can easily do, and see if that's fine with him.
What about the [XXX] sentences? I started taking them out of the sentences and adding a real tag instead. But since we don't export tags (yet) in our download files, I figured perhaps you still need to keep the tag in the sentence?
If you don't, then you can just erase the XXX
=> http://tatoeba.org/eng/tags/sho...s_with_tag/XXX
If you do need them, then you can add back the [XXX] where I took it off.
I don't think we should remove other tags yet. I started with [M] and [F] because (a) they are automatically generated so it was easy to update them and (b) they are not that important.
In particular I don't think taking the [XXX] tags out now is a good idea because there is still no way for users to filter them out. At least seeing the [XXX] tells users that we know it's a dodgy sentence.
OK, I've sent in the female / male tag list. The next question will be how to process them for WWWJDIC. You should check with Jim for that - although you could just add the tags to the end of their entries in WWWJDIC.csv with [ ] added around them.
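The suggestion above (adding tags to the end of entries with [ ] around them) could be sketched roughly as follows. This is only an illustration: the real layout of WWWJDIC.csv is not shown in this thread, so the tab-separated entry and the helper name `append_tags` are assumptions.

```python
def append_tags(entry: str, tags: list[str]) -> str:
    """Return the entry with each tag appended in square brackets.

    Hypothetical helper: the actual WWWJDIC.csv format is not described
    here, so the entry is treated as an opaque line of text.
    """
    # Drop a trailing newline so the tags end up on the same line.
    entry = entry.rstrip("\n")
    suffix = "".join(f"[{tag}]" for tag in tags)
    return entry + suffix


if __name__ == "__main__":
    # Invented sample line, not taken from the real file.
    print(append_tags("He is a student.\t12345", ["male"]))
```

Appending rather than embedding keeps the sentence text itself untouched, which matches the idea of moving [M] and [F] out of the sentences.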
I for one welcome our new tag system.
I hope you guys and girls do think about those of us who later need to implement some logic on the data. While tags add important knowledge for humans, I'm still not sure how to use this information when automatically selecting translations.
I do think though that "our" data is safe in your hands. Thanks for the good job!
sysko and I were wondering what exactly you meant by "I hope you guys and girls do think about those of us who later need to implement some logic on the data".
Is there anything specific we should think about? :)
He, thanks for asking.
My current perspective is how to select good sentences based on a given word. So if I have the word "love" I'll probably have several English sentences that include this word. Now if English had different pronouns depending on gender I would have two versions of each sentence, e.g. "I love you (male)" and "I love you (female)". Now think of even more specific tags, like "literal" and "metaphorical" translation. Which would provide the best examples? Probably the latter one (see discussion here http://tatoeba.org/deu/sentence...334#comments). "Literal" translations really are just that, translations. They probably don't make good example sentences. A good logic would need to take care of that.
As a programmer, I would need to have the answers here (I'm of course speaking from the perspective of integrating Tatoeba with Eclectus, sysko should know :). While tags provide a good logic for humans, they probably complicate things for me (and most probably for others).
So my basic point is, don't forget the machine readable side over the human readable website.
Not sure if you pondered all the cases where Tatoeba data could be used. But maybe such an analysis could assist your design decisions.
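To make the machine-readable concern a bit more concrete, here is a minimal sketch of selecting example sentences for a given word while skipping sentences that carry certain tags. Everything in it is an assumption for illustration: the `Sentence` class, the `pick_examples` function, and the choice of "literal" and "XXX" as tags to exclude are not part of any existing Tatoeba export.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    # Hypothetical record: sentence text plus the free-form tags users gave it.
    text: str
    tags: set = field(default_factory=set)


# Assumed policy: tags that make a sentence a poor *example* sentence.
EXCLUDED_TAGS = frozenset({"literal", "XXX"})


def pick_examples(word, sentences, excluded=EXCLUDED_TAGS):
    """Return sentences that contain `word` and carry none of the excluded tags."""
    word = word.lower()
    picked = []
    for sentence in sentences:
        # Very naive tokenisation: split on whitespace, strip basic punctuation.
        tokens = {tok.strip('.,!?"').lower() for tok in sentence.text.split()}
        if word in tokens and not (sentence.tags & excluded):
            picked.append(sentence)
    return picked


if __name__ == "__main__":
    data = [
        Sentence("I love you.", {"female"}),
        Sentence("Love is blind.", {"proverb"}),
        Sentence("I have love for the rain.", {"literal"}),
    ]
    for s in pick_examples("love", data):
        print(s.text)
```

A consumer like this only works if tags are exported in a predictable form, which is the point being made above.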
[not needed anymore- removed by CK]
Until we have a "links" section, perhaps you can simply put these kinds of links in your profile.
It will certainly make them easier to find than if they get buried in the Wall.
Trang or sysko, the "csv" export still has leading \n and \t escaped with a '\'. For example, French sentences from 181804 to 181859. Please use some trim function :D.
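Until the export is fixed, anyone consuming the file could trim the fields themselves. The sketch below is one possible way to do that on the consumer side; the function name `clean_field` is made up, and it assumes the stray characters appear either as real whitespace or as the literal two-character sequences `\n` and `\t` at the start or end of a sentence.

```python
import re

# Matches runs of real whitespace or literal backslash-n / backslash-t
# at the very start or the very end of the text.
_EDGE_JUNK = re.compile(r"^(?:\\[nt]|\s)+|(?:\\[nt]|\s)+$")


def clean_field(text: str) -> str:
    """Strip leading/trailing whitespace and escaped \\n or \\t sequences."""
    return _EDGE_JUNK.sub("", text)


if __name__ == "__main__":
    # Invented sample value, not an actual sentence from the export.
    print(clean_field("\\n\\tBonjour."))
```

Only the edges are touched, so an escaped sequence in the middle of a sentence is left as it is.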
[@moderator] Should be deleted
This list can now be substituted by adding the tag "to delete" to the sentence. You should still add a comment explaining why.
Note that the moderators may not speak the same languages you do - so an additional explanation in English may speed things up. :-)
Actually I just thought of one problem - there isn't a way to search for tags yet, is there?
[not needed anymore- removed by CK]
Yeah, we should be a little careful applying proverb tags to sentences with [proverb]. There may be [proverb] on non-proverb English sentences that are translations of Japanese proverbs.
Tags!
http://blog.tatoeba.org/2010/06...12th-2010.html
To quote myself:
"We count on everyone to try and help us figure out what works best. Feel free to discuss about issues related to tags on the Wall."
[not needed anymore- removed by CK]
Concerning "checked by X people" => yes, it's something we have thought of, but for technical reasons we didn't do it. sysko may tell you more about it, but at any rate, it was not very urgent.
Concerning the pre-defined tags, we also thought of it. I forgot to talk about it in the blog post... But I'm still unsure of what to put in the list of pre-defined tags. I actually first want to let all users tag with whatever they want.
I also remember I forgot to talk about auto-completion. Well in general there are lots of things we can do with tags =)
I'm okay with using "OK" instead of "checked". Although I'm not totally sure if people will understand that it means the sentence has been proofread, but it's shorter and we can later add descriptions to tags. We can also change the name if it turns out to be too confusing.
[not needed anymore- removed by CK]
[not needed anymore- removed by CK]
Well, the definition of "sentence" is still not exactly clear, to me at least. I mean, sometimes a "sentence" can be just a word... Like "Hello".
One thing is sure: you should avoid adding something that is clearly a partially formed sentence. Instead of "to be in love", you should add "He is in love" (for instance).
But in general, we are not very strict on the matter of what is accepted or not because we haven't decided yet what is a sentence.
We have only decided that a sentence has punctuation :)
The other problem is that I actually wouldn't even know what word to use instead of "sentence"...
> The other problem is that I actually wouldn't even know
> what word to use instead of "sentence"...
I think 'sentence' is a useful approximation - especially if you take off your grammarian hat. ;-)
I thought Tatoeba supported my native language (Farsi).
But now I know it was just a thought.
Thanks. So I will start adding some sentences, and after a while I will ask you to add my language to your list.
I think this is better, because I'm not that active and don't have much free time.
No problem. Even if there are only a dozen, it's enough; most of the time that's enough to attract more people to contribute in your language :)
Ok. But could you tell me how to stick a flag?
The flag will be added by us on the interface, at the same time as the language itself.
Hi Hamid :)
Which flag do you think should be used?
Iran, Afghanistan...?
Of course Iran (Farsi).
Afghanistan is Pashto.
You can add sentences now and correct the flag when it's added to the list. Probably wouldn't take long.
Hi Hamid, the fact that your language is not in the list doesn't mean we don't want it or won't support it. We only add a language to the list once we have some sentences in it; otherwise the list would be full of hundreds of languages with 0 sentences, which would not be comfortable for users.
But if you're ready to add some sentences in your language, we will be glad to add it to the list :)
Your language has been added :)
You can now set your sentences to "Persian".
Little note: for the name of the language we used "Persian" and not "Farsi", as Wikipedia says it's "the more widely used name of the language in English".
http://en.wikipedia.org/wiki/Persian_language
I am not sure whether this is worth discussing, but there are some sentences which are really redundant, e.g.
162883, 83091, two rather long sentences which only differ in the subject being "my mom" vs. "my dad".
Shouldn't we remove one of such pairs and concentrate on the gist instead of wasting our efforts on translating countless variants?
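For what it's worth, pairs like this can be found mechanically. Below is a minimal sketch that flags two sentences as near duplicates when they have the same number of words and differ in exactly one position. The function name and the sample sentences are invented for illustration (they are not the actual text of 162883 or 83091), and this is not how any existing Tatoeba script works.

```python
def differs_by_one_token(a: str, b: str) -> bool:
    """True if the sentences have equal word counts and exactly one word differs."""
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) != len(tokens_b):
        return False
    # Count the positions where the two sentences disagree.
    mismatches = sum(x != y for x, y in zip(tokens_a, tokens_b))
    return mismatches == 1


if __name__ == "__main__":
    first = "My mom bought me a new bicycle for my birthday."
    second = "My dad bought me a new bicycle for my birthday."
    print(differs_by_one_token(first, second))
```

Whitespace splitting is a crude notion of "word" and would not work for Japanese, so this only illustrates the idea for space-delimited languages.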
[not needed anymore- removed by CK]
Hi CK,
I completely agree with your notion of near duplicates versus clutter.
I think that besides "dealing" with clutter that already exists, we should also put some effort into guidelines about creating new content.
Okay I haven't replied to this yet so I will, to make it clear about "variations" of sentences.
Our position is: people can do whatever they like. If they want to add all the possible variations, they can. If they don't want to, they don't have to.
It doesn't hurt to have "near duplicates". It just makes Tatoeba a bit noisy. But that's our job, as engineers, to figure out how to filter and organize data so that it can be used efficiently by language learners.
Meanwhile, as sysko said, variations of sentences can be very useful for language processing, so we shouldn't delete them.
Just to clarify the clarification. Near duplicates will be removed from WWWJDIC - but not by deleting them from Tatoeba. So feel free to point out Japanese sentences and English sentences linked to Japanese sentences that are near duplicates.
Hi Paul, I saw you always post a comment "Not for WWWJDIC" in each sentence. Shouldn't that be solved by using tags?
I could, but I started doing that before tags existed.
It also gives people a chance to notice what sentences I'm excluding and ask why (or just complain ;-).
So do you filter the sentences according to your comment, or do you mark them somewhere else AND put a comment in?
I just want to know how we should approach sentences we find should not appear there (e.g. hiragana-kanji variants of exactly the same sentence.)
In the secret sentence annotation page, where the Japanese index can be entered / edited, I put -1 in the meaning field.
No one else can see that so the note is just to let people know what I'm doing (generally excluding near-duplicate sentences from WWWJDIC).
On the other hand, I'm working with another guy on a machine-learning-based automated translator, and this kind of "near" duplicate sentence is REALLY useful.
In fact, as a learner I also sometimes like to find this kind of sentence where only one part changes. It makes it easier to see some grammar points (for example, in French, changing "my mom" to "my dad" could change the verbs, adjectives and so on in the sentence, and it is always interesting to see this variation on the same sentence).
On this point, I've chosen to add these nuances in comments. There would otherwise be way too many similar sentences.
> Shouldn't we remove one of such pairs and concentrate on
> the gist instead of wasting our efforts on translating
> countless variants?
There is a constant effort to remove near-duplicates. At the current rate we're probably losing a couple of dozen a week, if not more.
However removing duplicates does not produce _new_ content. And new content is what's needed to fill out Tatoeba and make it more appealing.
Yes, you are right, producing new content is also important, though as a native German speaker I am right now mostly busy adding German translations to the already existing Japanese-English sentence pairs. And that's when I came across these near-duplicates.
Currently I am thinking about how I could involve my Japanese language exchange partner to produce some content. At least, I will check with her some sentences I found dubious.
So what would be the best procedure if I come across such a sentence pair? Make a comment? Add it to the "mark for deletion" list?
Moreover, I think the problem here is not whether to have these countless variants (for the reasons below I would prefer to keep them), but rather "how to show contributors only 'useful' sentences".
Please could we have the duplicate removal script run soon? (Before Saturday, anyway)