
Wall (5495 threads)

2010-06-13 00:45

To quote myself:
"We count on everyone to try and help us figure out what works best. Feel free to discuss about issues related to tags on the Wall."
2010-06-13 02:13
Perhaps in addition to having user-defined tags, you could also create a list of admin-defined tags and offer them in a SELECT OPTIONS pull-down menu. ... and/or ... offer a few basic tags as clickable tags; in other words, a user could click a tag and it would appear in the tag input box.

This would help standardize tags and avoid spelling errors. This could also make all standard tags in English, which may or may not be a good idea.

Another good thing about pre-defined tags would be that you could then allow people who are not trusted_users to immediately use these pre-defined tags, even though you may only want to allow trusted_users to add user-defined tags.

Here are a few suggested pre-defined tags.

** by quality, acceptability, ...
checked (or perhaps "proofread", or just "OK")

** by topic **

Perhaps other people can offer more suggestions.
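
The permission rule suggested above (anyone may use a pre-defined tag, only trusted users may add their own) could be sketched roughly like this in Python. The tag list and the function name are assumptions made up for illustration, not anything Tatoeba actually has.

```python
# Hypothetical admin-defined list; "checked" is the example tag from this post.
PREDEFINED_TAGS = {"checked"}


def can_apply_tag(tag, is_trusted_user):
    """Anyone may apply a pre-defined tag; only trusted users may add other tags."""
    return tag in PREDEFINED_TAGS or is_trusted_user
```

With this rule a new member could tag a sentence "checked" right away, while a free-form tag would still require trusted_user status.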
2010-06-14 22:35
Concerning "checked by X people" => yes, it's something we have thought of, but for technical reasons we didn't do it. sysko may tell you more about it, but at any rate, it was not very urgent.

Concerning the pre-defined tags, we also thought of it. I forgot to talk about it in the blog post... But I'm still unsure of what to put in the list of pre-defined tags. I actually first want to let all users tag with whatever they want.

I also remember I forgot to talk about auto-completion. Well in general there are lots of things we can do with tags =)

I'm okay with using "OK" instead of "checked". I'm not totally sure people will understand that it means the sentence has been proofread, but it's shorter, and we can later add descriptions to tags. We can also change the name if it turns out to be too confusing.
2010-06-13 02:15
Though it would require a bit more programming, perhaps you could get the "checked" tag to be similar to the Facebook or YouTube "Like." That is, get it to say "Checked by 3 people" or "Checked by 7 people" etc.

Then users could feel fairly certain that a sentence was good if there were 2 or more "likes."

This way, you could sort your sentences by number of people who have checked a sentence, then perhaps only put a certain number of those sentences into the group of sentences that show as "random sentences."
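
A minimal Python sketch of the "Checked by N people" idea, assuming checks are stored as a mapping from sentence ID to the set of users who checked it. That data layout, the function names, and the default threshold of 2 are all assumptions for illustration.

```python
import random


def check_counts(checks):
    """Map each sentence ID to the number of distinct users who checked it."""
    return {sentence_id: len(users) for sentence_id, users in checks.items()}


def sort_by_checks(checks):
    """Return sentence IDs ordered from most-checked to least-checked."""
    counts = check_counts(checks)
    return sorted(counts, key=lambda sid: (-counts[sid], sid))


def random_pool(checks, min_checks=2):
    """Keep only sentences checked by at least `min_checks` people."""
    counts = check_counts(checks)
    return [sid for sid in sort_by_checks(checks) if counts[sid] >= min_checks]


def pick_random_sentence(checks, min_checks=2, rng=random):
    """Pick a random sentence from the well-checked pool, or None if it is empty."""
    pool = random_pool(checks, min_checks)
    return rng.choice(pool) if pool else None


# Hypothetical data: sentence ID -> users who marked it as checked.
checks = {
    101: {"alice", "bob", "carol"},
    102: {"alice"},
    103: set(),
    104: {"bob", "carol"},
}
```

Here only sentences 101 and 104 would be eligible to show up as "random sentences", since they are the only ones checked by 2 or more people.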
2010-06-12 16:39
* Sentences vs. Non-sentences *

At first, I understood that Tatoeba wanted sentences only, but a couple of members have told me that non-sentences are OK, too.

What is the policy?

If non-sentences are allowed, then perhaps Tatoeba needs to change some of the wording on the website.

"Collecting example sentences"
"Random sentence"
"Add new sentences"
"Translate sentences"
"Check and correct sentences"
2010-06-13 00:40
Well, the definition of "sentence" is still not exactly clear, to me at least. I mean, sometimes a "sentence" can be just a word... Like "Hello".

One thing is sure: you should avoid adding something that is clearly a partially formed sentence. Instead of "to be in love", you should add "He is in love" (for instance).

But in general, we are not very strict on the matter of what is accepted or not because we haven't decided yet what is a sentence.

We have only decided that a sentence has punctuation :)

The other problem is that I actually wouldn't even know what word to use instead of "sentence"...
2010-06-13 06:24
> The other problem is that I actually wouldn't even
> what word to use instead of "sentence"...

I think 'sentence' is a useful approximation - especially if you take off your grammarian hat. ;-)
2010-06-09 11:09
I thought Tatoeba supported my native language (Farsi).
But now I know it was just a thought.
2010-06-09 14:31
Thanks. So I'll start adding some sentences, and after a while I'll ask you to add my language to your list.
I think this is better, because I'm not that active and have little free time.
2010-06-09 14:33
No problem. Even if there are only a dozen, that's enough; most of the time it's enough to attract more people to contribute in your language :)
2010-06-09 14:47
Ok. But could you tell me how to stick a flag?
2010-06-09 14:50
The flag will be added by us in the interface, at the same time as the language itself.
2010-06-09 14:39
Hi Hamid :)
Which flag do you think should be used?
Iran, Afghanistan...?
2010-06-09 14:43
Of course Iran (Farsi).
Afghanistan is Pashto.
2010-06-09 11:31
You can add sentences now and correct the flag when it's added to the list. Probably wouldn't take long.
2010-06-09 13:14
Hi Hamid, the fact that your language is not in the list doesn't mean we don't want it or won't support it. It's just that we add a language to the list as soon as we have some sentences in it; otherwise the list would be full of hundreds of languages with 0 sentences, which would not be comfortable for users.
But if you're ready to add some sentences in your language, we will be glad to add it to the list :)
2010-06-13 00:28
Your language has been added :)

You can now set your sentences to "Persian".

Little note: for the name of the language we used "Persian" and not "Farsi", as Wikipedia says it's "the more widely used name of the language in English".
2010-06-03 08:12
I am not sure whether this is worth discussing, but there are some sentences which are really redundant, e.g.
162883, 83091, two rather long sentences which only differ in the subject being "my mom" vs. "my dad".
Shouldn't we remove one of such pairs and concentrate on the gist instead of wasting our efforts on translating countless variants?
2010-06-12 03:31
Some near duplicates are good since you can compare them, some near duplicates are perhaps just clutter.

* Examples of what may be considered good near duplicates:
- He studies English in the early morning.
- She studies French in the late afternoon.

* Examples of what may be considered to just be clutter:
- He studies English.
- She studies English.
- John studies English.
- Fred studies English.
- That man studies English.
- That woman studies English.

Perhaps sentences of the "clutter" type need to be dealt with.
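
One rough way to surface candidate "clutter" pairs is to look for sentences that differ by only one word. The Python sketch below is just a heuristic under stated assumptions: it splits on word characters (so it would not work for Japanese without a segmenter), and the function names and the threshold of 1 are made up for illustration. Whether a flagged pair really is clutter would still be a human decision.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def tokens(sentence):
    """Lower-cased word tokens; punctuation is ignored."""
    return re.findall(r"\w+", sentence.lower())


def differing_tokens(a, b):
    """Number of token positions in the longer sentence that have no match in the other."""
    ta, tb = tokens(a), tokens(b)
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return max(len(ta), len(tb)) - matched


def one_word_variants(sentences, max_diff=1):
    """Return pairs of distinct sentences that differ by at most `max_diff` tokens."""
    return [
        (a, b)
        for a, b in combinations(sentences, 2)
        if 0 < differing_tokens(a, b) <= max_diff
    ]


clutter = [
    "He studies English.",
    "She studies English.",
    "John studies English.",
    "Fred studies English.",
    "That man studies English.",
    "That woman studies English.",
]
```

On the lists above, every "He / She / John / Fred" pair is flagged, while "He studies English in the early morning." and "She studies French in the late afternoon." differ in four words and are left alone.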
2010-06-12 10:43
Hi CK,
I completely agree with your notion of near duplicates versus clutter.
I think that besides "dealing" with clutter that already exists, we should also put some effort into guidelines about creating new content.
2010-06-11 19:08
Okay I haven't replied to this yet so I will, to make it clear about "variations" of sentences.

Our position is: people can do whatever they like. If they want to add all the possible variations, they can. If they don't want to, they don't have to.

It doesn't hurt to have "near duplicates". It just makes Tatoeba a bit noisy. But that's our job, as engineers, to figure out how to filter and organize data so that it can be used efficiently for language learners.

Meanwhile, as sysko said, variations of sentences can be very useful for language processing, so we shouldn't delete them.
2010-06-11 19:43
Just to clarify the clarification. Near duplicates will be removed from WWWJDIC - but not by deleting them from Tatoeba. So feel free to point out Japanese sentences and English sentences linked to Japanese sentences that are near duplicates.
2010-06-19 13:26
Hi Paul, I see you always post a "Not for WWWJDIC" comment on each such sentence. Shouldn't that be solved by using tags?
2010-06-19 13:33
I could, but I started doing that before tags existed.

It also gives people a chance to notice what sentences I'm excluding and ask why (or just complain ;-).
2010-06-19 13:53
So do you filter the sentences according to your comment, or do you mark them somewhere else AND put a comment in?
I just want to know how we should approach sentences we find should not appear there (e.g. hiragana-kanji variants of exactly the same sentence.)
2010-06-19 14:08
In the secret sentence annotation page, where the Japanese index can be entered / edited, I put -1 in the meaning field.

No one else can see that so the note is just to let people know what I'm doing (generally excluding near-duplicate sentences from WWWJDIC).
2010-06-03 10:34
On the other hand, I'm working with another guy on a machine-learning-based automated translator, and these kinds of "near" duplicate sentences are REALLY useful.
2010-06-03 10:37
In fact, as a learner I also sometimes like to find this kind of sentence where only one part changes. It's easier to see some grammar points this way (for example, in French, changing "my mom" to "my dad" could change the verbs, adjectives and so on in the sentence, and it's always interesting to see this variation on the same sentence).
2010-06-03 17:23
On this point, I've chosen to add these nuances in comments. There are otherwise just going to be way too many similar sentences.
2010-06-03 09:24
> Shouldn't we remove one of such pairs and concentrate on
> the gist instead of wasting our efforts on translating
> countless variants?

There is a constant effort to remove near-duplicates. At the current rate we're probably losing a couple of dozen a week, if not more.

However removing duplicates does not produce _new_ content. And new content is what's needed to fill out Tatoeba and make it more appealing.
2010-06-03 10:06
Yes, you are right, producing new content is also important, though as a native German speaker I am right now mostly busy adding German translations to the already existing Japanese-English sentence pairs. And that's when I came across these near-duplicates.
Currently I am thinking about how I could involve my Japanese language exchange partner to produce some content. At least, I will check with her some sentences I found dubious.

So what would be the best procedure if I come across such a sentence pair? Make a comment? Add it to the "mark for deletion" list?
2010-06-03 10:44
Moreover, I think the problem here is not whether to have these countless variants (for the reasons below I would prefer to keep them), but rather how to show contributors only "useful" sentences.
2010-06-10 13:00
Please could we have the duplicate removal script run soon? (Before Saturday, anyway)
2010-06-10 17:32
I would also like to ask for a manual update of the Launchpad translations (in the Tatoeba > Launchpad direction) so that all the new stuff can be translated :) Thanks ^^
2010-06-11 00:55
2010-06-10 01:29
Re: × 君〔くん〕 ○ 君〔きみ〕

Perhaps, dropping the MeCab "furigana" and encouraging people who need the reading help to use Firefox with the Rikaichan plugin would help solve this kind of problem (and also perhaps speed up Tatoeba a bit.)

As a programmer, I understand the fun of trying to get all these things working correctly, but as a teacher, I would prefer things to be 100% accurate.

Rikaichan shows both readings and meanings for 君. In addition to that, students also get the EDICT English definitions which can often be useful.

Rikaichan's URL:

The following dictionaries are available for Rikaichan.
2010-06-10 11:36
But that would imply that everyone uses Firefox (or has to) ^^

Also, since Tatoeba has a broader public than students in a classroom, it wouldn't necessarily be a good thing to drop the furigana. A lot of people would rather have something that is 80-90% accurate than not having anything at all, because it saves them time. And in the end, it's people's own responsibility to decide whether they want something perfect or not.

Generating furigana is not what slows down Tatoeba the most. You wouldn't see a difference in speed if we took it out, so speed alone wouldn't justify removing it.

But as a teacher, you can (and actually *must*) educate your students not to rely on the furigana line, and use Rikaichan instead. It's not incompatible. I use it myself (and always did) whenever I want to figure out the reading of a Japanese sentence in Tatoeba, despite the fact that the reading is already displayed.
2010-06-09 17:05
On reading this comment, plus seeing sentences that are perhaps not "child friendly" or appropriate for use in school settings, I wonder if perhaps Tatoeba has matured to the point where something similar to Google's "safe search" needs to be considered.

I know that it's quite likely that no one wants to be responsible for deciding where to draw the line. I also have a feeling that some members will see no need to even consider that this is something that may be needed.

However, perhaps one possibility would be to have certain "potentially offensive" sentences only be displayed to members, and also only to members who "opt in" for seeing them.

Consider how a first time visitor may feel if he/she arrives on the main page and an "offensive" sentence is the one that is randomly displayed on the page.

This is something that should probably at least be added to Tatoeba's list of things to be considered.

2010-06-09 17:21
Once tags are added, we will be able to do the following:
use tags such as "unsuitable_for_children", etc., and maintain a list of tags that will not be accessible to non-members or to members who have not activated the "unsafe search" option.
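
A minimal Python sketch of that filter, assuming sentences carry a list of tags and members have an opt-in flag. The names `RESTRICTED_TAGS`, `is_visible`, and the `unsafe_search` key are hypothetical and only illustrate the rule described above.

```python
# Hypothetical list of restricted tags maintained by the admins.
RESTRICTED_TAGS = {"unsuitable_for_children"}


def is_visible(sentence_tags, user=None):
    """Return True if a sentence with these tags may be shown to `user`.

    `user` is None for visitors who are not logged in; otherwise it is a
    dict that may carry an "unsafe_search" opt-in flag (an assumption).
    """
    if not set(sentence_tags) & RESTRICTED_TAGS:
        return True
    return user is not None and user.get("unsafe_search", False)


def visible_sentences(sentences, user=None):
    """Filter a list of sentence dicts down to those the user may see."""
    return [s for s in sentences if is_visible(s["tags"], user)]


sentences = [
    {"id": 1, "tags": []},
    {"id": 2, "tags": ["unsuitable_for_children"]},
]
```

A first-time visitor would only ever get sentence 1 here, while a member who has opted in would see both.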
2010-06-10 10:24
Me, I'm very, very small.
2010-06-10 09:21
Why not add a few more search hints?

For example, I tried the following in the search and it worked.
(For those who can't guess what it will get, try it.)


2010-06-10 09:33
Simply because we haven't yet found time to rewrite the hint page :P
2010-06-09 17:58
If okurigana is incorrect, can we just file these as bugs in MeCab?
2010-06-09 18:57
In theory. However they are not really bugs in MeCab, but problems resulting from the dictionary used with MeCab. The dictionary used can both be selected (from a very short list ;-) and can be altered or aided by user-defined dictionaries.

So really what it needs is someone familiar with MeCab to find the best dictionary available and to add fixes for the problems noted.

However, it is probably never going to be possible to be 100% accurate, so to get the best results manual corrections will be needed at some point.
2010-06-10 08:44
It's not that simple.

Consider the sentence: 君たちの訳文と黒板の訳を比較しなさい。 MeCab suggests わけ as the reading of the solo 訳, whereas we all know it's やく. MeCab's usual dictionary (NAIST-JDIC) has both versions of 訳, and no amount of adding dictionaries is going to "fix" it. MeCab uses some very sophisticated AI to segment sentences, and the dictionaries have parameters derived from training on hand-segmented texts. The trouble is that ...の訳を... could be either, and you need the context of the whole sentence to decide which is which. In fact the weightings for 訳/わけ and 訳/やく as solo lexemes are the same. You could probably fiddle the weights on 訳 to make it produce やく, but most of the solo appearances of 訳 in Tatoeba are in fact わけ.

There is a whole research field of Word Sense Disambiguation (WSD) working on problems related to this, but I don't think there are any packaged solutions for Japanese that can be plugged into Tatoeba. Just be grateful we have MeCab - 20 years ago automatic Japanese segmenters were thought to be impossible to build.
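
Since the automatic reading cannot be made fully accurate, the manual corrections mentioned earlier in this thread could sit on top of it as a simple override table. The Python sketch below is only an illustration under assumptions: the function names, the sentence ID, and the table are made up, and `fake_auto_reading` is a stand-in that mimics the 訳 mistake described above, not a call into MeCab.

```python
def reading_with_corrections(sentence_id, text, auto_reading, corrections):
    """Return a hand-entered reading if one exists, else the automatic one."""
    if sentence_id in corrections:
        return corrections[sentence_id]
    return auto_reading(text)


def fake_auto_reading(text):
    # Stand-in for the automatic analyzer. It is NOT MeCab; it just returns
    # the reading with the solo 訳 read as わけ, as in the example above.
    return "きみたちのやくぶんとこくばんのわけをひかくしなさい。"


# Hypothetical table of manual corrections, keyed by a made-up sentence ID.
corrections = {1: "きみたちのやくぶんとこくばんのやくをひかくしなさい。"}
text = "君たちの訳文と黒板の訳を比較しなさい。"
```

Sentences without an entry in the table simply keep whatever the automatic analyzer produces.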
2010-06-09 18:43
2010-06-09 10:38
Help wanted!

I'm looking for someone to help with Tatoeba / WWWJDIC integration. Specifically, I'd like someone with database / web experience to work on tools for validating / completing the index data needed to link WWWJDIC dictionary entries to Tatoeba example sentences. If you're interested, post here for more details or PM me.