Here you can ask general questions, such as how to use Tatoeba or how to report bugs or strange behavior, or simply socialize with the rest of the community.
Make sure you have read the FAQ before asking a question.
Latest messages
Wall (4142 threads)
**introducing a feature to keep unowned translations from being displayed**
to extract one central point from the recent big discussion about improving the quality of the tatoeba corpus, I would like to again address separately the issue of deleting the unowned english sentences (and possibly, the japanese and/or french ones).
ck has stated that they have checked the entire english corpus and adopted all sentences they deemed worthwhile, so that the percentage of good unadopted english sentences right now should be "very low", if not zero.
*since the discussion about deleting the unowned sentences has not yielded any real results, I hereby request that they be deleted, and ask anyone interested to state whether they are for or against this.*
I want them to be deleted because it often happens that I search for a japanese word, and most if not all hits have only one direct translation in one of the languages tatoeba displays to me, english. these translations are frequently unowned sentences from the tanaka corpus, and they feel very unnatural to me. they also include many prime examples of my recently mentioned problem of tatoeba sentences lacking context (e.g. "he shuddered at the sight.": who? what sight?).
this generally makes me abandon the japanese sentence they are linked to as well because my japanese is not good enough to judge whether a sentence is good or not.
this process is always frustrating, because I have to read through sentence after sentence only to find that the english translation is rubbish, or at least untrustworthy. I would much prefer not having these translations displayed at all and speeding up my search by being able to only pay attention to sentences that are usable.
the situation is more complicated with japanese, where deleting all unowned sentences would mean losing 60-70% of the corpus, so I cannot really say anything on this one.
with french, I have only read one clearly stated opinion, which was sacredceltic's, saying that he would rather have them deleted right away because they are "a huge smear on the french corpus".
alternatively, I call for the feature of not displaying orphan sentences by default to be extended to direct and indirect translations of owned sentences. this way, even if the bad sentences remain, they are easy to avoid entirely.
another alternative would be not deleting the sentences, but simply hiding them from public display until a decision is made on how to treat them.
the only option I think we should definitely NOT go for is simply leaving things as they are.
I was misquoted. I never said "if not zero". You can see what I wrote in the 2nd paragraph of this post. https://tatoeba.org/eng/wall/sh...#message_25153
I wouldn't want to adopt those English sentences (https://tatoeba.org/eng/sentenc...amp;sort=words).
That doesn't necessarily mean that someone else won't adopt them.
Nowadays, I prefer to only adopt sentences that I think are the kind of good example sentences that I would have personally contributed.
I didn't check to see who tagged them OK, but perhaps they were tagged OK because there weren't any obvious grammar errors. I personally don't think just "lack of grammar errors" is enough for an OK tag.
It's funny how a simple tag and judgment such as "OK", or a behaviour such as adopting a sentence can mean so many different things, since they've never been clearly defined.
Personally, my policy is the following : outside of a clear definition (which I might not accept when defined...), I adopt only sentences that "I would say", except under the circumstances that I create a sentence in Tatoeba that I wouldn't say, but which, according to me, needs to be documented on Internet, because it's living but rare or interesting, and/or local, and I have confirmed the proof of its existence with local natives (for instance, I wouldn't say myself a lot of sentences that I tagged "Belgian-French", but I am sure they do exist, and I not only heard them but cross-checked them with locals, asking them to provide examples of use in context.)
But I OK-tag sentences that I know are OK, even though I wouldn't say them myself.
The nuance might seem minor, but that's actually a hell of a difference, because there are lots of sentences in my native language that I know are OK but which I don't use myself.
I don't tag "OK ", though, sentences that I disapprove of, although they might represent some kind of "valid" language...
So I apply a different policy when it comes to adoption and OK-tagging.
Here are my opinions on the 2 things that have been suggested.
**deleting unowned english sentences**
This might be a bad idea.
Eventually, I do agree it would be a good idea to get rid of these sentences. However, I think it should be done carefully.
**introducing a feature to keep unowned translations from being displayed**
This might be a good idea.
This would somewhat solve the problem. However, there would need to be a way for members to find these, so they could be reviewed, adopted and edited if needed.
Another idea would be to color code all unowned sentences, and possibly include an icon, too. We already do that with sentences that are possible copyright infringements or that are problematic in some other way.
If we chose another color for unowned sentences, then members would immediately know that the sentence may not be a good one to use for language study and probably shouldn't be translated. Perhaps it would be possible to make such sentences untranslatable, or at least not by members who haven't been registered for at least a few months. We already have it so that the "red" sentences can't have translations added by anyone.
One thing to note is that we still have many incoming English sentences that are as bad as a lot of the unadopted sentences. The same is true for Japanese, and I would assume for other languages as well.
If we could get a color-coding system working, it might be a good idea to color-code all non-native sentences, too. This would help warn members about sentences that may more likely contain errors or not sound natural.
If you want to see some color-coding ideas, see these pages. These are pages that were already online and not specifically made as examples for this discussion.
English Sentences from Tatoeba.org with Audio
Read the message at the top of the page to understand the color-coding.
Browsing Japanese Sentences, Showing Ratings
Jump to the first page to see sentences rated OK and tagged OK.
Jump to the last page to see sentences rated "not OK".
> real results, I hereby request that they be deleted, and ask anyone interested
> to state whether they are for or against this.
I think deleting them can actually affect Jim Breen as well, since he relies on the Japanese-English pairs, and not the Japanese sentences alone.
But it's not only a problem for Jim Breen. It can affect many users, it can even affect you. It all depends what situation you are in.
The reason why you want them to be deleted, if I understood correctly, is because:
- You search Japanese sentences.
- You often find that the English translations are bad.
- As a result, you don't trust the Japanese sentences and don't want to use them.
If we delete the bad English sentences, there are 2 situations.
1. You search Japanese sentences **translated into English**.
==> The Japanese sentences that were linked to the bad English sentences won't appear anymore.
==> As a result, your search results are not polluted by sentences you wouldn't trust.
==> But if the words you are searching for were only illustrated by these sentences that had bad English translations, you would get no result at all. Maybe you are fine with this trade-off, but maybe not everyone else is.
2. You search Japanese sentences **without specifying the target language**.
==> The Japanese sentences that were linked to the bad English will still appear.
==> But they won't have an English translation anymore, therefore whoever finds them and doesn't speak Japanese wouldn't even have an idea of what they mean.
I think what you actually need is a better ranking of the search results: sentences whose translations are all unadopted sentences should be displayed lower in the results, and sentences whose translations are all owned by trusted contributors should be displayed higher in the results.
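To make the ranking idea a bit more concrete, here is a minimal sketch in Python. It is not how Tatoeba's search actually works: the `Translation` and `Hit` classes, the `trust_score` and `rank_hits` functions, and the weights (1.0 for a trusted owner, 0.5 for any other owner, 0.0 for an unadopted sentence) are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Translation:
    text: str
    owner: Optional[str]  # None means the sentence is unadopted (hypothetical model)


@dataclass
class Hit:
    text: str
    translations: List[Translation] = field(default_factory=list)


def trust_score(hit: Hit, trusted: Set[str]) -> float:
    """Average trust over a hit's translations (weights are assumptions)."""
    if not hit.translations:
        return 0.0
    total = 0.0
    for t in hit.translations:
        if t.owner is None:
            total += 0.0   # unadopted translation
        elif t.owner in trusted:
            total += 1.0   # owned by a trusted contributor
        else:
            total += 0.5   # owned, but not by a trusted contributor
    return total / len(hit.translations)


def rank_hits(hits: List[Hit], trusted: Set[str]) -> List[Hit]:
    """Most trusted hits first; ties keep their original search order."""
    return sorted(hits, key=lambda h: trust_score(h, trusted), reverse=True)
```

With this kind of ordering nothing is deleted or hidden: a sentence whose only translations are unadopted simply sinks to the bottom of the results, so a search that would otherwise return nothing still returns something.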
Another thing we can consider doing as well is to mark all English unadopted sentences as unapproved. This joins CK's idea of displaying unadopted sentences in another color and would allow you to detect more easily sentences you want to trust and sentences you don't want to trust. I'm not entirely sure about this though. Even though the consequences are not as significant as deleting the sentences, there are still some consequences we need to consider.
Also, I would like to remind everyone that some people don't adopt quite good sentences just because they see some rude words or expressions. They may not use this kind of language in real life, but it would still be a good example of real colloquial speech.
"I was misquoted. I never said "if not zero"."
"I didn't read the original post carefully, since I find it tedious to read so much text that doesn't have proper capitalization, so I may have missed a few things."
One of the things you missed is that I only quoted you saying the percentage was "very low"; "if not zero" was outside of the quotation marks.
I'm sorry that my disuse of capitalisation is hard for you to read. I'll try to remember this when I post something in the future.
"[opinions on deleting unowned sentences or modifying the non-display feature]"
I like the idea of using colour coding. If I remember correctly, the last time we talked about this we already addressed the issue of colourblind people and agreed to use colour coding in combination with icons. I think this could be implemented very well.
Unfortunately, I wasn't successful in my request to have people clearly state "for deleting" or "against deleting", but as far as I read the answers correctly, for now everyone is against it because they think it needs to be done carefully, even if right now no one seems to have a specific idea on what this would mean.
I thank everyone who suggested other ways to solve my personal problem with the unadopted sentences in English and Japanese, via commenting on this thread or sending a pm, and I will try all the suggestions out and see what they can do for me.
However, and this is very important:
I have not primarily started this thread because I have a problem with my search results working with the Japanese corpus.
I have started it because Tatoeba right now has a significant problem that is frequently being addressed, and yet even after the last rather big discussion about the problem, we arrived at no result at all. The discussion just sort of ended.
I may be reading too much into this right now, but I am getting the impression that another important thing Tatoeba lacks right now is a sense of determined companionship, teamwork, call it whatever. To me, subjectively, most people on here appear to be doing their thing, trying to work around things they don't like, and through everyone's individual interest or belief in the project, things eventually do change, but much less than they could.
I can think of some prime examples of people just contributing their own stuff without a lot of context or interaction with others and then simply leaving, as well as of some examples of people who do work together to achieve something that just makes Tatoeba a better site as a whole, but the point here is not to scold or praise any individual people.
The point is to tell you that I think with all the competent people that we have on here, that are constantly involved each on their own, we could gain much more momentum if the completely open landscape of the site would develop a more intimate, closely-knit core. Right now, it appears to me like it's simply TRANG's site, and everyone else is just getting involved where they please, some more and some less, but generally unable to really get together to get something moving on a larger scale, and often times even against one another instead of together. A lot of energy simply evaporates, people shout their thoughts into the prairie and then mostly go on doing their own thing again. This is also the case with this thread: Everyone states their opinions, but there is no certainty at all that in the end *anything will be implemented*. The only concrete proposal was made by CK, and so far, no one has said anything about it.
What could be happening instead is having a thread started in a forum after choosing a single problem to work on (in this case, "What should we do with the large number of unowned sentences right now?") with a 100% goal of arriving at a decision on what to do with them within a week. This thread would not simply be started, but the problem would be chosen to be the next one on which a thread is started by the community.
I will try to find the time to create the poll I recently spoke of, and see if it finds any response and can actually start a change to this to a certain degree. But the paradoxical thing about this is that if I am the only one who thinks this should happen, I myself, as an individual, will not be able to make it happen.
I will therefore now open another thread with a link to this one (thanks for the suggestion, CK) where I will ask whether people are interested in this concept, and whether they think making such a poll would be worthwhile.
> frequently being addressed, and yet even after the last rather big discussion
> about the problem, we arrived at no result at all.
I wouldn't say "no result at all" :)
First, you got suggestions that could concretely solve your current frustration with the unadopted sentences. I would by the way be interested if using the advanced search works out for you.
And the discussion itself brought further confirmation regarding the question: why does it matter so much to improve the quality of the corpus and what can we do about it?
The main reasons that people have expressed so far, about why quality matters, are:
- Because bad sentences make the project look bad. They make us look unserious, and we won't attract more contributors if we don't look serious.
- Because bad sentences bring a bad user experience. When there are too many of them, it takes too much effort for users to find sentences they can rely on.
Your instincts are naturally telling you that solving the problem is a matter of removing these sentences. No more bad sentences, no more problem. It is indeed a possible solution, but you have to be aware that it is neither a sustainable nor a scalable solution. It's like taking a painkiller. It doesn't solve the root problem, but makes the pain go away for a short time. And it may have side effects that are worse than the initial pain.
If we want to address this issue in the long term, the questions we should actually try to answer are:
1. What are our criteria for good and bad quality?
2. How do we teach people to contribute sentences with better quality?
3. How do we make Tatoeba more resilient to bad quality?
These are to me quite difficult questions, but they are the questions we need to work on if we want to seriously solve the problem of quality.
One additional reason is that we aren't meeting one of the aims of the Tatoeba Project.
In a blog entry on 2009-11-28:
So the concept is : we gather a lot of data, try to organize it, ensure it is of good quality and make it freely accessible, downloadable and redistributable, so that anyone who has a great idea for a language learning application (or a language tool) can just focus on coding the application and rely on us to provide data of excellent quality.
> 1. What are our criteria for good and bad quality?
While there might be a gray area between what people consider good and bad in some cases, I think that many of us could likely agree that some of the items in the corpus are definitely wrong and should be eliminated.
There may always be disagreement on some points. For example: Is a sentence considered "good" if it's not what a native speaker would say, uses the wrong vocabulary choice or has grammar errors, but communicates the intended idea? Is a sentence considered "good" if it is utter nonsense, but is grammatically correct?
> 2. How do we teach people to contribute sentences with better quality?
One obvious way is to really encourage people to contribute in their native languages. It's very easy to sound natural in your own native language, and very easy to sound unnatural in your non-native language.
Even if some sentences by non-native speakers are good, it's really hard to trust that they are good, so members would be helping us much more by limiting their contributions to sentences in their own native languages.
> 3. How do we make Tatoeba more resilient to bad quality?
This is somewhat related to Number 2. If we increase the percentage of good quality sentences, then the bad sentences become more obvious, so members are less likely to just ignore them. If most members, or all members, resisted the urge to contribute in their non-native languages, and also kept encouraging others to do the same, we would have fewer incoming bad contributions.
We also need to make it very clear to new members that this is not a site similar to websites such as www.lang-8.com where the purpose is to have others correct what you have written in a language you are learning.
By allowing so many bad sentences to remain in the Tatoeba Corpus, things will likely get worse. The Broken Window Theory somewhat applies, I think. (https://en.wikipedia.org/wiki/B...windows_theory)
It's true that this has always been the big ambiguity of Tatoeba, which, at first, may look like a playground for learners and, as a result, is often perceived as such by newcomers.
>The Broken Window Theory somewhat applies, I think.
It does very much.
> (or a language tool) can just focus on coding the application and rely
> on us to provide data of excellent quality.
On that topic, it seems to me that quality is currently not a big issue for other projects that want to reuse our data. The quality of our content is good enough that third parties can start developing something while having tangible data to work with. Their main issue is that they need to do a lot of work on processing our data in order to tailor it to their needs (i.e. extracting only the sentences they need, restructuring it to fit into their system).
Therefore for this goal, the main priority would definitely not be improving the quality of the sentences, but rather providing tools to make it easier for other projects to reuse our data.
> While there might be a gray area between what people consider good and bad in
> some cases, I think that many of us could likely agree that some of the items
> in the corpus are definitely wrong and should be eliminated.
The gray area is the biggest issue though, isn't it? It seems to me that most of the unadopted English sentences are part of this gray area.
> One obvious way is to really encourage people to contribute in their native
My current impression is that most people already contribute mostly in their native languages and that people not contributing in their native languages are not significantly dragging down the quality of the corpus. I could be wrong, but it doesn't feel like we need to invest much more effort into this than we already do, since people generally already understand that they should contribute primarily in their native languages.
What I really meant when I asked "How do we teach people to contribute sentences with better quality" was how do we teach people to improve (the improvement is what I want to focus on), regardless of which language they are contributing in.
Things such as what process can someone go through in order to check the quality of their sentences, and what tools can they use to help them in that.
> If we increase the percentage of good quality sentences, then the bad sentences
> become more obvious, so members are less likely to just ignore them.
I don't think the percentage of good quality sentences has anything to do with the obviousness of bad sentences. Bad sentences become more obvious only when we have a clear and agreed definition for them.
Having a higher percentage of good quality sentences would rather trick people into believing that bad sentences are actually good, I think. If their impression is that 99.99% of the sentences are good, when they stumble upon a sentence that is in the remaining 0.01%, they would be less likely to question it.
> By allowing so many bad sentences to remain in the Tatoeba Corpus, things will
> likely get worse. The Broken Window Theory somewhat applies, I think.
I have a very different view of this. You're going with the assumption that more bad sentences means things are getting worse, and fewer bad sentences means things are getting better.
My assumption is that bad sentences are part of the deal. There will always be bad sentences being added and we can never stop it. So rather than spending effort on figuring out how to reduce the number of bad sentences, I would rather spend effort on trying to design a system in which no matter how many bad sentences you pour into it, it will still manage to deliver a good experience and a good service.
* The "OK" Rating (https://tatoeba.org/eng/collect.../ok/page:99999)
1. English sentences I recommend translating first.
(Currently this list: https://tatoeba.org/eng/sentences_lists/show/4000)
2. All English sentences I use in my projects.
(Currently this list: https://tatoeba.org/eng/sentences_lists/show/907)
* The "Unsure" Rating
3. Sentences I've chosen not to use for now and am unlikely to ever use, but I may come back and review them again for possible use.
4. Sentences I've ignored and don't want to rate. Some of these are just automatically filtered out because they are too long, contain certain common errors, or for some other reason.
* The "Not OK" Rating
5. Sentences I'm very, very unlikely to ever use. I don't plan to go back and review these again.
For all the lists used for 3 through 5, see http://bit.ly/tatoebafiltering if you are interested.
** Why? **
This would make it a lot easier for me to filter-in and filter-out sentences that I use on my own projects, since when viewing sentences I can easily see which of the 3 groups a sentence is in, and whether I've "rated" it already or not. (http://prntscr.com/a1ou23)
The problem is, of course, that the current "OK", "Unsure", and "Not OK" words don't really represent what I would mean by my "ratings."
This is still not ideal for what I would like to do, but would be a way that I could more efficiently use tatoeba.org as it is.
should always have the same basic meaning.
As far as I understand, their meanings are approximately as follows.
According to the standards commonly respected in the language concerned.
Something seems (to me) not quite right.
Somebody (more competent than me) should check whether this sentence is OK or not.
Not according to the standards commonly respected in the language concerned.
In a comment beneath, I specified why and/or proposed necessary changes.
You may have better definitions, and every definition permits some interpretation (what is OK for one person, may be hardly acceptable for another one). But at any rate, I'm in favor of finding and sticking to common definitions. Otherwise the whole classification would become pointless, wouldn't it?
I think that now that Tatoeba has grown into a community of decent size, with some fairly constantly committed members, it is time to think of creating a team of core contributors.
Many ideas are presented on this wall, but they are often so uncoordinated that they are lost in the chaos of everyone just saying what they think should be done without any possibility of them actually doing it because in the end, they have no say in what is done with the site. There are no clearly assigned roles as to who is able to decide what.
To make better use of the people's ideas, I think we need to have some kind of interface where a discussion can be started with the fixed goal that at the end of the discussion, all suggestions are taken into account and a decision is made.
To be able to do this, I suggest we form a "family" of experienced and trustworthy Tatoebans who are familiar with each other's competences, willing to pick issues to work on in an organised manner, and then working together to make it happen.
For a more in-detail explanation of why I think we need this, see https://tatoeba.org/fra/wall/sh...#message_25456
I am thinking of taking the time to create a poll where people can say they would be willing to be members of this core community.
Do you think this would be a good idea?
If yes, what else do you think should be included in the poll?
Please let me know. Thank you.
> To make better use of the people's ideas, I think we need to have some kind of interface where a discussion can be started with the fixed goal that at the end of the discussion, all suggestions are taken into account and a decision is made.
I agree. This reminds me of the forum idea: https://tatoeba.org/wall/show_m...#message_19996
However, I think the tool is not the problem. If we’re unable to decide upon what to do after discussing a topic on the Wall, what would make using a different tool different? And like Trang said, how do we prioritize tasks? How do we gather people’s opinions in an efficient and relevant way? Since everyone has their own personal interests, I think it’s rather a political issue.
it deleted the oldest one and kept the most recent one.
Not only does it take credit for the translations, it can't even do its basic job of deduplication correctly.
Personally, I find there are so many problems with this routine that I am in favor of suspending it, pending fixes and exhaustive tests in a development environment.
Actually, this feature has been around for some time but not announced. It may seem like a very picky change, but having furigana properly aligned is very helpful for learners of the Japanese language. It’s also the only proper way of displaying furigana, as can be observed in any Japanese book, newspaper, placard… which hopefully makes Tatoeba look a little bit more serious among Japanese people.
First, you need to check "Always show transcriptions and alternative scripts" on the settings page to see machine-generated furigana. Note that these transcriptions with a warning sign are not always correct. Transcriptions without a warning sign have been added by human contributors and are much more likely to be correct. These manual transcriptions are also found on the downloads page. I plan to provide furigana for all the sentences I've written and proofread.
As gillux says, almost all furigana are now associated individually with each kanji, but some of the kanji compounds called 熟字訓 are exceptions. For example, there's a word 明日 (ashita, あした) at the beginning of gillux's example above. It's not that 明 reads "ashi" and 日 reads "ta", or 明 "a" and 日 "shita", so the three hiragana are placed evenly above two kanji. On the other hand, when one or more of the kanji are read the normal way, the furigana is divided as in normal compounds. For example, the reading of the word 時計 (tokei, とけい) is special because 時 doesn't have the reading "to". However, since 計 does normally read "kei", the furigana と is placed on top of 時 and けい on top of 計.
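For anyone curious how the two cases differ in markup, here is a small Python sketch that renders them with the standard HTML `<ruby>` and `<rt>` elements. It is only an illustration, not Tatoeba's actual code: the `to_ruby` function and the list-of-pairs input format are assumptions. Each pair holds a run of base text and its reading, with `None` for text that needs no furigana.

```python
from typing import List, Optional, Tuple

# One segment = (base text, reading or None). Hypothetical format for this sketch.
Segment = Tuple[str, Optional[str]]


def to_ruby(segments: List[Segment]) -> str:
    """Render segments as HTML ruby markup, one <ruby> element per annotated segment."""
    parts = []
    for base, reading in segments:
        if reading is None:
            parts.append(base)  # plain kana or punctuation, no furigana
        else:
            parts.append(f"<ruby>{base}<rt>{reading}</rt></ruby>")
    return "".join(parts)


# 熟字訓: the reading belongs to the compound as a whole, so it stays one segment.
ashita = to_ruby([("明日", "あした")])

# 時計: 計 is read the normal way, so the reading is split per kanji.
tokei = to_ruby([("時", "と"), ("計", "けい")])
```

The point of keeping 明日 as a single segment is that the browser then spreads あした evenly over both kanji, while 時計 gets と over 時 and けい over 計.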
I think this new system will be most useful when you're looking for sentences where a specific kanji is read in a specific way. If you're interested in doing this kind of search on the website, reply to this message to let developers know.
But as a result, the random sentences feature will not be working either in the meantime, because it relies on the search engine.
Sorry for the inconvenience!