pullnosemans, 2016-02-08 05:18:41 UTC

**deleting unowned english sentences**
**introducing a feature to keep unowned translations from being displayed**


to extract one central point from the recent big discussion about improving the quality of the tatoeba corpus, I would like to again address separately the issue of deleting the unowned english sentences (and possibly, the japanese and/or french ones).

ck has stated that they have checked the entire english corpus and adopted all sentences they deemed worthwhile, so that the percentage of good unadopted english sentences right now should be "very low", if not zero.


*since the discussion about deleting the unowned sentences has not yielded any real results, I hereby request that they be deleted, and ask anyone interested to state whether they are for or against this.*


I want them to be deleted because it often happens that I search for a japanese word, and most if not all hits have only one direct translation in one of the languages tatoeba displays to me, english. these translations are frequently unowned sentences from the tanaka corpus, and they feel very unnatural to me. they also include many prime examples of my recently mentioned problem of tatoeba sentences lacking context (e.g. "he shuddered at the sight.": who? what sight?).
this generally makes me abandon the japanese sentence they are linked to as well because my japanese is not good enough to judge whether a sentence is good or not.
this process is always frustrating, because I have to read through sentence after sentence only to find that the english translation is rubbish, or at least untrustworthy. I would much prefer not having these translations displayed at all and speeding up my search by being able to only pay attention to sentences that are usable.


the situation is more complicated with japanese, where deleting all unowned sentences would mean losing 60-70% of the corpus, so I cannot really say anything on this one.
with french, I have only read one clearly stated opinion, which was sacredceltic's, saying that he would rather have them deleted right away because they are "a huge smear on the french corpus".


alternatively, I call for the feature of not displaying orphan sentences by default to be extended to direct and indirect translations of owned sentences. this way, even if the bad sentences remain, they are easy to avoid entirely.
another alternative would be not deleting the sentences, but simply hiding them from public display until a decision is made on how to treat them.

the only option I think we should definitely NOT go for is simply leaving things as they are.

odexed, 2016-02-08 05:31:38 UTC

> so that the percentage of good unadopted english sentences right now should be "very low", if not zero.

If so, I wonder why there are still unowned English sentences tagged 'OK'

https://tatoeba.org/eng/sentenc...io=&sort=words

The same for Japanese
https://tatoeba.org/eng/sentenc...io=&sort=words

CK, 2016-02-08 08:05:36 UTC, edited 2019-10-30 10:41:19 UTC

[not needed anymore- removed by CK]

sacredceltic, 2016-02-09 00:50:54 UTC

>I personally don't think just "lack of grammar errors" is enough for an OK tag.

It's funny how a simple tag and judgment such as "OK", or a behaviour such as adopting a sentence can mean so many different things, since they've never been clearly defined.

Personally, my policy is the following: in the absence of a clear definition (which I might not accept once defined...), I adopt only sentences that "I would say". The exception is when I create a sentence in Tatoeba that I wouldn't say myself but which, in my view, needs to be documented on the Internet because it is living but rare or interesting, and/or local, and I have confirmed its existence with local natives. (For instance, I wouldn't myself say a lot of the sentences that I tagged "Belgian-French", but I am sure they do exist: I not only heard them but cross-checked them with locals, asking them to provide examples of use in context.)

But I OK-tag sentences that I know are OK, even though I wouldn't say them myself.
The nuance might seem minor, but that's actually a hell of a difference, because there are lots of sentences in my native language that I know are OK but which I don't use myself.
I don't tag as "OK", though, sentences that I disapprove of, although they might represent some kind of "valid" language...

So I apply a different policy when it comes to adoption and OK-tagging.

TRANG, 2016-02-09 09:41:05 UTC

> since the discussion about deleting the unowned sentences has not yielded any
> real results, I hereby request that they be deleted, and ask anyone interested
> to state whether they are for or against this.

I think deleting them can actually affect Jim Breen as well, since he relies on the Japanese-English pairs, and not the Japanese sentences alone.
But it's not only a problem for Jim Breen. It can affect many users, it can even affect you. It all depends what situation you are in.

The reason why you want them to be deleted, if I understood correctly, is because:
- You search Japanese sentences.
- You often find that the English translations are bad.
- As a result you don't trust the Japanese sentences and don't want to use them.

If we delete the bad English sentences, there are 2 situations.

1. You search Japanese sentences **translated into English**.
==> The Japanese sentences that were linked to the bad English sentences won't appear anymore.
==> As a result your search results are not polluted by sentences you wouldn't trust.
==> But if the words you are searching for were only illustrated by these sentences that had bad English translations, you would get no result at all. Maybe you are fine with this trade-off, but maybe not everyone else is.

2. You search Japanese sentences **without specifying the target language**.
==> The Japanese sentences that were linked to the bad English will still appear.
==> But they won't have an English translation anymore, therefore whoever finds them and doesn't speak Japanese wouldn't even have an idea of what they mean.

I think what you actually need is a better ranking of the search results: sentences whose translations are all unadopted should be displayed lower in the results, and sentences whose translations are all owned by trusted contributors should be displayed higher in the results.
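
As a rough illustration of this ranking idea, here is a minimal Python sketch. It is not how Tatoeba's search works; the `Translation` and `Result` classes, the `trusted` set of usernames, and the three-tier scheme are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Translation:
    text: str
    owner: Optional[str]  # None means the sentence is unadopted (hypothetical model)


@dataclass
class Result:
    text: str
    translations: list = field(default_factory=list)


def trust_tier(result, trusted):
    """Return a sort key: lower tiers are displayed first."""
    owners = [t.owner for t in result.translations]
    if owners and all(o in trusted for o in owners):
        return 0  # every translation is owned by a trusted contributor
    if owners and all(o is None for o in owners):
        return 2  # every translation is unadopted
    return 1  # mixed ownership, or no translations at all


def rank(results, trusted):
    # sorted() is stable, so the original relevance order is kept within a tier.
    return sorted(results, key=lambda r: trust_tier(r, trusted))
```

Nothing is deleted or hidden here: sentences whose translations are all unadopted still appear, only further down the list.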

Another thing we can consider doing as well is to mark all unadopted English sentences as unapproved. This ties in with CK's idea of displaying unadopted sentences in another color and would let you tell more easily which sentences you want to trust and which you don't. I'm not entirely sure about this though. Even though the consequences are not as significant as deleting the sentences, there are still some consequences we need to consider.

odexed, 2016-02-09 10:34:01 UTC

I think it could be some kind of solution if pullnosemans used 'Advanced search' where he can set the option 'Is orphan' to 'No' on the right side (concerning Translations) so that he wouldn't see any unadopted English translations in the search results.
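
To make the effect of such a filter concrete, here is a small Python sketch of hiding unadopted translations from a result list. The data layout (a list of `(sentence, translations)` pairs with `None` as the owner of an unadopted sentence) is an assumption for illustration, not the site's actual data model or the implementation of the 'Is orphan' option.

```python
def hide_orphan_translations(results):
    """Keep only owned translations; drop results left with none.

    `results` is a list of (sentence, translations) pairs, where each
    translation is a (text, owner) tuple and owner is None if unadopted.
    """
    filtered = []
    for sentence, translations in results:
        owned = [(text, owner) for text, owner in translations if owner is not None]
        if owned:
            filtered.append((sentence, owned))
    return filtered
```

Note that a sentence whose only translations are unadopted disappears from the results entirely, which is the trade-off TRANG describes in situation 1 above.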

I would also like to remind everyone that some people don't adopt quite good sentences just because they contain rude words or expressions. They may not use this kind of language in real life, but such sentences would still be good examples of real colloquial speech.

pullnosemans, 2016-02-10 02:19:37 UTC, edited 2016-02-10 05:41:41 UTC

**to CK**
"I was misquoted. I never said "if not zero"."
"I didn't read the original post carefully, since I find it tedious to read so much text that doesn't have proper capitalization, so I may have missed a few things."

One of the things you missed is that I only quoted you saying the percentage was "very low"; "if not zero" was outside of the quotation marks.
I'm sorry that my disuse of capitalisation is hard for you to read. I'll try to remember this when I post something in the future.


"[opinions on deleting unowned sentences or modifying the non-display feature]"

I like the idea of using colour coding. If I remember correctly, the last time we talked about this we already addressed the issue of colourblind people and agreed to use colour coding in combination with icons. I think this could be implemented very well.




**to everyone**

Unfortunately, I wasn't successful in my request to have people clearly state "for deleting" or "against deleting", but if I read the answers correctly, for now everyone is against it because they think it needs to be done carefully, even if right now no one seems to have a specific idea of what this would mean.

I thank everyone who suggested other ways to solve my personal problem with the unadopted sentences in English and Japanese, via commenting on this thread or sending a pm, and I will try all the suggestions out and see what they can do for me.


However, and this is very important:
I have not primarily started this thread because I have a problem with my search results working with the Japanese corpus.
I have started it because Tatoeba right now has a significant problem that is frequently being addressed, and yet even after the last rather big discussion about the problem, we arrived at no result at all. The discussion just sort of ended.


I may be reading too much into this right now, but I am getting the impression that another important thing Tatoeba lacks right now is a sense of determined companionship, teamwork, call it whatever. To me, subjectively, most people on here appear to be doing their thing, trying to work around things they don't like, and through everyone's individual interest or belief in the project, things eventually do change, but much less than they could.
I can think of some prime examples of people just contributing their own stuff without a lot of context or interaction with others and then simply leaving, as well as of some examples of people who do work together to achieve something that just makes Tatoeba a better site as a whole, but the point here is not to scold or praise any individual people.
The point is to tell you that I think with all the competent people that we have on here, that are constantly involved each on their own, we could gain much more momentum if the completely open landscape of the site would develop a more intimate, closely-knit core. Right now, it appears to me like it's simply TRANG's site, and everyone else is just getting involved where they please, some more and some less, but generally unable to really get together to get something moving on a larger scale, and often times even against one another instead of together. A lot of energy simply evaporates, people shout their thoughts into the prairie and then mostly go on doing their own thing again. This is also the case with this thread: Everyone states their opinions, but there is no certainty at all that in the end *anything will be implemented*. The only concrete proposal was made by CK, and so far, no one has said anything about it.

What could be happening instead is having a thread started in a forum after choosing a single problem to work on (in this case, "What should we do with the large number of unowned sentences right now?") with a 100% goal of arriving at a decision on what to do with them within a week. This thread would not simply be started; the community would first choose the problem as the next one to get a thread.

I will try to find the time to create the poll I recently spoke of, and see if it finds any response and can actually start a change to this to a certain degree. But the paradoxical thing about this is that if I am the only one who thinks this should happen, I myself, as an individual, will not be able to make it happen.
I will therefore now open another thread with a link to this one (thanks for the suggestion, CK) where I will ask whether people are interested in this concept, and whether they think making such a poll would be worthwhile.

TRANG, 2016-02-12 20:16:46 UTC

> I have started it because Tatoeba right now has a significant problem that is
> frequently being addressed, and yet even after the last rather big discussion
> about the problem, we arrived at no result at all.

I wouldn't say "no result at all" :)

First, you got suggestions that could concretely solve your current frustration with the unadopted sentences. By the way, I would be interested to hear whether using the advanced search works out for you.

And the discussion itself brought further confirmation regarding the question: why does it matter so much to improve the quality of the corpus and what can we do about it?

The main reasons that people have expressed so far, about why quality matters, are:

- Because bad sentences make the project look bad. They make us look unserious, and we won't attract more contributors if we don't look serious.
- Because bad sentences bring a bad user experience. When there are too many of them, it takes too much effort for users to find sentences they can rely on.

Your instincts are naturally telling you that solving the problem is a matter of removing these sentences. No more bad sentences, no more problem. It is indeed a possible solution, but you have to be aware that it is neither a sustainable nor a scalable solution. It's like taking a painkiller. It doesn't solve the root problem, but makes the pain go away for a short time. And it may have side effects that are worse than the initial pain.

If we want to address this issue in the long term, the questions we should actually try to answer are:

1. What are our criteria for good and bad quality?
2. How do we teach people to contribute sentences with better quality?
3. How do we make Tatoeba more resilient to bad quality?

These are to me quite difficult questions, but they are the questions we need to work on if we want to seriously solve the problem of quality.

CK, 2016-02-13 01:55:37 UTC, edited 2019-10-30 10:40:40 UTC

[not needed anymore- removed by CK]

sacredceltic, 2016-02-13 10:12:23 UTC

>We also need to make it very clear to new members that this is not a site similar to websites such as www.lang-8.com where the purpose is to have others correct what you have written in a language you are learning.

It's true that this has always been the big ambiguity of Tatoeba, which, at first, may look like a playground for learners and is often perceived as such by newcomers as a result.

>The Broken Window Theory somewhat applies, I think.

It does very much.

TRANG, 2016-02-13 16:13:59 UTC

> so that anyone who has a great idea for a language learning application
> (or a language tool) can just focus on coding the application and rely
> on us to provide data of excellent quality.

On that topic, it seems to me that quality is currently not a big issue for other projects that want to reuse our data. The quality of our content is good enough that third parties can start developing something while having tangible data to work with. Their main issue is that they need to do a lot of work processing our data in order to tailor it to their needs (i.e. extracting only the sentences they need, restructuring it to fit into their system).
Therefore, for this goal, the main priority would definitely not be improving the quality of the sentences, but rather providing tools to make it easier for other projects to reuse our data.
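
To give an idea of the kind of processing a third party has to do, here is a minimal Python sketch that extracts translation pairs for one language pair. The input format is an assumption for this example: tab-separated sentence lines of the form `id<TAB>language code<TAB>text`, plus a list of `(id, id)` links. The function names are hypothetical as well.

```python
import csv
import io


def load_sentences(tsv_text):
    """Parse 'id<TAB>lang<TAB>text' lines into {id: (lang, text)}."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {int(sid): (lang, text) for sid, lang, text in reader}


def extract_pairs(sentences, links, src, dst):
    """Return (source text, target text) for every link going from src to dst."""
    pairs = []
    for a, b in links:
        if a in sentences and b in sentences:
            lang_a, text_a = sentences[a]
            lang_b, text_b = sentences[b]
            if lang_a == src and lang_b == dst:
                pairs.append((text_a, text_b))
    return pairs
```

A tool provided on our side could do this step once, so that each project does not have to rewrite it.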


> While there might be a gray area between what people consider good and bad in
> some cases, I think that many of us could likely agree that some of the items
> in the corpus are definitely wrong and should be eliminated.

The gray area is the biggest issue though, isn't it? It seems to me that most of the unadopted English sentences are part of this gray area.


> One obvious way is to really encourage people to contribute in their native
> languages.

My current impression is that most people already contribute mostly in their native languages and that people not contributing in their native languages are not significantly dragging down the quality of the corpus. I could be wrong, but I don't feel that we need to invest much more effort into this than we already do, since people generally understand already that they should give priority to contributing in their native languages.

What I really meant when I asked "How do we teach people to contribute sentences with better quality" was how do we teach people to improve (the improvement is what I want to focus on), regardless of which language they are contributing in.
This covers things such as what process someone can go through in order to check the quality of their sentences, and what tools they can use to help them with that.


> If we increase the percentage of good quality sentences, then the bad sentences
> become more obvious, so members are less likely to just ignore them.

I don't think the percentage of good quality sentences has anything to do with the obviousness of bad sentences. Bad sentences become more obvious only when we have a clear and agreed definition for them.

Having a higher percentage of good quality sentences would rather trick people into believing that bad sentences are actually good, I think. If their impression is that 99.99% of the sentences are good, then when they stumble upon a sentence that is in the remaining 0.01%, they would be less likely to question it.


> By allowing so many bad sentences to remain in the Tatoeba Corpus, things will
> likely get worse. The Broken Window Theory somewhat applies, I think.

I have a very different view on this. You're going with the assumption that more bad sentences means things are getting worse, and fewer bad sentences means things are getting better.

My assumption is that bad sentences are part of the deal. There will always be bad sentences being added and we can never stop it. So rather than spending effort on figuring out how to reduce the number of bad sentences, I would rather spend effort on trying to design a system in which no matter how many bad sentences you pour into it, it will still manage to deliver a good experience and a good service.

CK, 2016-02-15 22:21:12 UTC, edited 2019-10-30 10:40:20 UTC

[not needed anymore- removed by CK]

sacredceltic, 2016-02-15 23:03:21 UTC

>we have a number of advanced contributors and even corpus maintainers who contribute in way too many languages.

I fully agree with this.

>Even if some sentences by non-native speakers are good, it's really hard to trust that they are good, so members would be helping us much more by limiting their contributions to sentences in their own native languages.

However, it's part of the attraction of such a service to play a bit with the languages you like or usually practice in your life. We have to balance this, because not many people want to translate only into their own "main" language. Actually, some don't want to do it because the sentences they produce in their own native language are not too good either...
The problem with multilingual people is that they are the very people who are interested in languages, but they are not the best in any one given language, because they're often somewhat mixed up between cultures...