Aiji
2019-02-02 03:12
Can the admins explain why bots are authorized now?

I believe that past experience has shown that it is generally not a good idea, the process being more often harmful to corpora than helpful. The reason is simple: the creators of such bots are less motivated by making a good contribution to the project than by their own interest. A respectful way to use bot-generated sentences would be to generate them on your local platform, proofread them and upload only the correct ones.

It makes me sad that when people invest time and energy to increase the global quality of a corpus, suddenly a random Johnny comes and inserts thousands of sentences beating that down, with zero effort, not showing the minimum respect of proofreading them. (Voltaire + Maxence = 11K, although one of them is supposed to be human)

Now I see that Voltaire has been red-marked. I guess that is CK's doing after one of my comments yesterday. That still does not explain why it was accepted at first :)
PaulP
2019-02-02 10:39
I agree. There were hundreds of wrong sentences. Even just putting the @change flag on them costs other people a lot of wasted time.
TRANG
2019-02-04 17:21
> Can the admins explain why bots are authorized now?

Bots were never forbidden on Tatoeba. We actually use a bot ourselves: Horus.

As long as the contributions comply with our quality standards and as long as our server can support the load, it shouldn't matter whether these contributions are from a bot or from a human.

Voltaire was red-marked by myself. CK usually won't meddle with sentences in languages other than English.

Voltaire was also "accepted" by myself, but "accepted" is not exactly the correct word. We technically cannot know whether an account is run by a bot or by a human, so we can neither give our approval to bots nor stop people from using them.

In the case of Voltaire, the author was transparent and contacted us to ask about our policies on bots. I told him what I told you above: it doesn't matter if it's a bot or a human who adds the sentence as long as the quality is good and the quantity doesn't slow down the website.

He initially wanted to run a translation bot, but then shifted to a bot that simply adds sentences in French, probably because he was not confident in the quality of the translation bot.

Since he was transparent with us and had acknowledged the quantity and quality requirements, I did not suspend his bot right off the bat. I did, of course, tell him that his bot was not adding good sentences, and after a couple of days, asked him to stop the bot and start cleaning up the sentences instead of adding more of them. He did stop the bot but did not do much about the cleanup.

> It makes me sad that when people invest time and energy to increase the global
> quality of a corpus, suddenly a random Johnny comes and inserts thousands of
> sentences beating that down, with zero effort, not showing the minimum respect
> of proofreading them.

Note that it equally takes zero effort to delete all sentences from a specific account. There is, in my opinion, no reason to be upset about massive addition of bad sentences.
deniko
2019-02-04 17:36
> There is, in my opinion, no reason to be upset about massive addition of bad sentences.

Depends on the person, I guess.

I would be rather upset if someone started adding a lot of bad Ukrainian sentences. I feel like the quality of the Ukrainian corpus is my responsibility, and that would definitely hurt, especially if I felt like there were so many of them that I couldn't really proofread them.

TRANG
2019-02-04 18:01
Then perhaps there is a misconception about what the role or responsibility of a corpus maintainer is.

Keep in mind that every contributor is responsible for the quality of the corpus, not just corpus maintainers. You have a little bit more power to maintain a good level of quality, but that doesn't mean everything is on your shoulders.

You are in no way obligated to proofread all the Ukrainian sentences that are added to Tatoeba. You proofread what you can, what you want, and at your own pace. If there's a user who has added too many bad contributions, your first reflex should be to report the user, not to try to fix what they did wrong.
deniko
2019-02-04 18:05
> You are in no way obligated to proofread all the Ukrainian sentences

I know that, thank you.
marafon
2019-02-05 06:56
> He initially wanted to run a translation bot

Unfortunately, he did run it and we got hundreds of bad links and sentences like this:

https://tatoeba.org/rus/sentences/show/7701109
Quel Pachuca pour Toluca ?
https://tatoeba.org/rus/sentences/show/7702274
C'est la nouvelle secrétaire et elle fait tout pour le chingadazo.
https://tatoeba.org/rus/sentences/show/7701955
Où nous neiges en janvier.
https://tatoeba.org/rus/sentences/show/7702300
Nous avons souvent paviamos.
https://tatoeba.org/rus/sentences/show/7699888
Prenez cette pilule. Ça t'aidera à dormir.
https://tatoeba.org/rus/sentences/show/7698693
Viens avoir soif, s'il te plaît.
https://tatoeba.org/rus/sentences/show/7701273
DepuDepuis quand avez-vous remarqué le mouvement du fœtus ?is quanDepuis quand avez-vous remarqué le mouvement du fœtus ?d avez-vous remarqué le mouvement du fœtus ?
etc.

p.s. Merci @Aiji.
AlanF_US
2019-02-04 19:43
> Note that it equally takes zero effort to delete all sentences from a specific account.
> There is, in my opinion, no reason to be upset about massive addition of bad sentences.

It may take a small amount of effort for an admin with the proper knowledge and permissions to execute a command that deletes all sentences from a specific account. (Maybe there are two or three such admins.) However, it takes a significant amount of effort to:
- identify the problem
- contact an admin to report it
- decide how to go about solving it
- figure out how to stop additional people from wasting their time reporting or trying to resolve the problem
- figure out how to ignore the bad sentences and look for good ones to translate (or answer a query)

Even more effort is involved if you want to selectively delete only some of the sentences from an account.

I think it would be a good idea for someone to come up with guidelines about:
- the maximum number of sentences that should be contained in a single mass-import spreadsheet
- the maximum rate at which sentences should be added (by a human or bot)
- the maximum number of egregiously bad sentences from a contributor (human or bot) before we temporarily suspend them

And I propose that unless and until we have a quick-response team, we should go on record as having a policy of encouraging mass import (preceded by proofreading) and discouraging or even disallowing bots other than official Tatoeba ones like Horus. I realize that at present, our mass import function is not working, so that means no batch introduction of sentences. But that seems reasonable to me until we're in a better state.
Aiji
2019-02-05 12:45
> And I propose that unless and until we have a quick-response team, we should go on
> record as having a policy of encouraging mass import (preceded by proofreading) and
> discouraging or even disallowing bots other than official Tatoeba ones like Horus.

Alleluia.


My problem is that I tend to go with the last option that you proposed (cleaning the evil to keep the good). While we could simply massively cleanse everything...
TRANG
2019-02-06 20:56
On the topic of bots.

We cannot reliably distinguish between a bot and a human. There is no easy programmatic way to block bots from the website.

We can rely on some clues, such as the speed at which a user adds sentences (it's unlikely that a human can add more than 1 sentence per second), and we can notice patterns in sentences that are likely to be from bots, but those are human assessments and require some initial inputs from the user.
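To make the speed clue concrete, here is a minimal Python sketch. It is only an illustration, not anything Tatoeba actually runs: the function name `looks_automated` and the `min_run` threshold are hypothetical, and the one-second gap is the only value taken from the discussion above.

```python
from datetime import datetime, timedelta


def looks_automated(timestamps, min_gap=timedelta(seconds=1), min_run=5):
    """Return True if at least `min_run` consecutive additions were each
    made less than `min_gap` after the previous one.

    `min_run` is an assumed threshold, chosen so that a single quick
    double-submit by a human is not flagged.
    """
    ordered = sorted(timestamps)
    run = 1  # length of the current streak of fast additions
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous < min_gap:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 1  # streak broken, start counting again
    return False
```

A result of True would only be a hint that a human should take a look. A bot that sleeps between requests would pass this check, which is why it cannot serve as a reliable filter.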

We could rely on some technical clues (such as the user agent), but if bot creators wanted to bypass our restrictions, they could always figure something out.

Trying to restrict access to the website based on the nature of the user (bot or human) is not a productive approach. It would be a time-consuming and never-ending war.

One thing we can and should do, regardless of whether the contributor is a bot or a human, is to regulate the number of sentences added within a period of time. For instance, no more than 10 sentences per minute and no more than 1000 sentences per day. These numbers could be lower for newer contributors.

This is what you suggested (maximum rate at which sentences should be added) and has been suggested as well in https://github.com/Tatoeba/tatoeba2/issues/1492. There is not yet a final decision on how we'll implement this or what the limits will be, so feel free to comment over there if you have more specific ideas.
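For anyone who wants to reason about the numbers, a per-account limit of 10 sentences per minute and 1000 per day could be sketched roughly as follows. This is an assumption-laden illustration in Python, not the implementation discussed in the GitHub issue: the class name `AdditionLimiter`, its in-memory storage and the sliding-window approach are all hypothetical.

```python
from collections import deque

DAY = 86400  # seconds in a day


class AdditionLimiter:
    """Sliding-window limit on how many sentences one account may add."""

    def __init__(self, per_minute=10, per_day=1000):
        # (window length in seconds, maximum additions inside that window)
        self.limits = [(60, per_minute), (DAY, per_day)]
        self.history = {}  # username -> deque of accepted timestamps

    def allow(self, username, now):
        """Record and accept an addition at time `now` (in seconds),
        or return False if it would exceed one of the limits."""
        events = self.history.setdefault(username, deque())
        # Forget additions that fall outside the longest window.
        while events and now - events[0] >= DAY:
            events.popleft()
        for window, limit in self.limits:
            recent = sum(1 for t in events if now - t < window)
            if recent >= limit:
                return False
        events.append(now)
        return True
```

Lower limits for newer contributors would then just be a matter of constructing the limiter with smaller values, for example `AdditionLimiter(per_minute=2, per_day=100)` (numbers made up for the example).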

You also mentioned other limitations:

> maximum number of sentences that should be contained in a single
> mass-import spreadsheet

This should rather be limited by the maximum size of the file that can be uploaded (1 MB for instance).

> the maximum number of egregiously bad sentences from a contributor
> (human or bot) before we temporarily suspend them

I don't think deciding on such a number would change our current processes much. We usually suspend users when they're reported to us and I think it's fine to keep relying on our intuition for that.
TRANG
2019-02-06 21:14
On the topic of how we deal with bad contributions.

You've mentioned that it takes effort to go from the identification of bad contributors to the resolution of the problem.

I don't think the effort really grows proportionally with the number of contributions. It can happen that the effort spent on dealing with a user who has 100 contributions is higher than the effort spent on dealing with a user who has 1000 contributions.

Now if I go through specifically each of your points where effort is needed:

- to identify the problem → Normally, a quick glance at a user's sentences should be enough to identify the problem. If it takes more than two minutes to decide if a user should be reported, then they are probably not that bad.
- to contact an admin to report it → Here as well it shouldn't take too much effort from the reporter. Admins would just need three pieces of information: the username, the languages affected, and a rough estimate of the percentage of bad sentences in each language.
- to decide how to go about solving it → I'd say we already have a process for this: we suspend the user and ask them to correct their sentences. If the user doesn't do anything, we mark all their sentences as unapproved. If the user corrects their sentences, we unsuspend them.
- to figure out how to stop additional people from wasting their time reporting or trying to resolve the problem → Suspending a user should be enough to let people know that a user has already been reported. One improvement we could add is to display a warning somewhere on the sentence's page, so that people are more easily aware the user has been reported.
- to figure out how to ignore the bad sentences and look for good ones to translate (or answer a query) → That's more of a problem for people who translate the latest sentences, but shouldn't be a huge issue for people who look for random sentences or for older sentences first. Here as well we can implement some improvement to let people know how old a sentence is, and from there they can decide if they want to risk translating it.
- to selectively delete only some of the sentences from an account → If we ever wanted to selectively delete only some sentences, it would mean that the contributions are somewhat worthy and we have mixed feelings about their uselessness. If we consider that the contributions are only adding extra work that is not worth our time, we would not consider selectively deleting sentences.

When it comes to bots, I don't think we should have much remorse wiping out all sentences. But I must say (again) that I don't think deleting sentences will always help us.
If we keep the sentences and just mark them as unapproved, they cannot be re-added (at least as long as the deduplication feature works properly). If we delete the sentences, there's always a risk that the bot owner runs their bot again to re-add the same sentences. We could then delete them again, but the cycle can continue as long as the bot owner finds ways to bypass whatever restrictive measures we put in place.
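The difference between the two options can be shown with a toy model. The Python sketch below is purely illustrative: the `Corpus` class and its methods are hypothetical, and it assumes that deduplication means "reject any sentence whose language and text already exist, whether approved or not", which may be simpler than what the site really does.

```python
class Corpus:
    """Toy corpus that rejects exact duplicates (same language and text)."""

    def __init__(self):
        # (language, text) -> {"owner": username, "approved": bool}
        self.sentences = {}

    def add(self, lang, text, owner):
        """Add a sentence; return False if an identical one already exists."""
        key = (lang, text)
        if key in self.sentences:
            return False  # unapproved sentences still count as duplicates
        self.sentences[key] = {"owner": owner, "approved": True}
        return True

    def unapprove_all(self, owner):
        """Keep the owner's sentences but mark them all as unapproved."""
        count = 0
        for sentence in self.sentences.values():
            if sentence["owner"] == owner:
                sentence["approved"] = False
                count += 1
        return count

    def delete_all(self, owner):
        """Remove every sentence that belongs to the owner."""
        keys = [k for k, s in self.sentences.items() if s["owner"] == owner]
        for key in keys:
            del self.sentences[key]
        return len(keys)
```

In this model, after `unapprove_all("bot")` a second run of the bot gets False from `add` for every sentence it already contributed, while after `delete_all("bot")` the same run succeeds again. The same duplicate check also applies to every other account, so an unapproved sentence blocks anyone from adding that exact text.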
CK
2019-02-07 01:02
> If the user doesn't do anything, mark all sentences as unapproved.
> ...
> But I must say (again) that I don't think deleting sentences will always help us. If we keep the sentences and just mark them as unapproved, they cannot be re-added (at least as long as the deduplication feature works properly)

In my opinion, it would help a lot to delete all the sentences by such a member for the following reason.

Having a "good" sentence from an untrustworthy source prevents a trustworthy member from adding the same sentence. This is also one good reason to ask members to limit themselves to contributing sentences in their own native languages. It's a lot easier to trust that a sentence is correct and natural-sounding if it's from a native speaker.


TRANG
2019-02-06 21:23
On the topic of quality in general.

First, I wrote the following article on the wiki: https://en.wiki.tatoeba.org/art...rove-sentences
I hope it can help contributors become more aware of the quality aspect and how to help more efficiently on that front.

Second, I hope this year we can start shifting our mindset on how we should handle quality. I hope we can stop asking ourselves "how do we prevent users like Maxence from running bot experiments", or "how do we stop contributors from adding sentences in languages they are not native speakers of", and instead actively try to answer the question of how we let these people contribute. What do we need to change in Tatoeba so that these people can contribute without being an annoyance or a burden?

We already have the beginnings of solutions -- this isn't exactly a new problem -- but I hope we can make them more concrete.
Aiji
2019-02-10 14:39
Let's hope that in time "Using a bot to split and add whole classical novels without any added value into the corpus." will not rise to "contribution of quality" :)
But of course, everybody is free as long as they follow the rules.
CK
2019-02-05 03:59
> There is, in my opinion, no reason to be upset about massive addition of bad sentences.

There may be no reason to be "upset", but bad sentences shouldn't be tolerated if there is an easy way to avoid them. Having bad sentences in the corpus does a disservice to anyone using the data. While it may be impossible to be 100% error-free, that should be what we strive for.

[#6106141] It's bad enough to learn something correctly, and then get it wrong. It's a lot worse to spend time learning something that is incorrect, thinking it's correct. (CK)

AlanF_US
2019-02-05 22:54
I agree.