AlanF_US, April 11, 2020 at 3:15:55 PM UTC, edited April 11, 2020 at 4:32:53 PM UTC

I apologize for the length of this post in response to Aiji's request for corpus maintainers' feedback ( ). I'm writing it in a separate thread to make it easier to read the full original thread (as well as this post).

Since December, this is the process I've been following to proofread sentences:

I started by taking a list of non-native English contributors that CK compiled and sorting it in decreasing order of sentence count. For contributors at the very top of the list, with tens of thousands of sentences, I spot-checked their sentences. But starting, I think, with contributors who had thousands of sentences, I looked at all of their sentences. (The exception was maybe half a dozen contributors whose error rates were so high, or whose sentences were so hard to correct, that I decided to correct some and come back to the rest later.) As you might expect, there are far more contributors with few sentences than with many, which makes it increasingly difficult to get through all of them, so I've temporarily stopped at contributors with 19 sentences or fewer.
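As a rough illustration only, the ordering described above could be sketched in Python like this. The `Contributor` class, the `plan_review` function, and both cutoff values are hypothetical: the post only says "tens of thousands" for spot-checking and mentions a stopping point around 19 sentences, so the exact numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    # Hypothetical record; not a Tatoeba data structure.
    username: str
    sentence_count: int


def plan_review(contributors, spot_check_from=10_000, defer_below=19):
    """Sort contributors by decreasing sentence count and pick a review mode.

    Both thresholds are assumed values chosen for illustration.
    """
    ordered = sorted(contributors, key=lambda c: c.sentence_count, reverse=True)
    plan = []
    for c in ordered:
        if c.sentence_count < defer_below:
            mode = "deferred"          # not reached yet
        elif c.sentence_count >= spot_check_from:
            mode = "spot-check"        # too many sentences to read them all
        else:
            mode = "full review"       # read every sentence
        plan.append((c.username, mode))
    return plan


if __name__ == "__main__":
    sample = [Contributor("a", 50), Contributor("b", 25_000), Contributor("c", 5)]
    for username, mode in plan_review(sample):
        print(username, mode)
```

Keeping the cutoffs as parameters makes it easy to move the stopping point further down the list later.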

One thing I confirmed from this exercise is that the overwhelming majority of the sentences contributed by non-natives are free from errors when I see them, either because they were fine when they were contributed, or because they were subsequently corrected. (However, the error rate for non-natives is higher than for natives, so if I want to find errors, it makes sense to look at non-native contributions first.) This fact shapes the way that I perform my corrections. In particular, it leads me to focus on finding errors (which "jump out at me") rather than the painfully tedious action of marking OK sentences as OK. I also know that there are far too many sentences for me to mark as OK, and I don't want to introduce biases as to which ones I select (for instance, the shortest ones, or the simplest ones, or the ones that are not depressing, or... ).

There are several advantages of proceeding contributor-by-contributor:
(1) When I first look at a contributor, I can check whether they're active (by going to their profile page and checking their last activity -- it would be nice if this involved fewer mouse clicks). This determines whether I leave a comment and wait two weeks for a response before changing the sentence, or change it right away. Other corpus maintainers have encouraged me to skip the wait for inactive contributors, saying that they don't wait for responses from them either. That policy seems to be sound, since I can't remember any instance where a previously inactive contributor "rose from the grave" and protested a sentence being fixed before the grace period had passed. In my experience, waiting out the grace period for inactive contributors makes the proofreading process infeasible.
(2) When I see all of a contributor's sentences on a single page, it's easier to pick out the errors because I'm not distracted by the variation of style from one contributor to another.
(3) When I see that a contributor's error rate is unacceptably high, I can send them a private message asking them to correct their faulty sentences before adding new ones. If that fails, I block them until they do. (A corpus maintainer who is not an admin would need to ask an admin to do this.) In my experience, people generally abandon Tatoeba (or at least their accounts) rather than take that step. But at least that gives us an opportunity to fix their sentences without being overwhelmed by a flood of new bad ones.
(4) It's possible to keep track of one's progress in a way that would be nearly impossible when proceeding by, for instance, sentence creation date. The number of contributors is manageable.

There are two kinds of errors that are both so frequent and so mild in comparison to others that I let some instances go unmarked:
(1) non-sentences, meaning contributions that are neither valid sentences, nor valid dialogues, nor sentence fragments matching the criteria in "How To Write Good Sentences" ( )
(2) comma splices ( )

When I encounter a sentence by an active member that contains an error, especially when that error is not "non-sentence" or "comma splice", I add a "@change" tag to the sentence AND I leave a comment, usually of a form that includes both the faulty sentence and the proposed correct version or multiple possible versions. I find that the "copy sentence" button is invaluable in this. I copy the sentence contents by pressing the button, then I paste it twice and edit the second version. For instance, if the sentence said "I love writting sentences that make think", I'd leave this comment:


I love writting sentences that make think.
I love writing sentences that make one think.
I love writing sentences that make people think.
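The copy-paste-edit comment form shown above can be expressed as a tiny helper. This is a minimal sketch, not a Tatoeba feature: `build_correction_comment` is a hypothetical name, and it simply puts the faulty sentence on the first line followed by one line per proposed correction.

```python
def build_correction_comment(original, corrections):
    """Return a comment: the faulty sentence, then each proposed fix on its own line."""
    if not corrections:
        raise ValueError("at least one corrected version is required")
    return "\n".join([original, *corrections])


if __name__ == "__main__":
    print(build_correction_comment(
        "I love writting sentences that make think.",
        [
            "I love writing sentences that make one think.",
            "I love writing sentences that make people think.",
        ],
    ))
```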


There are several reasons why I like to use this form:
(1) Indicating the incorrect form as well as the correct one means that if someone encounters the sentence after it has been corrected, and sees this comment, which is usually right under the sentence, they can tell that the problem has been fixed. (Presumably, they could figure this out through the logs, but that's much more difficult.)
(2) Giving the correct form below the incorrect form gives the contributor a mental image that can help them learn.
(3) If I provide a correct version, the correction can later be done by anyone: the contributor, another corpus maintainer, or me (saving me the work that I've already put into identifying the problem and a solution).
(4) The copy-paste-edit pattern is quicker than writing a more extended comment such as "The word is spelled 'writing', not 'writting'", which would involve more typing.
(5) I like to make my comments as objective as possible. This form is very neutral.

Periodically, I review the list of sentences tagged "@needs native check" or "@check". With these sentences, if the sentence is OK, I delete the tag and do one or more of the following:
(1) Click the OK review button. NOTE: This is the only time I use that button. As I've said, I don't believe that going through pages' worth of sentences that are OK and marking them as such is a good use of my limited time.
(2) Leave a comment (if I feel it's necessary to clarify a point).

If the sentence is not OK, I delete the tag, add a "@change" tag, and add a comment (unless the contributor is inactive, in which case I generally change the sentence myself).
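The handling of "@needs native check" and "@check" sentences described above amounts to a small decision procedure, sketched here in Python. The function name and the action strings are hypothetical, and two points are assumptions: that the OK button is always clicked for an OK sentence (the post says "one or more" of the two steps), and that a sentence by an inactive contributor is corrected directly without adding "@change".

```python
def triage_tagged_sentence(sentence_ok, contributor_active, needs_clarification=False):
    """Return the list of actions for one sentence tagged for checking."""
    actions = ["remove check tag"]
    if sentence_ok:
        actions.append("click OK review button")      # assumed to be the default step
        if needs_clarification:
            actions.append("leave clarifying comment")
    elif contributor_active:
        actions.append("add @change tag")
        actions.append("leave comment with proposed correction")
    else:
        # Inactive contributor: no grace period (assumed: no @change tag either).
        actions.append("correct the sentence directly")
    return actions


if __name__ == "__main__":
    print(triage_tagged_sentence(sentence_ok=False, contributor_active=True))
```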

Less often, I review the list of sentences tagged "@change", focusing on the ones that are older than a week. I find that if I do this step too often, it just wastes my time, since many of the sentences with this tag remain unchanged throughout the grace period.

Less frequently still, I search for sentences with other tags whose names start with "@", such as "@maybe delete". This takes more time, since these tags, unlike the three I mentioned before, cannot be reached via "Improve sentences" in the user interface.

AlanF_US, April 11, 2020 at 3:24:14 PM UTC

One thing more: when I have small chunks of free time, I perform further spot-checking on sentences by non-native contributors with the largest number of sentences (tens of thousands). It's easy for me to find their sentences because I remember their usernames. Working through the remainder of the non-native contributor list (that is, contributors with 18 or fewer sentences) requires me to pull up and refer to the list and keep track of my position on it.

Thanuir, April 11, 2020 at 3:34:43 PM UTC

"(5) I like to make my comments as objective as possible. This form is very neutral."

Especially with new people, friendliness and approachability can be useful if one wants to encourage a new member to keep working with Tatoeba.

New users also may not know how to change a sentence's language or how to edit a sentence. Adding this information can make it possible for the sentence to be corrected.

AlanF_US, April 11, 2020 at 4:08:42 PM UTC

Here's the Google Translate English version of Thanuir's comment:

"Especially with new people, kindness and approachability can be helpful if you want to encourage a new member to continue with Tatoeba.

New users may also not know how to change the language of a sentence or edit a sentence. Adding this information may allow the sentence to be corrected."

I certainly agree that kindness and approachability are helpful. When I say that I like to make my comments as objective as possible, I mean that I focus on the sentence and how to fix it, not on making a statement about the user's ability in the language. (If they've demonstrated a pattern of making errors, I will ask them to fix their existing sentences, rather than suggest that they shouldn't be contributing at all.) The vast majority of the contributors whose sentences I proofread are not new users. When I see these sentences, I sometimes see that someone has already posted a "Welcome to Tatoeba!" message on them when they were first written.

Thanuir, April 11, 2020 at 5:37:24 PM UTC

I believe your approach works well among those who have been using Tatoeba for a longer time.