2019-02-28 12:46 - 2019-03-01 01:37
Mozilla has publicly released its Common Voice speech data collection
2019-02-28 19:02
Al ĉiuj: Ĉu nomoj krom 'Tomo' ne estus tre utilaj por tiu projekto? Ŝajnas, ke oni maljuste ĉikanas uzantojn, kiuj faras tiajn frazojn.

To everyone: Wouldn't names other than "Tom" be very useful for that project? It seems like users who make those sorts of sentences are unjustly hounded about it.
2019-03-01 20:45
I agree. Tom has been used too much. I know it was adopted to avoid repetition, and it is not against Tatoeba rules, but I truly believe there are other ways to achieve such a thing.
2019-03-02 17:32
I'm not sure Tom has been used too much, but it would be nice — and useful to projects that use our data — if we had other names, especially non-Western ones. OsoHombre has used a lot of Arab names in his sentences, and ended up being repeatedly told to use the name "Tom" over and over after already giving his reasons for not doing so. I think using other names couldn't possibly hurt and is actually very helpful to some use cases, like Mozilla Common Voice.
2019-03-02 18:06
I believe there was only one person in favor of restricting sentences to this small number of names, and Trang explicitly said that this was not a Tatoeba policy. However, that person is a very prolific contributor, and is not about to start adding sentences with other names. Basically, this comes down to the following: If you want to see sentences with proper nouns other than "Tom", "Mary", "Boston", and "October", add them yourself -- but don't be surprised if you continue to see many more sentences that only use those nouns.
2019-03-03 09:47
When I started learning Romanian I used the Anki app with a deck based on the Tatoeba corpus. I remember getting a series of sentences like "Tom is sleeping", "Mary is sleeping", "Fadil is sleeping", etc. It is very annoying to translate them all. In my opinion it is quite OK to use names other than Tom, but before adding a sentence please make sure that you don't create near-duplicates.
2019-03-03 13:22
Is there a standard program that generates these decks? If so, it would likely be possible to change this program in a way that doesn't add near-duplicates to the deck.
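One way such a change could look, as a minimal sketch rather than a description of how any existing deck generator works: replace every known name with a generic slot and keep only the first sentence for each resulting pattern. The name list and the function names here are made up for illustration, and a real tool would need per-language handling of inflected names.

```python
import re

# Hypothetical list of placeholder names; a real tool would need a fuller one.
NAMES = {"Tom", "Mary", "Fadil", "Sami", "Layla"}

def template(sentence, names=NAMES):
    """Map a sentence to a pattern in which every known name is a generic slot."""
    # Split into words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join("<NAME>" if t in names else t.lower() for t in tokens)

def drop_near_duplicates(sentences, names=NAMES):
    """Keep the first sentence seen for each pattern, preserving input order."""
    seen = set()
    kept = []
    for sentence in sentences:
        key = template(sentence, names)
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept

deck = drop_near_duplicates([
    "Tom is sleeping.",
    "Mary is sleeping.",
    "Fadil is sleeping.",
    "Tom is eating.",
])
```

With this input, only "Tom is sleeping." and "Tom is eating." stay in the deck. The corpus itself is left untouched; the filtering happens entirely in the tool.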
2019-03-03 14:43
As shekitten suggested, and as frustrating as it can be, the problem lies in the tool, not in the source.

Having near-duplicate sentences could be very useful to a tool trying to extract information from sentences, as a training set, for example.

I think it is harder for the maker of the second tool to create near-duplicates out of nothing than for the maker of the first to filter out near-duplicates that already exist.

As always, most use-case issues lie in poorly designed tools and not in the source (surely the source could be improved, but you get the point, I think).
2019-03-03 16:27
> If you want to see sentences with proper
> nouns other than "Tom", "Mary", "Boston",
> and "October", add them yourself

This doesn't really work due to the native speaker policy. We're discouraged from adding sentences in languages that are likely to be translated, so the impact we can have is very limited. Native English speakers have the unfair advantage of being able to force their names (and not just names) on others.

Tatoeba badly needs decolonisation.
2019-03-04 08:09
I agree about decolonization.

Some constructive things to do:

1. Add sentences with varied names of people and places.
2. Add other sentences characteristic of less represented cultures - food, clothes, politics, idioms, nature, etc.
3. When you see such sentences in other languages, translate them.
4. When you see someone discouraging others from adding sentences with varied names, for example, reply with a list of reasons why adding varied names is good and desirable.

Tatoeba is a volunteer effort, so the best way of creating change is to simply do it. It is a pity that certain cultures dominate the website, but I do not see a constructive way of preventing that. The Tom, Mary and Boston sentences do add value to the website, so they should not be discouraged, either.
2019-03-04 10:20 - 2019-03-04 10:22
I think there are constructive ways of preventing that. It's like something George Orwell said on a different topic (political language): "A man may take to drink because he feels himself to be a failure, and then fail all the more completely because he drinks."

The main way colonial attitudes perpetuate themselves is passive. I don't need to make any personal effort to continue cultural genocide against the Seneca; it's enough that I do nothing about the fact that the English language and English cultural attitudes are the supreme currency on their land. I don't need to steal the land of Indigenous Americans personally; it's enough that my great-grandparents came here and settled on their land after it was already stolen, and that I do nothing to remedy the material harm that has resulted from this theft.

Something we could do, apart from what you're suggesting, is to make it a violation of policy to discourage people from adding non-Western names, places, etc. Such discouragements are assertions of long-existing hierarchies, no matter how politely they are phrased and regardless of whether the user is conscious of this. I am sure there are other things we could do, but if we begin from the point that "nothing constructive can be done," it will be a self-fulfilling prophecy.
2019-03-04 10:48 - 2019-03-04 15:42
My comment was about Tatoeba in particular, in case that was not clear.

I do not know enough about how policy on Tatoeba is settled to comment on that. I would not oppose your suggestion, if it was up to me.

Below is a comment, slightly edited, that I left on a sentence when its contributor was discouraged from adding a sentence with a non-standard name. Maybe you will find it useful when you see this kind of counterproductive behaviour:


1. I would not discourage people from contributing, even if they contribute sentences that fit a pattern. They will probably also contribute other things.

2. Writing names in different scripts is non-trivial and language-dependent. Consider the famous mathematician written as Tikhonov, Tihonov, Tychonoff, etc., or the philosopher written as Plato or Platon.

3. Declension of names is non-trivial in some languages. For example, in Finnish, Tomi-Tomin-Tomilla, Johannes-Johanneksen-Johanneksella. As such, having different names adds actual linguistic content. This tends to be especially true of foreign names.

4. It is most natural to write sentences that use names of the culture in question.

5. If there were default names, which culture would they be from? I would prefer Väinö and Aino, personally, as they are good and traditional names with ties to Finnish mythology. I am sure everyone else would have different favourites.

6. Different names suggest different genders in different languages. Kari is male in Finland and female in Norway, for example, by default.

7. It takes a lot of effort to police patterns. One could spend the same effort adding new sentences and translating existing ones instead. Furthermore, this would be something one would have to teach most contributors one by one. Adding such requirements for contributing is not a good idea.

So: Several such sentences are not needed, but they also do not cause harm. It would take work and be highly impolite to police them. Unifying names would, in general, lead to loss of linguistically relevant content.
2019-03-04 10:59
Thanks, those are useful points. And especially #3 is a thing that English speakers can easily miss, and it really shows how useful and even necessary it is to give sentences with multiple names. If the primary common language used on Tatoeba were Russian instead of English, our default proper nouns would probably reflect the language's different declensions. If it were Turkish, our default proper nouns would probably reflect the different types of vowel harmony and consonant changes.
2019-03-03 13:05
> OsoHombre has used a lot of Arab names in his sentences

Actually, the way OsoHombre builds up his corpus isn't much different from CK's. He has his own standard names, too.

Sami <-> Tom
Layla <-> Mary
Fadil <-> John
Salima <-> Alice
Cairo <-> Boston

I think users adding original sentences in large numbers tend to adopt this wildcard policy one way or another. It has its advantages. The question is, if this policy is useful for users individually, will extending it to all original sentences in a language bring more good than harm, or vice versa?

Btw, there's a phone-number search site generating thousands of spam pages by using the patterns of Tom sentences here with different names for SEO purposes. It's interesting to see how many derivations can be done just from a single pattern.
2019-03-15 13:23
I have seen the "do not insert annotations into sentences" policy here:

While I am aware of that, I wonder if adding annotations *on top* of sentences (i.e. separately) would be a partial solution here. Below every sentence there could be an extra field, perhaps only displayed to advanced contributors, that allows marking proper nouns, and maybe even other things such as dates (in the simplest form, picture something like the "highlighter" feature of PDF annotation software). This would not have any of the drawbacks listed on the page linked above, as the annotation would be a completely optional separate field that's hidden by default.

The advantages would be that downstream users of the data (e.g. Memrise-style study deck apps, or translation software) could then attempt to replace proper nouns to add some variety to the data. This is obviously not as trivial as I make it sound, as one would have to contend with phenomena such as inflection, but it's certainly a starting point – and inflection could be dealt with downstream, or partially handled by more sophisticated annotation schemata (which could be used to mark gender, declension, etc.).
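To make the idea concrete, here is a rough sketch of what such a separate annotation layer and a downstream substitution could look like. Nothing here is an existing Tatoeba feature: the `Span` type, the labels, and `replace_spans` are all hypothetical, spans are assumed not to overlap, and inflection is ignored entirely.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """A stand-off annotation: it points into the sentence without changing it."""
    start: int  # index of the first character of the marked text
    end: int    # index one past the last character
    label: str  # hypothetical label such as "PERSON" or "PLACE"

def replace_spans(text, spans, replacements):
    """Return a copy of text with annotated spans swapped out by label.

    Spans whose label has no replacement are copied through unchanged.
    Assumes the spans do not overlap.
    """
    parts = []
    pos = 0
    for span in sorted(spans, key=lambda s: s.start):
        parts.append(text[pos:span.start])
        original = text[span.start:span.end]
        parts.append(replacements.get(span.label, original))
        pos = span.end
    parts.append(text[pos:])
    return "".join(parts)

sentence = "Tom lives in Boston."
annotations = [Span(0, 3, "PERSON"), Span(13, 19, "PLACE")]
varied = replace_spans(sentence, annotations, {"PERSON": "Aino", "PLACE": "Cairo"})
```

Here `varied` becomes "Aino lives in Cairo." while `sentence` stays exactly as contributed, which is the point of keeping the annotation separate from the text.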
2019-03-16 07:15
Currently there exist sentence-specific tags, but nothing word-specific, as far as I know.
2019-03-01 20:45
