Tatoeba
TRANG, August 3, 2019 at 11:02:27 PM UTC

**Should we stop sentences with Tom and Mary?**

The topic of Tom and Mary has been discussed at length on the Wall. Some notable threads are:
https://tatoeba.org/eng/wall/sh...#message_31468
https://tatoeba.org/eng/wall/sh...#message_31422
https://tatoeba.org/eng/wall/sh...#message_32136

It's becoming this huge elephant in the room and we haven't really taken any measures against it. Maybe it's finally time.

For context, up until now, our policy was to let everyone use whatever names they want when creating sentences. But we do have a problem with the proliferation of "Tom" and "Mary" in our corpus.

If you want stats, here's what I've got from using the search:
- 368,224 English sentences containing "=Tom"
- 145,365 English sentences containing "=Mary"

To put this into context: 30% of the English sentences contain the name "Tom". The words "he" or "she" don't even have that many occurrences. This feels very imbalanced.
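As a rough illustration of how such a share can be measured, here is a minimal Python sketch. It assumes that a whole-word regular-expression match is a fair stand-in for the `=Tom` search, which may not be exactly how the site's search works; the sample sentences are made up and are not Tatoeba data. The last line only back-computes the corpus size implied by the two figures quoted above (368,224 sentences and a rounded 30%), so it is an estimate, not an official count.

```python
import re

def share_containing(sentences, name):
    """Return the fraction of sentences that contain `name` as a whole word."""
    if not sentences:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    hits = sum(1 for s in sentences if pattern.search(s))
    return hits / len(sentences)

# Hypothetical sample, not real Tatoeba data.
sample = [
    "Tom went to Boston.",
    "She speaks French.",
    "Tomorrow is Monday.",        # must not count: "Tom" is not a whole word here
    "Mary and Tom are friends.",
]

tom_share = share_containing(sample, "Tom")

# Corpus size implied by the quoted figures (30% is rounded, so this is approximate).
implied_total = round(368_224 / 0.30)
```

On the sample, two of the four sentences match, so `tom_share` is 0.5; `implied_total` comes out at roughly 1.23 million English sentences.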

From what I have gathered, the main reasons for using Tom and Mary revolve around avoiding near-duplicates. But we don't have evidence that near-duplicates really are that much of an issue. And even if they were a big issue, solving this issue by using wildcard names is not a good solution. It creates some other major issues.

1) Many people are annoyed at seeing and translating all these Tom/Mary sentences, and a few of them have expressed that on the Wall. Some people might have quit contributing because of this, but I have no evidence of that.
2) The corpus feels much less diverse than it could be. But diversity is an important value for Tatoeba. I would argue that diversity is a component of quality: a good quality corpus is a diverse corpus. I would also argue that the lack of diversity makes people feel less included and less connected to the project.

I don't think we can restore the balance without reviewing our policies. Yes, contributors can create sentences with different names than Tom and Mary, but that's never going to be enough if other people continue to add sentences with Tom and Mary.

So my proposal is the following.

1) We officially declare that adding sentences with Tom and Mary in large quantities is harmful to Tatoeba. Too much of anything is harmful, and if we do agree that we have too much of Tom and Mary, then the most natural thing to do is to just stop.
2) Translating sentences with Tom and Mary is still okay. We already have all these sentences, there is no turning back now. The primary goal is to stop introducing tons of new sentences with Tom and Mary.
3) No one will be an outlaw for adding a few new sentences with Tom and Mary. It's understandable that these names may still appear in some new sentences. Maybe the contributor doesn't know about our issue with these names or maybe there was a good reason for using these names in the specific context of their sentences.
4) But if someone continues to add a significant number of sentences with Tom and Mary despite being informed, they should be reported to the community admins.
5) If someone is consciously adding near-duplicates by replacing Tom/Mary sentences with other names, they should also be reported to the admins. It goes against the goal of having a diverse corpus. Near-duplicates will happen, inevitably. But they should happen organically, not because someone decided to transform themselves into a bot.
6) When a sentence has not been translated yet and contains Tom or Mary, we should explain our problem with these names and encourage the owner to use another name if possible.

Whatever issues or inconveniences arise because of people using a more diverse set of names, we will solve them, but with another solution than enforcing wildcard names. I'm sure we are smart enough to come up with something better.

Let me know what you think.

deyta, August 4, 2019 at 12:18:01 AM UTC

A new era begins in Tatoeba.

Ricardo14, August 4, 2019 at 8:56:41 PM UTC

+1.000.000

AlanF_US, August 4, 2019 at 12:27:08 AM UTC, edited August 4, 2019 at 12:00:56 PM UTC

I'm glad to have you say that, Trang.

As Impersonator pointed out ( https://tatoeba.org/eng/wall/sh...#message_32144 ), there's a third major issue with "Tom" and "Mary": in languages like Russian, foreign names are often not declined where native names would be. This affects not only Tatoeba but the sites that use its sentences. In particular, the fact that the Clozemaster ( https://www.clozemaster.com/ ) corpus was taken from Tatoeba means that when I do exercises there, I have virtually no Russian names to practice with. That is a major hole. As a result, when I write original sentences in English that I want to be translated into Russian, I make a point of using Russian names (as represented in English) like Ivan, Masha, Sergey, and so on. Once enough such sentences have been translated, I plan to ask the Clozemaster administrators to import them.

TRANG, August 4, 2019 at 11:16:01 AM UTC

> there's a third major issue with "Tom" and "Mary": in languages like Russian, foreign
> names are often not declined where native names would be.

This problem can actually be solved without the drastic policy of stopping sentences with Tom and Mary. As long as Russian contributors take the time to create sentences with other names, or take the time to translate sentences that contain names other than Tom and Mary, then it works. It wouldn't matter if there are one million sentences out there with Tom and Mary, as long as there are a few thousand with some other names.

For the problems I mentioned, which in the end are maybe just one problem (diversity), I do not see any other way than my proposal. But I'm still open to other ideas. I don't always have the best ideas and maybe there's something better we can do.

Aiji, August 4, 2019 at 1:44:45 AM UTC

My opinion has changed several times on this topic, and my most recent view is, as I argued here before, that near-duplicates are a non-problem. And maybe trying to avoid them is actually the problem.

In short:
- Suppose that every sentence containing Tom appeared a second time with another name. Those duplicates would then represent 5% of the corpus. We would need to count their translations too, but if we scale from the current data, we would more or less reach that number. 5% is not such a big deal. It makes sense to think that reaching this 5%, if nobody tries to play the smart boy, would take as much time from now as it took from the beginning to now.
- Even if there are duplicates, what is the big deal? It is a good thing from many points of view. As mentioned, diversity. Russian names, Japanese names, Chinese names, etc.
It also increases the number of occurrences of a pattern. Nobody complains about "I visited my uncle / mom / sister / brother / the plumber yesterday". So why complain about "Tom / Makiko / Lin / Olga visited me yesterday"? If only one occurrence of a pattern exists, it biases everything, human and machine understanding included.
- "Near-duplicates is pollution of database". Oh la la, mince alors. That's a design issue, not a contribution issue. Actually, that will remain a non-issue for many years to come - we don't even have 10 million sentences.
- "People will lose time translating near-duplicates where they could translate more useful sentences". Then, please provide more useful sentences instead of boring five-word first-year level ones.

soliloquist, August 4, 2019 at 6:16:14 AM UTC

If the concern is about multiculturalism, we should encourage users to contribute original sentences about their culture using local names. Since many of the sentences here are translated from English (either as first-hand translations or translations of translations) and there is a contribute-in-your-native-language policy, English (American) names will likely continue to dominate the corpus in one way or another. The real point is that CK is the most active user here. If you convinced him to use different names, so names like Sam, Joe or Jane began to spread instead of Tom and Mary, would it solve the issue?

Another approach might be to encourage using personal pronouns rather than proper names, so sentences would be more culture-neutral.

I'm not bothered by the abundance of Tom sentences here. Tom is a short, well-known (thanks to Tom & Jerry), easy-to-write and easy-to-pronounce name. Besides, there are other Web sites using corpora from different sources including Tatoeba. And not all of them properly give credit to the original sources. Tom sentences indirectly serve as a trademark on such places.

TRANG, August 4, 2019 at 12:08:40 PM UTC

> If the concern is about multiculturalism, we should encourage users to contribute
> original sentences about their culture using local names.

We should, always. But my point was that it's not enough.

> The real point is that CK is the most active user here. If you convinced him to use
> different names, so names like Sam, Joe or Jane began to spread instead of Tom and
> Mary, would it solve the issue?

It probably would solve the issue. But this is not an issue that should be handled only with CK. Because other people could keep propagating Tom and Mary. And other people could just create the same problem but with other names (although at this point, it's a bit difficult to beat Tom and Mary). The real point is to make people aware of how the choice of names impacts this project.

> I'm not bothered by the abundance of Tom sentences here.

But would you be bothered if we reversed the trend? If yes, what would bother you? Why would you be unhappy not seeing an abundance of Tom sentences anymore?

> Besides, there are other Web sites using corpora from different sources including
> Tatoeba. And not all of them properly give credit to the original sources. Tom
> sentences indirectly serve as a trademark on such places.

If our corpus has reached this point, then we have failed terribly at providing a diverse corpus.

soliloquist, August 4, 2019 at 7:26:12 PM UTC

> It probably would solve the issue. But this is not an issue that should be handled only with CK. Because other people could keep propagating Tom and Mary. And other people could just create the same problem but with other names (although at this point, it's a bit difficult to beat Tom and Mary). The real point is to make people aware of how the choice of names impacts this project.

> But would you be bothered if we reversed the trend? If yes, what would bother you? Why would you be unhappy not seeing an abundance of Tom sentences anymore?

Although I'm not a native speaker, sometimes I create English sentences and I try to use the name Tom because CK frequently records audio for Tom sentences. I think we should take this factor into account. There are not many people contributing audio. That's my point.

TRANG, August 5, 2019 at 9:12:34 PM UTC

> I try to use the name Tom because CK frequently records audio for Tom
> sentences. I think we should take this factor into account. There are not many
> people contributing audio. That's my point.

The fact that CK contributes a lot of audio is not relevant to this topic. It cannot and should not be a factor.

I understand that there is the desire for more content. But please think about it this way: you are basically saying that whoever is the most active contributor gets to decide what the corpus looks like. It shouldn't be that way.

For instance, if I was the most active contributor of English audio and one day I decide I want to record only sentences containing "lol", do you then start adding sentences with "lol"?

I don't blame you at all for trying to increase your chances at having your sentences recorded. But this is your personal choice there. In this particular case, you see no harm with Tom sentences and you're fine creating such sentences. But other people do not feel that way, they are not okay to make the trade-off you are making.

In general, we should not have to care if our decisions displease those who have more influence, more power, more authority or more money.

soliloquist, August 6, 2019 at 4:06:42 AM UTC, edited August 6, 2019 at 4:09:36 AM UTC

> For instance, if I was the most active contributor of English audio and one day I decide I want to record only sentences containing "lol", do you then start adding sentences with "lol"?

I don't think 'Tom' and 'lol' are comparable, but if you recorded audio for sentences with some other short and easy-to-pronounce name instead of Tom, I would appreciate your efforts and might try to contribute sentences with that name. Why not? Having your sentences recorded by a native speaker is a good thing.


> I understand that there is the desire for more content. But please think about it this way: you are basically saying that whoever is the most active contributor gets to decide what the corpus looks like. It shouldn't be that way.

On the contrary, I'm not comfortable with it because thousands of ill-constructed, strange machine translations in Turkish have been added here for years arising from the desire for more content, and that problem is worse than what we're discussing here. The Tom issue looks like a first world problem compared to that.

But after all, that's the nature of it unless a ban or quota is imposed.


> I don't blame you at all for trying to increase your chances at having your sentences recorded. But this is your personal choice there. In this particular case, you see no harm with Tom sentences and you're fine creating such sentences. But other people do not feel that way, they are not okay to make the trade-off you are making.

You'll need to convince CK. He's the locomotive of this tradition. Other members using Tom in their sentences are rather like cars attached to the locomotive. Even if you convinced them, the locomotive could go at top speed, but without the locomotive, the trend wouldn't last long.


> In general, we should not have to care if our decisions displease those who have more influence, more power, more authority or more money.

No objections.

AlanF_US, August 4, 2019 at 12:12:44 PM UTC

> If the concern is about multiculturalism, we should encourage users to contribute original sentences about their culture using local names.

I would expand this to say "using local names or a variety of names from other cultures". There's no reason that English speakers need to limit themselves to Sam, Joe, or Jane. They could easily use names like Pedro and Renate as well. In fact, many English-speaking countries have so many inhabitants and/or so much influence from elsewhere that a variety of names better reflects the national culture.

soliloquist, August 4, 2019 at 7:25:38 PM UTC

> In fact, many English-speaking countries have so many inhabitants and/or so much influence from elsewhere that a variety of names better reflects the national culture.

You're right. I'm not against that. It's something you native speakers need to discuss and settle.

Thanuir, August 4, 2019 at 6:37:23 AM UTC

I am not sure any policing of names in sentences is worth the trouble. It always adds friction to new people when there are more rules, or, even worse, unwritten community norms like not creating the kinds of sentences one sees all the time in the database already.

If the problem is CK creating too many sentences with Tom and Mary, then discussing the matter with them might be the most constructive course of action.

TRANG, August 4, 2019 at 12:34:04 PM UTC, edited August 5, 2019 at 8:32:51 PM UTC

Obviously discussing the matter with CK is part of the whole thing. But as I've said to soliloquist, this is not something that should concern CK only.

Your response and soliloquist's response made me realize that my proposal was way too narrow and targeted in the end just to one contributor.

So I would like to rephrase my proposal this way:

1) We officially declare that adding sentences with the same names in large quantities is harmful to Tatoeba.
2) Translating sentences with names that have been overused is still okay. We already have all these sentences, there is no turning back now. The primary goal is to stop introducing tons of new sentences with the same names.
3) No one will be an outlaw for adding a few new sentences with names that have been overused. It's understandable that these names may still appear in some new sentences. Maybe the contributor doesn't know about our issue with these names or maybe there was a good reason for using these names in the specific context of their sentences.
4) But if someone continues to add a significant number of sentences with the same names despite being informed, they should be reported to the community admins.
5) If someone is consciously adding near-duplicates by creating new sentences from existing ones, just replacing overused names with other names, they should also be reported to the admins. It goes against the goal of having a diverse corpus. Near-duplicates will happen, inevitably. But they should happen organically, not because someone decided to transform themselves into a bot.
6) When a sentence has not been translated yet and contains an overused name, we should explain our problem with these names and encourage the owner to use another name if possible.

The whole thing can be summarized with: make an effort to keep the corpus diverse. And this doesn't have to be an unwritten rule. We can add it to the "Add sentences" page, along with the warnings about punctuation and licenses. I'm not sure many people read it, but in any case, the main end-goal is to integrate the notion of diversity into our culture and make sure it remains for the dozens or hundreds of years ahead.

Thanuir, August 4, 2019 at 1:46:21 PM UTC

I do agree that following the guidelines would likely improve the corpus. The cost is the greater barrier to entry. I see this mostly on various Stack Exchange sites, where there are lots of community norms, many of which are easy to not encounter before adding questions or answers. Not everyone is friendly when introducing new people to the best practices. In fact, SE has done work to improve this, like adding a notice that someone is new to the website.

Would it be appropriate to have a list of best practices about what kinds of sentences to add, with the name issue as one thing there?

What more could be done when someone new starts using the website? Some kind of mentoring system, informal or official?

AlanF_US, August 4, 2019 at 3:27:06 PM UTC, edited August 4, 2019 at 3:31:53 PM UTC

These are good questions. We could learn a lot from Stack Exchange.

There are introductory wiki pages (see the links at the bottom of any page):

https://en.wiki.tatoeba.org/art...w/quick-start#
https://en.wiki.tatoeba.org/art...ow/guidelines#
https://en.wiki.tatoeba.org/art...or-in-tatoeba#
https://en.wiki.tatoeba.org/art...ood-sentences#

but they could be friendlier. Once we come to some kind of consensus in the current discussion, we could try to make them more welcoming (and also incorporate the "keep the corpus diverse" guideline). We could also come up with a page with standard messages for existing contributors to use when encountering common issues from new contributors. For instance: "Welcome to Tatoeba! Please make sure to read the wiki pages ... " People could use messages directly from there or use them as inspiration to write their own. I would be happy to do this myself if no one else is interested.

TRANG, August 4, 2019 at 5:59:59 PM UTC

> The cost is the greater barrier to entry.

The barrier to entry will remain the same. We have never required contributors to read through our whole wiki before they start contributing. The fact that we add one more rule/guideline/policy to our wiki won't make it more difficult for a new contributor to start contributing. The biggest barrier to entry is the lack of user-friendliness of the website, not the evolving policies of Tatoeba.

Common sense should be everything that is needed for a new member to start contributing. All the specific rules can be learned on the go.

We have to be okay with new people not always knowing what they are doing and we have to be okay with taking some time to assist them, to explain to them things they didn't figure out on their own. We should be thankful to everyone who takes the time to read our rules, but we should not expect anyone to read them beforehand. On the other side, whoever is new to Tatoeba should never have to feel sorry for not reading the wiki, but they should always feel thankful that someone helped them out.

That's the kind of environment I wish for Tatoeba. With this kind of environment, it doesn't matter how many rules and guidelines we decide to introduce. As long as they make sense, the regular members will follow them and will teach them to the new members. And the new members will then one day become regular members who will teach the next generation of new members. And so the cycle goes.

We already have an informal system of mentoring. It happens through comments, through private messages and through the Wall too. Many regular members will instinctively reach out to the new members whenever needed, to explain how Tatoeba works.

Now due to the fact that Tatoeba is a fairly small community, we don't have any real issues with experts acting like assholes towards beginners. That's more a trait of very large communities and it's hugely amplified when you add competition into the mix (as is the case for video games, for instance). This may happen one day to Tatoeba, but I think for now we are still good and don't need to act upon it. In Tatoeba, new contributors can be easily spotted already, because there's not a big influx of new contributors. If you hang around Tatoeba long enough, you'll recognize new contributors just by name.

AlanF_US, August 4, 2019 at 2:12:33 PM UTC

I like this statement of policy. Just to make it clear, though: it needs to go beyond "Tom" and "Mary". CK has also been restricting other classes of words and urging others to do the same. For instance:

- language is always "French" (and you wouldn't believe how frustrating it is to encounter "French" a hundred times in Clozemaster exercises when no other language name ever comes up)
- city is always "Boston"
- month is always "October"
- day of month is always "20"

Then these restrictions are amplified in a number of ways. CK only chooses such sentences to "proofread". Then he uses this list when making recommendations as to:

- which sentences to translate
- which English sentences to add audio for
- which non-English sentences to add audio for (namely, those that are translations of the sentences that have those restrictions)

If you want him on board with the effort to keep the corpus diverse, you need him to reexamine this as well.

Finally, going beyond CK, if the policy is to keep the corpus diverse, people should be urged to contribute sentences that are not simply variations on a theme where one element in the sentence is varied, such as the pronoun ("I go to the bank. You go to the bank. He goes to the bank. She goes to the bank."). I realize that this limits the quantity of sentences an individual will contribute, but it should make the corpus more interesting and useful.
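To make "variations on a theme where one element is varied" concrete, here is a minimal Python sketch that groups sentences which become identical once a single token is masked. It is only an illustration under simplifying assumptions: tokens are split on whitespace, the function name `one_slot_groups` is hypothetical, and nothing here reflects a feature that Tatoeba actually has.

```python
from collections import defaultdict

def one_slot_groups(sentences):
    """Group sentences that are identical except for one token position."""
    groups = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens)):
            # Key = sentence length, masked position, and all remaining tokens.
            key = (len(tokens), i, tuple(tokens[:i] + tokens[i + 1:]))
            groups[key].add(sentence)
    # Keep only the keys shared by at least two distinct sentences.
    return [sorted(group) for group in groups.values() if len(group) > 1]

# The example sentences from the post above, plus one unrelated sentence.
sample = [
    "I go to the bank.",
    "You go to the bank.",
    "He goes to the bank.",
    "She goes to the bank.",
    "We visited my uncle yesterday.",
]

variation_groups = one_slot_groups(sample)
```

On this sample the sketch pairs "I go" with "You go" and "He goes" with "She goes", but it does not link the two pairs, because the verb form changes as well. A check this naive would therefore miss many real variations, especially in more heavily inflected languages.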

maaster, August 4, 2019 at 3:32:20 PM UTC

I don't find it a really good idea to always write the same proper names. Perhaps Boston is called Boston in every language, but there are countless other towns whose names differ between languages: Moscow - Moskau - Moscou - Moszkva - Moskva - Moskwa - Moscú..., Milan - Milano - Mailand - Mediolan..., Dunkerque - Dunkirk - Duinkerken..., Paris - Parijs - Parigi..., London - Londra - Londres - Londen - Londyn..., Gdańsk - Danzig - Gdanzica... and many more. That can perhaps be interesting too.
And there are many other geographical names that one can't find on Tatoeba at all.

I think it's irrelevant whether someone is called Tom, Fadil or XY. If someone translates John as Tom or Sami, it's not really important.
However, 90% of the sentences are about Tom (and Mary) and contain 4 or 5 words. (They can be translated into Hungarian or Turkish as 2- or 3-word sentences.) So 95% of the Tatoeba corpus is written for beginners. I think that's why 95% of new contributors quit Tatoeba after the second or third day, since they can learn nothing from it.
However, the simplest sentences are still the most famous ones. If we want to write a billion sentences, they're really good. If someone wants to learn with Tatoeba, well, they are not really useful.
Since 95% of the Tatoeba corpus consists of original English sentences or translations of English sentences, I encourage native English speakers to write sentences about every field of life (surgery, welding, breeding, banking, trade, travelling by train etc.), as well as somewhat longer and more complicated sentences.
If I choose a simple word to search for and translate, I find 280 sentences containing 4 or 5 words, and only the last 5 contain between 5 and 15 words. I.e., there are no sentences written for advanced contributors.
No doubt, many users are tired of Tom, Mary, John (doing something), Boston, French (teacher/learning), homework, tea, coffee, going out, restaurant etc.

Thanuir, August 6, 2019 at 5:16:35 AM UTC

I agree, though short phrases can also be valuable when they are set expressions or use grammar that is specific to the language.

I especially agree about field-specific vocabulary. There is far too little of it, and such terms are difficult to find translations for. Examples of use are also very helpful.

maaster, August 7, 2019 at 6:00:02 PM UTC

Yes, they are valuable as well.
When I start learning a new language, I like to use the short sentences.
But once I'm no longer a beginner, I want to see the usual order of adverbs and attributives, the construction of compound sentences (in some languages the order of clauses is strict), colloquial usage, and things like onomatopoeic words and eh, ugh, oh etc.
(I find dialogues the best way to express those things. In a dialogue you can show what, e.g., an adverb means, so that others don't have to search Google to find out what that idiom means.)

TRANG, August 4, 2019 at 6:51:51 PM UTC

> Just to make it clear, though: it needs to go beyond "Tom" and "Mary".

Definitely. We would start with Tom and Mary because they are the most obvious problem, for which, I think, we can say objectively "we have too many of these sentences". But we can extend the policy to anything really.

Still, before we do anything concrete, we should wait a bit more to see how this discussion unfolds. I wouldn't say we have a final decision yet.

I mean, promoting diversity in the corpus is completely fine, we can readjust the wiki to emphasize that.

But declaring that mass contributions of overused words are against our rules, on the same level as contributing a large number of racist sentences would be against our rules, can still be argued. It's a pretty radical decision after all, and I'd like to make sure we're overall comfortable with that level of restriction.

For me it is okay, because after all, we stop people from contributing when they contribute too many bad quality sentences (in grammar, in spelling). Here with Tom and Mary, it's the same, but a little bit more subtle. It doesn't impact quality in a textbook kind of way, but it does impact quality on the emotional level. Our corpus feels like microwaved food, or eating at a fast-food restaurant. It's edible, right? But it doesn't feel that great. And I hope we can instead have something home-made, something that feels authentic and not industrialized.

But that's my own perspective and I would actually like to hear more from the people who enjoy creating or translating Tom and Mary sentences. What is their perspective? Are we giving up on something else important by establishing these policies to promote diversity?

AlanF_US, August 5, 2019 at 12:58:05 PM UTC

@CK, are you interested in participating in this discussion?

AntonKhorev, August 4, 2019 at 10:20:44 PM UTC

- quotation is always from the Bible

AlanF_US, August 5, 2019 at 9:03:42 PM UTC

For what it's worth, here's the full list of rules that CK uses:

http://a4esl.org/temporary/tato...wildcards.html

Objectivesea, August 6, 2019 at 6:36:34 AM UTC, edited August 6, 2019 at 7:54:13 AM UTC

Thanks, @AlanF_US, for kindly posting the link to @CK's complete list of Wildcards, i.e., the names he uses and recommends to others, like Tom, Mary, Boston, Australia, Park Street, etc. I had not seen this full list before, but having seen it, I am now thoroughly convinced of the great benefit of using a single set of names for the basic Tatoeba corpus. In the right column of that page, http://a4esl.org/temporary/tato...ildcards.html, there are links to three marvellous JQuery demostrations that show how easy it can be to perform automatic substitutions on the corpus, so that, for instance, every instance of 'Tom' can be instantaneously replaced with 'Fadil' (or 'Sami', or anything else). One would only need to modify the JQuery code to perform the desired replacement.

I know that @TRANG is super-busy with coding the wonderful overall Tatoeba platform, but it might not be too far-fetched to imagine the creation of a number of similar substitution routines which could be user-selected from a drop-down menu. Someone who wants to translate a selection of Tatoeba English sentences to French, Tagalog or Swahili or whatever could then apply the same JQuery routine to the Tatoeba equivalents to create derivative French, Tagalog or Swahili entries linked to the English versions that use her choice of alternative names. From the one master Creative Commons database, users could create, for example, a smaller open-source bilingual corpus using their choice of substitutes for the wildcards.

Doing this sort of automatic one-to-one substitution is far easier than trying to recreate a vast number of nearly identical sentences but having to search for random names instead of a small canonical set of fixed size. Admittedly, translators into Slavic languages will need to implement some sort of grammar parser so that they can determine the role of Tom, Mary and Boston in each individual sentence in order to provide correct case endings. This is a difficult task but not an insurmountable one.
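
For English, the one-to-one substitution described above can be sketched in a few lines. This is only an illustration in Python, not the code behind the linked demos; the mapping (including "Cairo" for "Boston") and the function name `substitute_names` are assumptions made up for the example.

```python
import re

# Hypothetical mapping from wildcard names to a user's preferred names.
SUBSTITUTIONS = {"Tom": "Fadil", "Mary": "Layla", "Boston": "Cairo"}

# Match whole words only, so "Tom" is not replaced inside "Tomorrow".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b"
)


def substitute_names(sentence: str) -> str:
    """Replace each wildcard name with its mapped substitute."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1)], sentence)


print(substitute_names("Tom and Mary moved to Boston."))
print(substitute_names("Tomorrow Tom's sister arrives."))
```

This works because English names keep the same form in every grammatical role; the sketch makes no attempt at the case endings mentioned for Slavic languages.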

Here's a delightful example showing how easily actor names and language names can be changed, provided we start with the basic wildcards; I found the demo very compelling: http://a4esl.org/temporary/tato...eas/wildcards/

Let's not mess with a good thing. What CK has provided the Tatoeba volunteer community, in addition to his astonishing contribution of nearly 570,000 sentences, is a practical mechanism to minimize duplications and chaos and to maximize utility, both actual and potential. I believe that we should formalize CK's recommendations as an official Tatoeba recommendation.

User55521 User55521 August 6, 2019 August 6, 2019 at 7:40:46 AM UTC link Permalink

> One would only need to modify the JQuery code
> to perform the desired replacement.

Try to modify the code to do the following replacement (replacing Mary with Layla):

> #419053 **Мэры** цікавіцца палітыкай. > **Лайла** цікавыцца палітыкай.
> #7705629 Том усё яшчэ вінаваціць цябе ў смерці **Мэры**. > Том усё яшчэ вінаваціць цябе ў смерці **Лайлы**.
> #566844 **Мэры** цікавая палітыка. > **Лайле** цікавая палітыка.
> #419054 **Мэры** цікавіць палітыка. > **Лайлу** цікавіць палітыка.
> #431693 Мы сустракаемся з **Мэры** сёння пасля абеду. > Мы сустракаемся з **Лайлай** сёння пасля абеду.
> #2658610 Том закахаўся ў **Мэры**. > Том закахаўся ў **Лайлу**.

It's next-to-impossible to create a fully working algorithm because in some cases, two options would be correct (Мэры сказала Эн can mean either 'Mary told Ann' or 'It was Ann who told Mary'; so substitutions would be either Лайла сказала Эн or Лайле сказала Эн, depending on the meaning).
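
To make the problem concrete, here is a small Python sketch that tabulates which form of "Лайла" each of the six example sentences above requires. The data is copied from those examples; the function name `required_forms` is made up for the illustration.

```python
from collections import defaultdict

# (sentence id, form of "Mary" in the original, form of "Layla" required),
# copied from the six example sentences quoted above.
EXAMPLES = [
    (419053, "Мэры", "Лайла"),
    (7705629, "Мэры", "Лайлы"),
    (566844, "Мэры", "Лайле"),
    (419054, "Мэры", "Лайлу"),
    (431693, "Мэры", "Лайлай"),
    (2658610, "Мэры", "Лайлу"),
]


def required_forms(examples):
    """Collect every replacement form needed for each source form."""
    forms = defaultdict(set)
    for _sentence_id, source_form, target_form in examples:
        forms[source_form].add(target_form)
    return dict(forms)


for source_form, targets in required_forms(EXAMPLES).items():
    print(source_form, "->", sorted(targets))
```

A single unchanging source form maps to five different target forms, so no lookup table keyed on the word alone can pick the right one; the choice depends on the rest of the sentence.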

> Doing this sort of automatic one-to-one
> substitution is far easier

No, it's not. It's easy in English, but extremely difficult in some other languages.

AlanF_US AlanF_US August 6, 2019 August 6, 2019 at 1:35:46 PM UTC link Permalink

> No, it's not. It's easy in English, but extremely difficult in some other languages.

Just to give even more of an idea of what Impersonator is talking about:

In Russian, there are six (major) cases. The endings for these cases change according to gender and number, as well as according to the original ending of the name (in the nominative, or "dictionary" form). An automated inflector would need to identify proper names (because you can't rely on capitalization to determine whether a word at the beginning of a sentence is one), identify their cases based on the rest of the sentence, transform the existing word, and apply the ending. Even the reverse problem, developing a stemmer to remove the ending from inflected words, which is far easier, is on the order of Ph.D. thesis work, and stemmers still make lots of mistakes. And we've just been talking about one language. So the claim that automatic substitution is simple can only be made if one ignores an enormous chunk of the languages on the planet.
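
As a rough illustration of only the last step (applying an ending once the case is already known), here is a deliberately tiny Python sketch. It is a toy under loud assumptions: it covers just two simplified paradigms plus indeclinable names, the caller must supply the case and gender, and the names `decline`, `PLAIN_CONSONANTS` and so on are invented for the example. Identifying names and their cases in a sentence, which is the hard part, is not attempted.

```python
CASES = ("nom", "gen", "dat", "acc", "ins", "prep")

# Simplified endings for two paradigms only. Real Russian also needs
# soft stems, spelling rules, names in -я/-й/-ь, and more.
MASC_CONSONANT = dict(zip(CASES, ("", "а", "у", "а", "ом", "е")))
A_STEM = dict(zip(CASES, ("а", "ы", "е", "у", "ой", "е")))

# Consonants after which the endings above apply without spelling changes.
PLAIN_CONSONANTS = set("бвдзлмнпрстф")
# Final vowels treated here as marking an indeclinable name, e.g. "Мэри".
INDECLINABLE_FINALS = set("иоуэ")


def decline(name: str, case: str, masculine: bool) -> str:
    """Return the form of `name` for `case`, or raise if not covered."""
    last = name[-1]
    if last == "а" and name[-2] in PLAIN_CONSONANTS:
        return name[:-1] + A_STEM[case]
    if masculine and last in PLAIN_CONSONANTS:
        return name + MASC_CONSONANT[case]
    if last in INDECLINABLE_FINALS:
        return name
    raise NotImplementedError(f"no paradigm for {name!r} in this sketch")


for case in CASES:
    print(
        case,
        decline("Том", case, True),
        decline("Лайла", case, False),
        decline("Мэри", case, False),
    )
```

Even this toy has to refuse most names, which is a fair picture of how much a real inflector would need to know.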

Hybrid Hybrid August 7, 2019 August 7, 2019 at 12:29:51 AM UTC link Permalink

Thank you. I think that's why CK wants to use Wildcards. We could ask Horus to create new sentences using different names based on existing sentences. It's like using variables in a mathematical formula.

I also think that we should thank CK for his contribution to Tatoeba. He has added a lot of sentences and spent a lot of time proofreading English sentences. Thank you CK.

I think that everyone should be free to add sentences with Tom and Mary.

I also think that everyone should be free to add sentences with different names.

TRANG TRANG August 7, 2019 August 7, 2019 at 9:44:50 PM UTC link Permalink

> Here's a delightful example showing how easily actor names and language
> names can be changed, provided we start with the basic wildcards;
> I found the demo very compelling: http://a4esl.org/temporary/tato...eas/wildcards/

You are under the wrong influence here. You are seeing this shiny feature and thinking to yourself how it would be awesome to have it. But like many people before you, you have fallen into this trap: you have been convinced that you should adjust the content of the corpus so that the feature is easier to implement. This is a very bad approach.

Please do not take it as an offense, but as food for thought.

We don't want to be restricted in the names we can use in our sentences just because it makes it easier to implement a substitution algorithm. We would rather adapt the algorithm to be smart enough to handle any name we may come up with. Not only is it more rewarding intellectually, but it also has long-term benefits.

Why should we spend just one day coding a simplistic substitution algorithm and then struggle for the rest of our lives to convince every new person who joins Tatoeba to use wildcard names, when we could spend one month coding an advanced substitution algorithm and not worry for the rest of our lives about what names people use?

Generally speaking, we don't want to build software to become slaves of it. We aim to build software that serves us and adapts to us. I'm not the first one to say this and I won't be the last. We will repeat this until it is clear in everybody's mind.

> I believe that we should formalize CK's recommendations as an
> official Tatoeba recommendation.

Please have a look at my post below and let me know if you still believe that CK's recommendations are a good thing:
https://tatoeba.org/eng/wall/sh...#message_32369

While I'm sure these recommendations were coming from good intentions, they are absolutely not a good solution to minimize duplications and chaos and to maximize utility.

Objectivesea Objectivesea August 9, 2019 August 9, 2019 at 3:49:19 AM UTC link Permalink

Thank you, @TRANG, for an interesting and thoughtful answer. Ultimately, I will follow whatever policies Tatoeba lays out, as I think the concept and its (inevitably less than perfect) execution are highly worthy efforts in their own right. Recognizing that we have limited programming staff (basically yourself, plus any donated help from things like Google Summer of Code), I had assumed that constructing parsers and part-of-speech taggers was beyond the scope of what was possible. However, if it is indeed feasible, yours would seem to be the most elegant solution. Good luck on these ambitious projects.

jegaevi jegaevi August 4, 2019 August 4, 2019 at 6:52:09 AM UTC link Permalink

I agree. My only problem is that the majority of the sentences that have audio are Tom and Mary sentences. I know it's allowed to translate these, but wouldn't I be blamed if I only translated these ones? Since I mostly translate English sentences with audio.

TRANG TRANG August 4, 2019 August 4, 2019 at 12:42:03 PM UTC link Permalink

Whoever decides to blame you for that is in the wrong. If you want to translate sentences in English with audio, it's your choice and it's a perfectly reasonable choice. If the sentences happen to be Tom and Mary sentences, that's not your fault. If you don't translate them, someone else will anyway, someday. So it would make no sense to blame you or anyone for that matter.

deniko deniko August 5, 2019 August 5, 2019 at 2:17:37 PM UTC link Permalink

> Let me know what you think.

I think all our rules and guidelines have always been very reasonable, and I have always been able to relate to them.

Full sentences, make sure the sentence sounds fine and natural in your language, etc.

They are both specific and quite general, hence brilliant.

Banning a single name or a set of names would be a different kind of a rule. Encouraging users to use different names is one thing, banning a contributor for using "Tom" and "Mary" is a bit off.

Of course it would be quite funny and unusual to join those who were banned because you used the name Tom one time too many. The new "we suffered for Tom" club.

Anyway, I've got a few questions.

> No one will be an outlaw for adding a few new sentences with Tom and Mary.

Can you quantify that? I don't want to become an outlaw, so, once you establish the new rules, how many will I be able to add? Is 5 sentences a day reasonable?

Also, is it only about English? Will someone become an outlaw for adding new sentences with Tom in, say, toki pona or Scots?

Besides, I have something to say in defense of Tom and Mary, but I'll write it as a separate message, as this one is getting too long.

TRANG TRANG August 5, 2019 August 5, 2019 at 3:17:11 PM UTC link Permalink

> I don't want to become an outlaw, so, once you establish the new rules, how many
> will I be able to add? Is 5 sentences a day reasonable?

This is not something we can quantify. Just like we can't really quantify how many sentences should someone contribute before they are allowed to become advanced contributor. We have to look at the whole situation.

But this is also not about setting up a quota of how many sentences with Tom and Mary someone is allowed to add per day. If we come to an agreement that having more of these sentences is becoming poisonous, then zero per day is the goal.

> Also, is it only about English? Will someone become an outlaw for adding new
> sentences with Tom in, say, toki pona or Scots?

No, this is not only about English.

I suppose you have not followed up with the other responses in this thread. While my question was about stopping Tom and Mary sentences, the broader issue really is diversity.

As I've said to Thanuir, the whole thing can be summarized with: make an effort to keep the corpus diverse.

And as I've said to Alan, we would start with Tom and Mary because they are the most obvious problem, for which, I think, we can say objectively "we have too much of these sentences". But we can extend the policy to anything really.

deniko deniko August 5, 2019, edited August 5, 2019 August 5, 2019 at 3:23:09 PM UTC, edited August 5, 2019 at 3:23:30 PM UTC link Permalink

> This is not something we can quantify.
...
> If we come to an agreement that having more of these sentences is becoming poisonous, then zero per day is the goal.

So it does seem you can quantify that.

> If we come to an agreement

Do you mean, if you come to agreement with yourself? Or will there be a vote? If it's a vote, who will be allowed to vote?

TRANG TRANG August 5, 2019 August 5, 2019 at 6:30:27 PM UTC link Permalink

> So it does seem you can quantify that.

Indeed :) But you could be adding a hundred sentences with Tom one day, and if you had a very good reason for it (one that we could not anticipate), that wouldn't be against the rules. That's what I meant by not being able to quantify.

> Do you mean, if you come to agreement with yourself? Or will there be a vote?
> If it's a vote, who will be allowed to vote?

No, I mean with the community. This is the whole point of this discussion on the Wall. It gathers opinions and arguments.

Additionally, there will be an analysis of the corpus to check who has been creating sentences with Tom/Mary, and the contributors will be contacted privately to make sure they didn't miss out on this discussion and to make sure they have a more comfortable space to share their concerns.
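
As an illustration of what such an analysis could look like, here is a minimal Python sketch that counts, per contributor, the sentences mentioning Tom or Mary. The input format (pairs of username and sentence text), the sample data, and the function name `count_by_contributor` are assumptions for the example, not a description of the actual Tatoeba tooling.

```python
import re
from collections import Counter

# Whole-word match, so "Tomorrow" is not counted as "Tom".
NAME_PATTERN = re.compile(r"\b(?:Tom|Mary)\b")


def count_by_contributor(sentences):
    """Count sentences containing Tom or Mary for each username."""
    counts = Counter()
    for username, text in sentences:
        if NAME_PATTERN.search(text):
            counts[username] += 1
    return counts


# Hypothetical sample data: (username, sentence text).
sample = [
    ("alice", "Tom is here."),
    ("alice", "Mary likes cake."),
    ("bob", "The weather is nice."),
    ("bob", "Tomorrow is Monday."),
    ("carol", "Tom and Mary left."),
]

for username, total in count_by_contributor(sample).most_common():
    print(username, total)
```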

There will be no vote. If one person brings an irrefutable argument in favor of continuing the mass-addition of Tom and Mary sentences, then we will not take action.

No matter the outcome, this will be synthesized and documented in some ways (perhaps on the blog, perhaps in the wiki).

Whatever policies we decide to establish, they need to be founded on solid grounds. But some topics are complicated, and I think this one is. There's no policy that will make everyone happy, but at least we need to know why we took the decision. Then we can see how it develops, come back to it some months or years later, and evaluate if the reasons we provided at the time are still valid. If not, we try something else.

deniko deniko August 5, 2019, edited August 5, 2019 August 5, 2019 at 2:33:38 PM UTC, edited August 5, 2019 at 4:45:25 PM UTC link Permalink

> From what I have gathered, the main reasons for using Tom and Mary revolve around avoiding near-duplicates. But we don't have evidence that near-duplicates really are that much of an issue.

I don't think that near duplicates are an issue.

In my experience I encountered a few situations like the following:

I see 6-7 sentences linked together, translate one into Ukrainian, and boom! I'm linking them to a family of 9-10 sentences in completely unrelated languages. Both sets stemmed from different original sentences in different languages, were growing on their own, and then just "magically" merged into a single linked family. This is really beautiful, I'm not ashamed to say that when I witnessed something like this I was quite emotional. This will still be possible with sentences without names, of course, but will become statistically less probable when we have diverse names.

Another point - it's quite easy to translate a sentence from English to say Dutch no matter what the name is. "Tom is there"->"Tom is daar", "Xing is there" -> "Xing is daar", "Zwzxcvbdgftrpl is there" -> "Zwzxcvbdgftrpl is daar". You don't care how the name is pronounced, how it should be spelled - just follow the original spelling. When I translate something into Ukrainian, translating names is always a problem. We had a lot of problems with Sami, first translating him as Семі, then realizing it should be translated as Самі, and manually changing dozens if not hundreds of sentences just for this. Same with Fadil. I don't even want to start translating sentences with Mennad now because figuring out how to translate this name is a big time investment. And while it's VERY interesting to do some investigation on translating a word or expression, it doesn't feel worth it with names.

Having a lot of Tom and Mary's around is nice because it feels like translating a book with two main characters. The characters are already well known, I feel comfortable around them, but it doesn't make the book's vocabulary any less diverse compared to a hypothetical book that changes its main characters' names from page to page. The diversity is in words and expressions, not in names. And of course there are a lot of less important characters around, it's not like we have only Tom and Mary here.

I have never been against other names, by the way. I supported OsoHombro when he started adding sentences with different names, and I understand his reasons - Sami and Layla just work better for Arabic, and his main concern is the Arabic corpus.

So I'm against banning any name.

I think it would be enough to add something like this to the FAQ:

"While there are a few names that seem to dominate the Tatoeba corpus, we encourage our contributors to use names native to their languages when contributing new sentences" - or something like this, in proper English, clearly showing that "Tom and Mary" are not only not the official policy, but also mildly discouraged.

However, any ban on any name would be very unreasonable, I believe. Basically, you would be punishing the most productive contributor only because he's the most productive. A victim of his own success.

EDIT: A couple of words about diversity. This topic does bother me. 75% of all Ukrainian sentences were added by me, and it is a problem because any individual person's active vocabulary is more limited compared to 100 people's active vocabulary. I would love to have more active contributors in Ukrainian.

It's not that bad in English, of course, but CK's share is also very substantial - I believe it's about 50%. Do you think the English corpus would have been more diverse if CK had used random names in each of his sentences, as opposed to Tom and Mary? Personally, I don't believe so. I believe it would have been less diverse, because there would have been thousands of duplicates like "Tom is here", "Jenny is here", "Martha is here", etc, all from CK.

To make the corpus more diverse Tatoeba needs to attract more contributors of different backgrounds.

I believe this is more important than waging a war on a couple of names.

TRANG TRANG August 5, 2019 August 5, 2019 at 6:11:02 PM UTC link Permalink

> This will still be possible with sentences without names, of course, but will become
> statistically less probable when we have diverse names.

This problem can be solved by extending how we link sentences. The following issues on GitHub hold ideas that would lead us toward a solution for not missing out on potential links:

https://github.com/Tatoeba/tatoeba2/issues/1902
https://github.com/Tatoeba/tatoeba2/issues/44

And yes, this might take a while until we come up with a good complete solution, but there is a way. And if anything, having more and more diverse names will be a greater incentive for us to work on a concrete solution.

> When I translate something into Ukrainian, translating names is always a problem.
> ...
> We had a lot of problems with Sami, first translating him as Семі, then realizing it
> should be translated as Самі, and manually changing dozens if not hundreds of
> sentences just for this. Same with Fadil.

Why is one wrong and the other not though? Why can't both be correct? If it's about following the pronunciation of the name, then I would not have called Семі wrong. In the same way that a word can be ambiguous in meaning, a name can be ambiguous in pronunciation.

> And while it's VERY interesting to do some investigation on translating a word or
> expression, it doesn't feel worth it with names.

I don't know how name translations are handled in Ukrainian, but I have to wonder if you might have over-complicated the problem. I think there is a lot of flexibility in name translations, at least in the languages that I know. While it's always a nice gesture to spend time trying to figure out the best translation, it is okay as well to just go with the translation that feels the most sensical to you, without doing intensive research.

> Having a lot of Tom and Mary's around is nice because it feels like translating a
> book with two main characters.

There's obviously nothing wrong with having many sentences with the same name. But at some point don't you feel it's too much? I mean, consider again the numbers: 368k sentences contain Tom.

In the early days of Tatoeba, one of our contributors was creating sentences with a fictional character named Dima. It all started here:
https://tatoeba.org/eng/sentences/show/477989

The type of sentences in there was perhaps not for everyone, but they had a fun and creative aspect. There were in total 90ish sentences (in English). And this was enough, I think, to give this feeling of translating a book with a main character.

I understand Tom and Mary are sort of a "comfort zone" by now, but it seems you still have plenty of content to translate in that comfort zone. Right now, there are 337225 sentences in English with "Tom", not translated into Ukrainian. There are still plenty of Tom sentences to translate in every language. Do you really lose (do we lose) anything if we put a stop to the propagation of Tom sentences?

> Basically, you would be punishing the most productive contributor only because
> he's the most productive. A victim of his own success.

Just to be clear, we are not talking about banning CK here. That did not even cross my mind. I don't think he's the kind of person who would start vandalizing the corpus due to a decision that he didn't like. He may contribute fewer sentences, or may decide to completely stop contributing, but I don't think we will ever get to the point of banning him.

What I hope, on the contrary, is that he keeps contributing, but re-adapts the way he contributes so that we can eventually have a corpus that feels more interesting for everyone. To my knowledge, CK is a human with adaptive skills. He is not going to become disabled and dysfunctional because he can't add sentences with Tom and Mary anymore... Is he?

We are not looking for punishment, we are looking for compromise.

To give you an analogy, imagine how it would feel if a user was adding erotic sentences. Just a few sentences might be okay, but there is a point where it becomes too much and we have to ask them to stop because a part of the community isn't feeling comfortable with it.

Something similar is happening here. There are too many Tom and Mary sentences, to the point that many contributors started to feel more and more discomfort. This is not a new phenomenon, it can probably be traced back to a couple of years ago, if not more (time flies, I don't keep track anymore).

Under these circumstances, do you still think it is unreasonable to ask people to stop adding more Tom and Mary sentences? And in the broader picture, is it unreasonable to ask people to make an effort to keep the corpus diverse by avoiding names, words, topics that have already been overly illustrated?

deniko deniko August 6, 2019 August 6, 2019 at 8:39:21 AM UTC link Permalink

Thanks for elaborating, Trang.

> Under these circumstances, do you still think it is unreasonable to ask people to stop adding more Tom and Mary sentences? And in the broader picture, is it unreasonable to ask people to make an effort to keep the corpus diverse by avoiding names, words, topics that have already been overly illustrated?

You raised a few good points, but it still feels wrong to ban any name for the reasons I explained above - I don't feel like more names result in more diversity, but even if this is true, I don't believe banning a name is compatible with the website's values, and it would make Tatoeba quite hostile.

Your "erotic sentences" analogy is good, but not perfect. Currently, you can easily do a search on anything you wish avoiding any sentences with "Tom" and "Mary", just add "-Tom -Mary" to the end of search string. And voila, you're basically exploring the sub-corpus of sentences that are not about Tom and Mary. There's no way I can avoid erotic sentences, or biblical quotes in my searches. If we had 50% of corpus filled with erotic or biblical sentences, and if there was an easy way to limit my searches not to include them, I wouldn't have problems with them.

I feel like nothing will sway your mind, but please do consider my suggestion to clearly declare that Tom and Mary are not official Tatoeba names, moreover, contributors are not encouraged to use these names here. I think part of the problem is that a new contributor feels obliged to use those names because they're so abundant, thinking they're somehow official, but if it's clear they're not, the tendency could be reversed. You can at least start soft as opposed to the harsh move of banning those names.

Of course I have my egoistic reasons to write about this, I admit. I feel like I'll be one of the first people to be banned, and I don't like being banned and I do like translating here on Tatoeba.

TRANG TRANG August 6, 2019 August 6, 2019 at 10:29:39 AM UTC link Permalink

> There's no way I can avoid erotic sentences, or biblical quotes in my searches. If we
> had 50% of corpus filled with erotic or biblical sentences, and if there was an easy
> way to limit my searches not to include them, I wouldn't have problems with them.

This is a fair point. There's no automated way to avoid erotic sentences while there is an automated way to avoid Tom sentences, this is true. Technically, it would be possible if erotic and biblical sentences were all tagged, but it does require the manual process of tagging and it requires this issue to be implemented: https://github.com/Tatoeba/tatoeba2/issues/1333.
The tagging could probably be semi-automated. We can check sentences against the Bible and see if we find it. We can have a set of vocabulary that is likely to be erotic. But it's definitely more effort than excluding specific names.
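
A minimal Python sketch of that semi-automated idea, under stated assumptions: `known_verses` stands in for a set of normalized Bible verses loaded from some text, `flagged_words` is a hand-maintained vocabulary list, and the tag names and function names are invented for the example. The output is only a suggestion for a human to confirm, not a final tag.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so small differences don't block a match."""
    return " ".join(re.findall(r"\w+", text.lower()))


def suggest_tags(sentence: str, known_verses: set, flagged_words: set) -> set:
    """Suggest tags for manual review; nothing is applied automatically."""
    tags = set()
    normalized = normalize(sentence)
    if normalized in known_verses:
        tags.add("bible")
    if flagged_words & set(normalized.split()):
        tags.add("needs-review")
    return tags


# Hypothetical inputs: one verse and a placeholder vocabulary list.
known_verses = {
    normalize("In the beginning God created the heaven and the earth."),
}
flagged_words = {"flaggedword"}

print(suggest_tags(
    "In the beginning, God created the heaven and the earth",
    known_verses,
    flagged_words,
))
```

Exact matching after normalization will miss paraphrases and other translations of the same verse, which is part of why this takes more effort than excluding specific names.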

> I feel like nothing will sway your mind, but please do consider my suggestion to
> clearly declare that Tom and Mary are not official Tatoeba names, moreover,
> contributors are not encouraged to use these names here.

How else to make it clear other than stopping ourselves from introducing more Tom and Mary sentences? We can declare anything, but if we act the opposite way, if we keep feeding the corpus with these names, doesn't it look contradictory in the eyes of a new contributor?

Any good argument can sway my mind, believe me. But this is not my mind that you need to sway, it is the mind of everyone who feels annoyed and bored and frustrated and overwhelmed by these Tom and Mary sentences. These sentences are not a problem for me, personally. Just like erotic sentences are not a problem for me, personally. But I cannot ignore that it is a problem for others.

> Of course I have my egoistic reasons to write about this, I admit. I feel like I'll be
> one of the first people to be banned, and I don't like being banned and I do like
> translating here on Tatoeba.

I'm not sure how terrible my communication skills are, that you ended up thinking you'd be one of the first people to be banned. Banning is an extremely, extremely rare thing in Tatoeba. This might be something "normal" in other communities, but here, this is the last of our last resorts.

I would like to make it very clear that in Tatoeba, the role of community admins is a role of mediator more than a role of judge. So when I talk about reporting to community admins, it is not about banning or not banning. It's about making sure that the member who is being reported understands the boundaries of Tatoeba.

We ban only when we feel there is no more hope and we have no more time. But as much as possible, we try to trust in people's cooperativeness.

I encourage you to read Alan's post:
https://tatoeba.org/eng/wall/sh...#message_32331

deniko deniko August 6, 2019 August 6, 2019 at 1:39:40 PM UTC link Permalink

Thanks again for your detailed answer.

User55521 User55521 August 6, 2019 August 6, 2019 at 2:11:16 PM UTC link Permalink

> Currently, you can easily do a search on anything
> you wish avoiding any sentences with "Tom" and
> "Mary", just add "-Tom -Mary" to the end of search
> string

I used to find sentences to translate via 'Contribute' > 'Translate Sentences' and 'Random sentence' features, there is no way to filter out Toms there. T_T

TRANG TRANG August 6, 2019 August 6, 2019 at 8:38:13 PM UTC link Permalink

Hmm, I didn't realize it was not possible to make a search that only excludes a specific word... It seems you have to include a word in order to be able to exclude one.

This doesn't work:
https://tatoeba.org/eng/sentenc...o=&sort=random

But this works:
https://tatoeba.org/eng/sentenc...o=&sort=random

So it's not perfect, but you could use a very common word (like "the") in combination with "-Tom" when searching for sentences to translate.
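
For anyone working from a local copy of the sentences rather than the site search, an exclude-only filter is straightforward. The following Python sketch assumes the sentences are already loaded as a list of strings; the sample data and the function name `exclude_words` are made up for the illustration.

```python
import random
import re


def exclude_words(sentences, excluded):
    """Return the sentences that contain none of the excluded words."""
    if not excluded:
        return list(sentences)
    # Whole-word, case-sensitive match: "Tom" does not match "Tomorrow".
    pattern = re.compile(
        r"\b(?:" + "|".join(map(re.escape, excluded)) + r")\b"
    )
    return [s for s in sentences if not pattern.search(s)]


# Hypothetical sample of English sentences.
sentences = [
    "Tom is here.",
    "It is raining.",
    "Mary sings.",
    "Tomorrow is Monday.",
]

candidates = exclude_words(sentences, ["Tom", "Mary"])
print(random.choice(candidates))  # a random sentence without Tom or Mary
```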

TRANG TRANG August 6, 2019 August 6, 2019 at 10:41:10 PM UTC link Permalink

> Do you think the English corpus would have been more diverse if CK had used random
> names in each of his sentences, as opposed to Tom and Mary?

Not a great deal more diverse, but a bit yes.

> Personally, I don't believe so. I believe it would have been less diverse, because
> there would have been thousands of duplicates like "Tom is here", "Jenny is here",
> "Martha is here", etc, all from CK.

Actually if CK could provide some insight into how many duplicates he actually avoided thanks to using only Tom and Mary, that would be helpful. Not just CK, but anyone. How often does it really happen, that a duplicate is being avoided thanks to Tom and Mary?

I assume it is really not a significant amount, and as Aiji argued, near-duplicates are not a problem.

In what you describe though, the lack of diversity and near-duplicates are not correlated. Even if there were no near-duplicates created, there would still be a problem of diversity. As several people pointed out, there's something about the complexity of the sentences.

I think personally that CK has put too much focus on quantity. The type of sentences he creates generally have simple structures and are clearly targeted to beginners in English. To be fair, everyone does it. It's much easier to create simple sentences. But CK has made it an industrialized process. He has been mass-producing sentences for years now. And that's the other factor in this diversity problem.

> To make the corpus more diverse Tatoeba needs to attract more contributors of
> different backgrounds.

I completely, wholeheartedly agree, but do you have any concrete idea for that? If you think the strategy to stop Tom and Mary sentences can be replaced by another strategy that will have a greater, more beneficial impact, then please make a proposal.

The rationale behind the Tom and Mary proposal is that in order to be more attractive to people from different backgrounds, our process of creating content must show that we embrace diversity. But the fact that we have a high occurrence of the same name is contradictory to that. We are sending mixed signals.

Integrating diversity will be an ongoing struggle. Avoiding Tom and Mary would be only the first step. It makes people become conscious that repetitiveness of the same word devalues the corpus. It makes people conscious that it is a disservice for many languages to be restricted to just a few names, because these languages have linguistic rules that are much more complex than English and many contributors cannot express the full properties of their languages with just Tom and Mary.

If you have a more powerful strategy, then I'm 100% happy to follow it. But you have to propose something, a clear plan. Because this is all I've got.

Also, it saddens me that you see this as a war. I hope we can all take this instead as an experiment. Would that help if we put a deadline on it? If we say "for the next six months, no Tom, no Mary, let's try and see what happens". Would it make it more reassuring to step into this new territory?

brauchinet, August 7, 2019 at 8:09:18 AM UTC

Maybe mass-production of sentences was a good strategy in the beginning.

1.) Limit the number of sentences a member can submit per day. This is surely very controversial. Personally, I would be very happy. About 100 sentences per user, I'd say. No mass import of sentences. This would make it clear that it is not (no longer) about quantity. There would be more time for proofreading. The main problem for diversity, that many languages have corpuses where one single member has contributed more than half of all the sentences, could at least be mitigated.
2.) Encourage people to submit their own sentences, (even) without translation. Many members seem to think that this is unwanted or useless (“a collection of sentences and their translations”; some might even think: a collection of English sentences and their translations) and feel obliged either to limit themselves to translating or to submit only sentences they can translate into a second language themselves. Sure, stand-alone sentences can be useless, but it’s one way to avoid a language’s whole corpus consisting of translated English sentences. Even if such translations are good, they will not represent the richness of this language.

jegaevi, August 5, 2019 at 5:33:10 PM UTC, edited August 5, 2019 at 5:38:31 PM UTC

What if there was an option to 'tag' words? For example:
(Tom;name) likes cake.
(Jenny;name) likes cake.
(Mennad;name) likes cake.
These near-duplicate sentences would be grouped together automatically (maybe linked by a third type of link). So the corpus would be more diverse without having to ban Tom and Mary and without lots of near-duplicates. It could be applied to other wildcards such as Boston, Australia, Monday or French.

I know nothing about coding, maybe this would be really hard to do. It's just an idea, maybe it can not be done at all.
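
For illustration only, here is a rough sketch of how the grouping described above could work. It assumes the hypothetical `(word;category)` annotation syntax from the examples; the function names are made up for this sketch and have nothing to do with Tatoeba's actual code.

```python
import re
from collections import defaultdict

# Hypothetical annotation syntax taken from the post: (word;category)
TAG = re.compile(r"\((?P<word>[^;()]+);(?P<category>[^;()]+)\)")


def to_pattern(annotated: str) -> str:
    """Replace each (word;category) tag with a <category> placeholder."""
    return TAG.sub(lambda m: f"<{m.group('category')}>", annotated)


def to_plain(annotated: str) -> str:
    """Drop the annotation and keep only the tagged word."""
    return TAG.sub(lambda m: m.group("word"), annotated)


def group_by_pattern(annotated_sentences):
    """Group plain sentences whose patterns are identical."""
    groups = defaultdict(list)
    for sentence in annotated_sentences:
        groups[to_pattern(sentence)].append(to_plain(sentence))
    return dict(groups)


sentences = [
    "(Tom;name) likes cake.",
    "(Jenny;name) likes cake.",
    "(Mennad;name) likes cake.",
    "(Tom;name) lives in (Boston;city).",
]
groups = group_by_pattern(sentences)
for pattern, members in groups.items():
    print(pattern, members)
```

The three cake sentences end up in one group because they share the same pattern, while the stored sentences themselves stay free of annotations.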

TRANG, August 7, 2019 at 12:02:44 AM UTC

What you suggest is possible technically, but doesn't fit into our principle of not having annotations in the sentence. There are other possible technical solutions to group near-duplicates though.

Still, as it has been said, our problem is actually not near duplicates. It would definitely be nice to be able to group them, so that search results are more user-friendly to navigate through, but it doesn't solve the question of diversity at its root.

User55521, August 5, 2019 at 8:00:55 PM UTC, edited August 5, 2019 at 8:09:03 PM UTC

I've been long opposed to name unification, so it comes as no surprise that I wholeheartedly support the efforts to bring different names into Tatoeba.

I've written a lot of rational arguments about names in https://tatoeba.org/eng/wall/sh...#message_32144 , so let me be a bit irrational now.

Names are magical.

It is a common motif around the world that saying someone's name is an invitation for them to come, even if they don't hear it. The sentences with non-English names are magical spells that will hopefully ensure that different people will join us.

This can be rationalised, of course (if someone comes to Tatoeba and sees their name, or a name from their culture, they're more likely to feel welcome here and stick around), but let's be irrational a bit. Language is magical, and names are the most magical part of the language.

DostKaplan, August 6, 2019 at 9:05:05 PM UTC

In that case provide a list of alternative, culture-embracing names. Let's say Tom, Ali, Pierre, and Vishal for men; Mary, Lela, Claire, and Kamala for women. When we search for: "Tom likes Mary", make sure the code also brings back matches for "Ali likes Claire."

AlanF_US, August 6, 2019 at 10:46:55 PM UTC

Definitely an improvement. But is this list meant to be representative of a longer one, or is it the whole list? If you're talking about only these eight names, this approach would partially address the unicultural problem, but not the lack-of-diversity-in-names problem. And since none of the names come from a Slavic or otherwise name-inflecting language, it wouldn't solve the lack-of-inflection problem. For that reason, if we were going to use a list, I would hope it would be considerably longer.

DostKaplan, August 6, 2019 at 11:28:12 PM UTC, edited August 6, 2019 at 11:28:57 PM UTC

It was just an example. The list obviously has to be a lot longer. Alternatively, also accept placeholder names like M1 and F2:

M1 told F1 that F2 wouldn't be coming to her party because F2's boyfriend M2 was going to take her to P1. :-)

TRANG, August 7, 2019 at 12:03:57 AM UTC

Hmm... Let's not start diving into the gender problem...

AlanF_US, August 7, 2019 at 2:05:32 AM UTC

I like the idea of writing sentences with "M1" and "F2". Now those are names we can all agree on. :)

Hybrid, August 7, 2019 at 12:05:35 AM UTC

You should not punish people because they add sentences with Tom and Mary, or because they don't. People should be allowed to do what they want.

TRANG, August 7, 2019 at 12:11:28 AM UTC

Extract from https://tatoeba.org/eng/wall/sh...message_32329:

We are not looking for punishment, we are looking for compromise.

To give you an analogy, imagine how it would feel if a user was adding erotic sentences. Just a few sentences might be okay, but there is a point where it becomes too much and we have to ask them to stop because a part of the community isn't feeling comfortable with it.

Something similar is happening here. There are too many Tom and Mary sentences, to the point that many contributors have started to feel more and more discomfort. This is not a new phenomenon; it can probably be traced back a couple of years, if not more (time flies, I don't keep track anymore).

Under these circumstances, do you still think it is unreasonable to ask people to stop adding more Tom and Mary sentences? And in the broader picture, is it unreasonable to ask people to make an effort to keep the corpus diverse by avoiding names, words, topics that have already been overly illustrated?

Hybrid, August 7, 2019 at 1:12:54 AM UTC, edited August 7, 2019 at 1:16:58 AM UTC

I think that there should be filters to exclude different kinds of sentences. For example, on Google, you don't see all the webpages but only the ones that match the words that you searched for.

So maybe users could see a custom homepage that would have more of certain kinds of sentences (more "Anne" sentences) and fewer of certain other kinds of sentences (fewer "Ann" sentences).

TRANG, August 7, 2019 at 10:28:20 PM UTC

But we don't have such a feature today, and even if we started thinking about implementing it today, we might not have it for another 2-3 years.

So I have to ask again: under these circumstances, do you still think it is unreasonable to ask people to stop adding more Tom and Mary sentences? And in the broader picture, is it unreasonable to ask people to make an effort to keep the corpus diverse by avoiding names, words, topics that have already been overly illustrated?

Hybrid, August 8, 2019 at 2:17:22 AM UTC

>So I have to ask again: under these circumstances, do you still think it is unreasonable to ask people to stop adding more Tom and Mary sentences?

I think that you should let people add whatever sentences they want, as long as they're not incorrect or offensive to other people.

I think that censorship of the names "Tom" and "Mary" will just drive people away from this website.

AlanF_US, August 7, 2019 at 11:01:48 PM UTC

Not only does implementing customization take time, as Trang has pointed out, but only the users who have accounts and are skilled with computers know how to take advantage of it. Everyone else ends up with default settings, and the discussion over what those should be is much the same as the original one. Furthermore, the ability to define a custom view that screens out Tom-and-Mary sentences doesn't resolve anything in the corpus. If people don't write anything but Tom-and-Mary sentences, the custom view will end up empty, and the corpus will be just as homogeneous as before.

DostKaplan, August 7, 2019 at 1:27:45 AM UTC

What if I am looking for sentences that look like "Tom's *"? Right now if I search for "Tom's *", I get a lot of results precisely because of the use of Tom. If we're going to introduce a bunch of other male names, you'd better change the code to look for "\w\'s \w", otherwise the person looking for "Harun's *" may not get any results.

TRANG, August 7, 2019 at 10:23:52 PM UTC

I'm not sure what the problem is there.

If we introduce a bunch of other names, the existing sentences with "Tom's *" won't disappear. You will still get a lot of results.

Today if someone is looking for "Harun's *" they won't find anything. If we introduce more names, they may actually find something.

Implementing the possibility to look for "\w\'s \w" is independent of the names that we use. It can be useful whether we stick to 2 names or expand to 2000 names.
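
To make the pattern search a bit more concrete, here is a minimal sketch using Python's `re` module. This is not how Tatoeba's search works; it only illustrates the idea. The pattern "\w\'s \w" as written matches a single character on each side of "'s", so the sketch assumes `\w+`-style groups to capture the whole name and the following word. The `CONTRACTIONS` set is a made-up, incomplete list for filtering out matches like "It's".

```python
import re

# Capitalized word, followed by 's, a space and another word.
# [A-Z] only covers ASCII capitals; that is a simplification for this sketch.
POSSESSIVE = re.compile(r"\b([A-Z]\w*)'s (\w+)")

# Hypothetical, incomplete list of words where 's is usually a contraction.
CONTRACTIONS = {"It", "He", "She", "That", "What", "There", "Who", "Let", "Here"}


def find_possessives(sentence):
    """Return (owner, thing) pairs such as ('Harun', 'sister')."""
    return [
        (m.group(1), m.group(2))
        for m in POSSESSIVE.finditer(sentence)
        if m.group(1) not in CONTRACTIONS
    ]


for s in ["Tom's car is red.", "Harun's sister borrowed Tom's bike.", "It's raining."]:
    print(s, find_possessives(s))
```

A search like this finds "Harun's *" just as easily as "Tom's *", regardless of how many names are in use.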

CK, August 7, 2019 at 4:17:23 AM UTC

Why I think it's not a good idea to stop the use of the names Tom and Mary and a possible solution for increasing the number of names.

http://bit.ly/tatnames

Objectivesea, August 7, 2019 at 4:40:18 AM UTC

Excellent! +1

This might just satisfy everyone who contributes to Tatoeba, providing it is technically feasible.

User55521, August 7, 2019 at 5:47:49 AM UTC, edited August 8, 2019 at 6:26:08 AM UTC

> providing it is technically feasible.

It isn't

___

EDITED LATER: I must admit I've overlooked this technical proposal, looking at the broad description ('A Possible Solution') in the main page.

This technical proposal is indeed feasible (since you're only converting English sentences; I thought the idea was to apply this to all languages, which would be very hard).

User55521, August 7, 2019 at 9:49:46 AM UTC, edited August 7, 2019 at 9:49:56 AM UTC

> http://bit.ly/tatnames

I'd like to comment on one point:

> Getting Sentences Translated into More Languages
>
> This also means that we are more likely to get
> sentences with more translations into various languages.

This assumes that translators would translate any sentence, and that the probability of getting a Tom sentence translated is the same as the probability of getting an Ikuyo sentence translated. Which is questionable.

I believe people would feel more invested in a corpus which describes things around them than in a corpus describing Tom learning French in Boston. So, your option is not having 'Tom can speak French' in all the languages, your option is having 'Tom can speak French' in *fewer* languages.

TRANG, August 7, 2019 at 11:30:17 AM UTC

So here's what doesn't make sense to me.

(1) On redundancy

Your statement is:

> Limiting ourselves to a select group of proper nouns means we get
> a wider set of sentences or sentence patterns without a lot of redundancy

Following on your example, you don't want to have:
- Tom can speak French.
- George can speak French.
- Kenji can speak French.
- etc

Then why not use "he" and "she" as much as possible? Why didn't you have this rule: if a sentence sounds natural by using "he" or "she", then use it instead of using a name.

It's much more likely that someone will create "He can speak French" than "Tom can speak French". If you create any Tom/Mary sentences with simple patterns, you are ultimately adding to the problem of near duplicates. Because now instead of having only "He can speak French", there's also "Tom can speak French".

In this context, I don't see an advantage of using "Tom" (or any name) when using "he" sounds natural too. With "he" you can also make substitutions. Here's a quick demo of sentences without names where I'm basically replacing "he" with Tom, then Pedro:
https://i.gyazo.com/02285159276...5ce3fbdcb2.mp4

At the top you have sentences without names. I picked sentences from your demo (http://a4esl.org/temporary/tato...as/wildcards/) and anonymized the ones that to me sounded natural without the name.
At the bottom you have the same sentences but with "Tom" or "Pedro" instead of "he".


(2) On getting sentences translated into more languages

Your statement is:
> instead of "Ikuyo can speak Spanish" with a Russian translation,
> "Winston can speak Turkish" with a Japanese translation and
> "Fred can speak German" with an Italian translation, we would
> quite likely have "Tom can speak French" with all those languages.

Assuming you want to prevent repetitive sentences that just differ in languages, or cities, why not enforce generic sentences, as long as the sentence sounds natural enough?

- He can speak this language.
- He lives in this city.

We can easily do substitutions here as well.
- We can replace "this language" by Spanish, Turkish, German...
- We can replace "this city" by Tokyo, Boston, London...

The general rule would be: whenever possible, use "this <concept>", where <concept> could be language, city, country, planet, fruit, vegetable, beverage.
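
As a rough sketch of what such substitutions could look like (assuming English only, a made-up `SUBSTITUTIONS` table, and a sentence-initial "He" or "She"; languages where names or nouns inflect would need much more than plain string replacement):

```python
import re

# Hypothetical substitution table keyed by concept.
SUBSTITUTIONS = {
    "language": ["Spanish", "Turkish", "German"],
    "city": ["Tokyo", "Boston", "London"],
}


def expand(sentence, substitutions=SUBSTITUTIONS):
    """Replace the first 'this <concept>' found with each value for that concept."""
    for concept, values in substitutions.items():
        placeholder = f"this {concept}"
        if placeholder in sentence:
            return [sentence.replace(placeholder, value) for value in values]
    return [sentence]  # nothing to substitute


def with_name(sentence, name):
    """Replace a sentence-initial 'He' or 'She' with the given name."""
    return re.sub(r"^(He|She)\b", lambda m: name, sentence, count=1)


print(expand("He can speak this language."))
print(expand("He lives in this city."))
print(with_name("He can speak French.", "Pedro"))
```

The generic sentence is stored once, and the variants are produced on demand rather than added to the corpus one by one.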

Now I don't think redundancy is such an issue that we need to enforce rules on how people create sentences. But assuming we wanted to do that, recommendations should just basically say: avoid names as much as possible, make generic sentences as much as possible. No?

mraz, August 7, 2019 at 1:11:36 PM UTC

TRANG: ...avoid names as much as possible, make generic sentences as much as possible.

mraz: Yes / IGEN

Aiji, August 7, 2019 at 11:46:56 PM UTC

Unfortunate that our chances of getting a direct answer are so low :)

Aiji, August 7, 2019 at 11:45:42 PM UTC

Let me add a bit to what TRANG already said.

First, and most importantly, your notes are intended for mass production of sentences. That was already said before, but I really think that your priority is quantity, clearly over quality. If not, please prove me wrong.
In order,
> This also means that we are more likely to get sentences with more translations into various languages.
1. You don't have evidence of that (hence the "we are more likely").
2. You don't take into account names that impact the rest of the sentences, as was mentioned MANY times before, and ignored by all your notes. Taking Fred and Winston as examples of replacement for Tom is biased, and of course it proves your point (that's a very nasty way of arguing, by the way).
3. Suppose we have 5 different sentences, each of them translated into only one language, but each different (sentence 1 => language 1, sentence 2 => language 2, and so on). What is the difference from having only one sentence translated into the five languages? Some people will answer that it makes indirect translations! Again, they don't know how names work outside of English. Indirect translations are far less important than direct translations, anyway.

> Audio
> When it comes to audio, I would much rather spend my time recording a set of sentences that don't have a lot of redundancy. When it comes to intonation and rhythm of speaking, there isn't much difference between "Ruth can speak Dutch" and "Tom can speak French," so I would rather get a variety of sentence patterns recorded than record these kinds of near duplicates.
You make a good point. However, as it was mentioned several times before, you're bending the tool to meet your needs. The correct way to avoid what you mean is to filter sentences ON YOUR SIDE. You may think that everybody thinks like you, but that is far from being the case. Please don't impact everyone's work to avoid a simple little extra work on your side.

>My Own Contributions
That's of course your right, and it seems logical to me. However, you write:
>Not only am I not contributing near duplicates of my own sentences, but I'm also not contributing near duplicates of sentences by other members who use the same set of wildcards.
May I ask how you avoid your own near-duplicates?
Would you then consider extending your set of wildcards? If you know your own set of wildcards, you may be able to avoid your own duplicates the same way as when only using Tom. For example, would it impact your work to use Tom, Sergei, Daisuke, Wei, etc.?

>Do We Need to Replace Textbooks?
>Some members may argue that we need to have various names so that we offer the same thing that textbooks offer.
I don't think anybody has argued that. However, we have argued that we don't want to be a dictionary. Having industrially produced "inverted near-duplicates" like "Tom likes ___." does not help in this regard.
>Would it not be more important or just as important to make sure that we have a large number of sentences that can take a person's name and an easy way to find such sentences?
Not if the number of sentences that can take a person's name is astoundingly crushing all other sentences. Please have a look at the number of sentences containing Tom, and tell me that nothing is wrong. That is more than a "Tom or not Tom" issue.
To stay on topic, again, you are not taking into account the diversity of name agreements.

> For Members Who Don't Like the Name Tom
That is the most unbelievable part of your notes. Do you think that to find standard, basic English sentences, the normal procedure is to start a search by removing words? This is not only about disliking Tom, it is about Tom being omnipresent.
https://tatoeba.org/fra/wall/sh...#message_32383

> Some Stats
Are we supposed to be satisfied that 800 000 sentences do not contain Tom and Mary?
What about the 400 000 that do contain them? Do you think this number is normal?
What about including "Boston", "French", etc.?

> Consensus?
Irrelevant. Please review the definition of consensus. It is not about a percentage, it is about a general agreement without formal opposition.

> A possible solution
No. As mentioned before. (How many times did I write that?)

fjay69, August 7, 2019 at 6:57:34 AM UTC

Just a couple of proposals. Just a mindstorm.
First, how about making two types of links? The first type will link sentences with the same meaning and the same names (Tom loves Mary. - Том любит Мэри.), and the second type will link sentences with the same pattern (Tom loves Mary. - Sami loves Layla.).
Second, we have a feature called "Alternative script" (for Japanese, it's used for furigana). So we can add another property, "Sentence pattern". For example: Tom loves Mary - <MaleName> loves <FemaleName>. The pattern can be created automatically based on a dictionary (a dictionary of personal names, countries, cities, etc.). When a user adds a new sentence, Tatoeba can warn them that a sentence with the same pattern already exists.
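
A very rough sketch of that second idea, just to show the mechanics. The dictionary, the `Corpus` class and its `add` method are all invented for this example and are not part of Tatoeba; a real dictionary would have to be far larger and language-specific.

```python
import re

# Hypothetical dictionary mapping known proper nouns to a category.
NAME_DICTIONARY = {
    "Tom": "MaleName",
    "Sami": "MaleName",
    "Mary": "FemaleName",
    "Layla": "FemaleName",
    "Boston": "City",
}

WORD = re.compile(r"\w+")


def sentence_pattern(sentence):
    """Replace every word found in the dictionary with a <Category> placeholder."""
    def replace(match):
        word = match.group(0)
        category = NAME_DICTIONARY.get(word)
        return f"<{category}>" if category else word
    return WORD.sub(replace, sentence)


class Corpus:
    """Toy in-memory corpus indexed by sentence pattern."""

    def __init__(self):
        self._by_pattern = {}

    def add(self, sentence):
        """Store the sentence and return earlier sentences with the same pattern."""
        existing = self._by_pattern.setdefault(sentence_pattern(sentence), [])
        same_pattern = list(existing)
        existing.append(sentence)
        return same_pattern


corpus = Corpus()
print(corpus.add("Tom loves Mary."))    # nothing with this pattern yet
print(corpus.add("Sami loves Layla."))  # the earlier sentence shares the pattern
```

In this sketch the second call returns the first sentence, which is the point where a warning could be shown to the contributor.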

TRANG, August 7, 2019 at 10:14:17 PM UTC

> First, how about making two types of links?

This has been proposed: https://github.com/Tatoeba/tatoeba2/issues/1902

> Second, we have a feature called "Alternative script" (for Japanese, it's used for
> furigana). So we can add another property "Sentence pattern".

That's a possibility, yes, and I'm pretty sure some people have already suggested this on the Wall. Beyond your example of replacing names of people, of countries, of cities, know that there are tools that can identify the whole structure of a sentence. They can tell you what is the subject, what is the verb, what is the object, what is the tense of the verb...

But just so you know, we are actually not lacking ideas on how to improve Tatoeba. If I sat there and told you about all my ideas for Tatoeba, I could talk for 10 days straight. Our eternal problem is that we lack the resources to make them a reality. We don't have an army of developers and designers.

In any case, as you seem motivated to come up with ideas to improve Tatoeba, I recommend that you read this article:
https://github.com/Tatoeba/tato...mitting-issues

This is mostly so you know that there is a place where we document our problems (and our ideas on how to solve them): the issues in GitHub. Feel free to explore these issues and comment on them and share your insights. Feel free as well to create new issues whenever necessary.