I have a few words about the discussion ( https://tatoeba.org/eng/wall/sh...#message_32296 ) about increasing the diversity in the corpus by limiting the use of heavily-used names such as Tom and Mary (let's call it "Tom-and-Mary-ing"). This is a subject that people feel passionately about, for various reasons. But Trang is making a real effort to promote a rational, calm, inclusive discussion. She is also seeing whether we can build any kind of consensus rather than splitting into factions. There are two things we can do to help.
The first is to join the discussion in the first place. One of the key potential participants in the discussion has not done this yet, but I hope he will.
The second is to go easy on terms such as "ban", "censor", "outlaw", "police", and "war", whether the presumed object is a person or a word. Anyone who knows Trang's style, even from this discussion alone, can see that she's not a bully and that her intent is not to beat people into submission. She started by laying out a policy aimed at figuring out whether there's a way to encourage people to use a variety of names rather than to punish them for using a small set of names. And even though that was a pretty gentle approach, she backed up to start from an even more neutral place: a discussion of what people think on all sides of the subject. In that spirit, here's my enumeration of the pros and cons of Tom-and-Mary-ing, based on a synthesis of the arguments I've seen and my own ideas:
Tom-and-Mary-ing has these benefits:
- easy to create sentences, either manually or via scripts
- easy to use scripts to process sentences that have already been submitted
- easy to identify and potentially avoid near-duplicates that differ only with respect to names
- Tatoeba "branding"
- a kind of comfortable in-joke for the community (Tom and Mary as characters who engage in a variety of activities, many of them contradictory)
- ADDED: increases the number of indirectly linked short sentences (by eliminating variety in names in these sentences)
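To make the near-duplicate point in the list above concrete, here's a rough Python sketch of how a script might group sentences that differ only in their names. The name set, the `<NAME>` placeholder, and the function names are all my own assumptions for illustration; I'm not claiming this is how any actual Tatoeba tool works.

```python
import re
from collections import defaultdict

# Hypothetical name list; with Tom-and-Mary-ing it can stay this short.
KNOWN_NAMES = {"Tom", "Mary"}

def mask_names(sentence, names=KNOWN_NAMES):
    """Replace each known name (whole words only) with a placeholder."""
    if not names:
        return sentence
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in sorted(names)) + r")\b"
    return re.sub(pattern, "<NAME>", sentence)

def group_near_duplicates(sentences, names=KNOWN_NAMES):
    """Group sentences that become identical once names are masked."""
    groups = defaultdict(list)
    for s in sentences:
        groups[mask_names(s, names)].append(s)
    # Keep only groups that actually contain more than one sentence.
    return [g for g in groups.values() if len(g) > 1]
```

With only two names, the name set is trivial to maintain. With a wider variety of names, the same approach still works, but the list (or whatever replaces it) has to grow accordingly.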
It has these drawbacks:
- elimination of names representative of other cultures, or even of all but a handful of names from English-speaking cultures
- elimination of variety, which can lead to boredom on the part of visitors or community members
- elimination of inflectable names in languages with name inflection, leading to a hole for language learners or automated tools that need to be able to handle these names
- creates a feeling of imbalance
- causes annoyance
- inconvenience or worse for human users of the corpus who want a broader selection of names
- ADDED: will cause poor results when Tatoeba data is used as training data for automated tools (for example, language detectors) that run on real-world data
There are alternative ways to achieve some of the benefits of Tom-and-Mary-ing:
- encouraging the addition of longer, more varied sentences, which are less likely to be near-duplicates
- encouraging people who are generating sentences to select from a longer list of names (for instance, from a baby name site) even if they're using scripts
- encouraging people who are processing submitted sentences to use more sophisticated scripts that do not rely on the identification of only two names
- other kinds of community-building
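As an illustration of the "longer list of names" idea above, here's a minimal Python sketch of a sentence-generating script that draws names at random instead of hard-coding two. The name list, the template syntax, and the function name are hypothetical; a real script would presumably load a much longer list from a file.

```python
import random
from string import Formatter

# Hypothetical, deliberately mixed list; not taken from any real source.
NAMES = ["Tom", "Mary", "Aiko", "Fatima", "Olga", "Kwame", "Lucía", "Ravi"]

def fill_template(template, names=NAMES, rng=random):
    """Fill each distinct {placeholder} in the template with a different name."""
    # Collect the placeholder names used in the template, e.g. {"a", "b"}.
    fields = {field for _, field, _, _ in Formatter().parse(template) if field}
    # Draw as many distinct names as there are distinct placeholders.
    chosen = rng.sample(names, len(fields))
    return template.format(**dict(zip(sorted(fields), chosen)))

# Example: a seeded generator makes a run reproducible.
sentence = fill_template("{a} gave {b} a book.", rng=random.Random(0))
```

A placeholder that appears twice in a template gets the same name both times, so the sentence still refers consistently to the same person.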
Are there items I've missed?