I know that it has been argued previously that it is best to try to use a small, more or less universal, set of personal names and placenames for the Tatoeba project, whilst others have preferred to use locally known or historically significant names from their own culture or country. While I don't wish to privilege some specific names above others, and a person's attachment to locally significant names is certainly understandable, perhaps even laudable, there are very practical advantages to consistently using a small set of names.
As an example, the names Sofia, Adamo and Lidia and the town of Bjalistoko are culturally relevant to Esperantists because the names are those of Esperanto founder L.L. Zamenhof's children, sadly murdered in Hitler's extermination camps, and the Polish town of Białystok is where Zamenhof lived for many years.
But Esperantists using the Tatoeba site, attempting to hew to some universality, will generally use the names Tomo or Toĉjo, Maria or Manjo, Johano or Joĉjo, which are regularly formed variations of the English names Tom, Mary and John, and Bostono, the Esperanto proper noun corresponding to Boston.
Similarly, there are traditional Arabic or Berber forms of names that are closely related to the names Tom, Mary and John, and the name Boston surely is findable on an Arabic world map. In my opinion there should be no problem with using these consistently in order to prevent the near-duplications that arise when we pick a different set of names like Sami, Layla, or Mennad, etc. Here are my suggestions; I wonder what others think:
> there are very practical advantages to consistently
> using a small set of names
There are also practical advantages to *not* using them:
1. Names are words that can be translated. E.g. Russian names are translated in Belarusian (Елизавета/Yelizavieta becomes Лізавета/Lizavieta), while English names aren't (Elizabeth becomes Элізабэт/Elizabet; unless she's a queen, then she's Лізавета/Lizavieta). You're losing a great deal of information by forcing all names to be 'Tom' and 'Mary'.
2. Names are words that follow some language rules. Names are declined differently (e.g. Russian declines 'Darwin' and 'Pushkin' differently), spelt differently (e.g. Lithuanian surnames like Чюрлёнис 'Čiurlionis' are the only case where ю, я can follow ч in Russian), capitalised differently, get different suffixes and prefixes, etc. Again, all this is often lost during the unification.
Unifying names will make Tatoeba less useful. E.g. imagine creating a language detector. If you replace Shymkent with Boston, Tatoeba will have no examples of Шы in Russian, so a detector based on Tatoeba data will be likely to misdetect sentences about Shymkent as non-Russian.
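To make the language-detector point concrete, here is a minimal sketch in Python. It assumes a detector that relies on character bigrams seen in the training corpus; the two toy corpora and the helper `char_bigrams` are hypothetical and are not real Tatoeba data or any actual detector's code.

```python
from collections import Counter

def char_bigrams(text):
    """Count lowercase character bigrams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        for i in range(len(word) - 1):
            counts[word[i:i + 2]] += 1
    return counts

# Hypothetical toy corpora (illustration only, not real Tatoeba sentences).
diverse = ["Я живу в Шымкенте.", "Том живёт в Бостоне."]
unified = ["Я живу в Бостоне.", "Том живёт в Бостоне."]

diverse_bigrams = char_bigrams(" ".join(diverse))
unified_bigrams = char_bigrams(" ".join(unified))

# The bigram "шы" is only available to a detector trained on the diverse corpus.
print("шы" in diverse_bigrams)
print("шы" in unified_bigrams)
```

In this toy setup, the bigram "шы" is present only in the corpus that kept the original place name, so a model trained on the unified corpus would have no evidence that the sequence occurs in Russian.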
If we unify names, then why not unify other words too? Let's unify all actions to 'eat' and all objects to 'apple'. We'd have a lot of sentences like 'I don't mind eating an apple', 'I've never eaten an apple' and 'I'd like to eat an apple', etc. Surely this would reduce a lot of duplication!
This is reductio ad absurdum, but I believe it's not really different from the name unification. *Of course* we will get less duplication if we discard a lot of information! But that's hardly practical.
If we add together Shishir, Ricardo14, and soweli_Elepanto's votes, and throw in one for Seael and one for me, we get 1004 votes against restricting vocabulary to "Tom" and "Mary" (and "apple"). I have always thought that the people who favor the restrictions were in quite a small minority (fewer than five, let's say), but an influential one, because at least one of those people has, over the years,
- contributed a huge number of sentences in the best represented and most translated language (English),
- selected sentences with that restricted subset of names when compiling lists for other people to translate, and
- complained regularly when people wrote sentences that included names not in the subset
Just for the record, these restrictions were never a Tatoeba policy, and there's nothing to stop people from writing sentences with other names. (Of course, there's nothing to stop other people from writing or selecting only sentences that feature "Tom", "Mary", "Boston", and "apple".)
And +1 for me. Concerning East Slavic languages, “Mary” is an awful choice. It has a defective (read: incomplete) declension pattern. The inability to use local names compromises the entire Slavic scheme of syntactic encoding with cases. E.g. “Я описал Мэри” ‘I describe·PAST Mary·ANYCASE’ can mean either ‘I described Mary to someone’ or ‘I described someone to Mary’. To be banished.
I agree too.
My feeling is that if a moderately-competent language learner knows how to make the appropriate changes, then we don't really need a lot of near-duplicate sentences with the only difference being the person's name.
As a collection of sentences, Tatoeba isn't trying to take the place of textbooks.
Here are a few of the advantages of using a set of wildcards as described on http://bit.ly/tatoebawildcards.
* We get more translations grouped together. That is, instead of "My name is George" being translated into German, "My name is Fred" translated into French, "My name is Dan" translated into Spanish, etc., we can get many languages grouped together with "My name is Tom." This also means that we can see indirectly-translated sentences that have the potential of being directly linked.
* If a Russian contributor contributes a sentence with the equivalent of "Tom", there is a chance that the same basic sentence already exists in another language and these will eventually be linked as people notice indirect translations.
* Using the wildcards also means that the same contributor doesn't accidentally submit basically the same sentence that he/she did earlier with a different name. Tatoeba.org doesn't let the exact same sentence be submitted again.
* For the same reason, if you add a translation to an existing sentence with the "Tom" equivalent already in the database, the sentence you are translating gets linked directly to that sentence, also showing any indirectly-linked translations that may already exist.
> My feeling is that if a moderately-competent
> language learner knows how to make the
> appropriate changes
My feeling is that names are often the *sole* examples of some linguistic phenomena.
For example, in English, the digraph ff at the beginning of a word might be capitalised ff (not Ff): https://en.wikipedia.org/wiki/R...roness_ffrench
I doubt a moderately-competent language learner knows this. Heck, I doubt many native speakers know this! This is very useful as sample data for all kinds of automatic capitalisation algorithms. But too bad for their authors, because Castle ffrench will be replaced with Boston, and Rose ffrench will be replaced with Mary.
Another example. What's the possessive case of Jesus, Jesus' or Jesus's? I assume I'd use Jesus' about a biblical Jesus, but what about people named Jesus from Spain? I consider myself a relatively competent English speaker, but I have no idea! Having examples with the possessive cases for names ending in -s would certainly help (especially if they had audio, because 's might have a vowel inserted sometimes).
More examples from other languages:
In Russian, there are two ways to say someone's: genitive case (комната Саши room Sasha's.GEN) and adjective (Сашина комната Sasha's.ADJ.FEM.SG room). However, adjectives are only formed from common local names, and sound strange with Tom and Mary. We have only 1 example of an adjective from Tom in the whole corpus, #922679. This makes it look like adjectives are no longer used, so a corpus with only Tom misrepresents Russian grammar.
In Belarusian, some people don't know how to decline many uncommon names. For example, in Belarusian, the genitive case of Зміцер/Źmicier is Змітра/Źmitra, but many people would say Зміцера/Źmiciera. Having diverse names would help people understand the declension of these words.
This is even worse with languages and cities. Often, some languages have special words for some cities and languages, while others don't. For example,
- Russian has an illogical way of forming words like 'people of <city>' (like Boston > Bostonians), e.g.: Boston - bostonTSY, Moskva - moskvICHI, Minsk - minCHANIE, Varshava - varshavIANIE, Kiev - KievLIANIE, Odessa - odessITY, Arkhangelsk - arkhangelOGORODTSY, etc. If you only keep Boston, you'll miss most of these forms.
- Portuguese treats 'wine of Boston' (vinho de Boston) and 'wine of Porto' (vinho do Porto) as the same construction, while other languages might have a separate word for Porto wine (English Port wine, Russian портвейн), but don't have a word for Boston wine.
- Russian language names can be placed either before the word язык 'language' (французский язык 'French language') or after it (язык хинди 'Hindi language'). Usually you can guess the position from the word form, but not always (коми язык 'Komi language' looks like Hindi, but is placed like French).
- Many languages have special words for '<language>-speaking', e.g. 'Francophone', 'Lusophone', but don't have words for other languages (Hindiphone??? Komiphone???)
None of these things are something we can expect a 'moderately-competent language learner' to know.
Long story short (although the long story is important), some people want to bend the source to fit their tools, instead of adapting their tools. That's not new.
Placeholders inside the source are not good for corpora, whereas placeholders inside tools are awesome and beautiful (because seeing inflections and such change live is cool).
In many cases, redundancy does not equal waste. At around a few percent, it is quite negligible in my opinion.
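As a rough illustration of what "placeholders inside tools" could look like, here is a minimal Python sketch. It leaves the stored sentences untouched and only substitutes a placeholder when computing a grouping key. The name list `KNOWN_NAMES` and the helpers `grouping_key` and `group_near_duplicates` are hypothetical, not part of Tatoeba; a real tool would need per-language name data and morphology-aware matching for inflected forms.

```python
import re
from collections import defaultdict

# Hypothetical name list, for illustration only.
KNOWN_NAMES = {"Tom", "Mary", "George", "Fred", "Dan"}

# Match any known name as a whole word.
NAME_PATTERN = re.compile(r"\b(" + "|".join(sorted(KNOWN_NAMES)) + r")\b")

def grouping_key(sentence):
    """Replace known names with a placeholder; used for grouping only."""
    return NAME_PATTERN.sub("<NAME>", sentence)

def group_near_duplicates(sentences):
    """Group sentences that differ only in which known name they use."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[grouping_key(sentence)].append(sentence)
    return dict(groups)

sentences = [
    "My name is George.",
    "My name is Fred.",
    "My name is Tom.",
    "Tom likes Mary.",
]
groups = group_near_duplicates(sentences)
for key, members in groups.items():
    print(key, members)
```

The corpus keeps every original name, and the grouping happens in the tool, so the information discussed above is not discarded.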