Thread #32136 - Tatoeba

I know that it has been argued previously that it is best to try to use a small, more or less universal, set of personal names and placenames for the Tatoeba project, whilst others have preferred to use locally known or historically significant names from their own culture or country. While I don't wish to privilege some specific names above others, and a person's attachment to locally significant names is certainly understandable, perhaps even laudable, there are very practical advantages to consistently using a small set of names.

As an example, the names Sofia, Adamo and Lidia and the town of Bjalistoko are culturally relevant to Esperantists because the names are those of Esperanto founder L.L. Zamenhof's children, sadly murdered in Hitler's extermination camps, and the Polish town of Białystok is where Zamenhof lived for many years.

But Esperantists using the Tatoeba site, attempting to hew to some universality, will generally use the names Tomo or Toĉjo, Maria or Manjo, Johano or Joĉjo, which are regularly formed variations of the English names Tom, Mary and John, and Bostono, the Esperanto proper noun corresponding to Boston.

Similarly, there are traditional Arabic or Berber forms of names that are closely related to the names Tom, Mary and John, and the name Boston can surely be found on an Arabic-language world map. In my opinion there should be no problem with using these consistently in order to prevent the near-duplications that arise when we pick a different set of names like Sami, Layla, or Mennad, etc. Here are my suggestions; I wonder what others think:

Tom توماس
Mary مريم
John يوحنا
Boston بوسطن

User55521, July 2, 2019 at 1:10:41 PM UTC, edited July 2, 2019 at 1:36:15 PM UTC

> there are very practical advantages to consistently
> using a small set of names

There are also practical advantages to *not* using them:

1. Names are words that can be translated. E.g. Russian names are translated in Belarusian (Елизавета/Yelizavieta becomes Лізавета/Lizavieta), while English names aren't (Elizabeth becomes Элізабэт/Elizabet; unless she's a queen, then she's Лізавета/Lizavieta). You're losing a great deal of information by forcing all names to be 'Tom' and 'Mary'.

2. Names are words that follow some language rules. Names are declined differently (e.g. Russian declines 'Darwin' and 'Pushkin' differently), spelt differently (e.g. Lithuanian surnames like Чюрлёнис 'Čiurlionis' are the only case when ю, я can follow ч in Russian), capitalised differently, get different suffixes and prefixes, etc. Again, all this is often lost during the unification.

Unifying names will make Tatoeba less useful. E.g. imagine creating a language detector. If you replace Shymkent with Boston, Tatoeba will have no examples of Шы in Russian, so a detector based on Tatoeba data will likely misdetect sentences about Shymkent as non-Russian.
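A minimal Python sketch of the detector argument, assuming a detector that relies on character bigrams seen in its training data. The two Russian sentences are made up for illustration and are not taken from the Tatoeba corpus; the helper names are hypothetical as well.

```python
def char_bigrams(text):
    """Return the set of adjacent character pairs in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def corpus_bigrams(sentences):
    """Collect every character bigram that occurs anywhere in the corpus."""
    seen = set()
    for sentence in sentences:
        seen |= char_bigrams(sentence)
    return seen


# Hypothetical mini-corpus (illustrative sentences, not real Tatoeba data).
original = ["Я живу в Шымкенте.", "Шымкент находится в Казахстане."]

# The same corpus after "unifying" the place name to Boston.
unified = [
    s.replace("Шымкенте", "Бостоне").replace("Шымкент", "Бостон")
    for s in original
]

has_before = "Шы" in corpus_bigrams(original)
has_after = "Шы" in corpus_bigrams(unified)
print(has_before, has_after)
```

In this toy corpus the bigram Шы is present before the replacement and absent after it, so anything trained only on the unified version would never have seen it.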

If we unify names, then why not unify other words too? Let's unify all actions to 'eat' and all objects to 'apple'. We'd have a lot of sentences like 'I don't mind eating an apple', 'I've never eaten an apple' and 'I'd like to eat an apple', etc. Surely this would reduce a lot of duplication!

This is reductio ad absurdum, but I believe it's not really different from the name unification. *Of course* we will get less duplication if we discard a lot of information! But that's hardly practical.

Shishir, July 2, 2019 at 7:58:36 PM UTC

Ricardo14, July 2, 2019 at 10:25:13 PM UTC

+1,000

soweli_Elepanto, July 2, 2019 at 10:47:22 PM UTC

AlanF_US, July 3, 2019 at 12:22:24 AM UTC

If we add together Shishir, Ricardo14, and soweli_Elepanto's votes, and throw in one for Seael and one for me, we get 1004 votes against restricting vocabulary to "Tom" and "Mary" (and "apple"). I have always thought that the people who favor the restrictions were in quite a small minority (fewer than five, let's say), but an influential one, because at least one of those people has, over the years,
- contributed a huge number of sentences in the best represented and most translated language (English),
- selected sentences with that restricted subset of names when compiling lists for other people to translate, and
- complained regularly when people wrote sentences that included names not in the subset

Just for the record, these restrictions were never a Tatoeba policy, and there's nothing to stop people from writing sentences with other names. (Of course, there's nothing to stop other people from writing or selecting only sentences that feature "Tom", "Mary", "Boston", and "apple".)

seveleu_dubrovnik, July 8, 2019 at 10:14:53 AM UTC

And +1 for me. Concerning East Slavic languages, “Mary” is an awful choice. It has a defective (read: incomplete) declension pattern. The inability to use local names compromises the entire Slavic scheme of syntactic encoding with cases. E.g. “Я описал Мэри” ‘I describe·PAST Mary·ANYCASE’ can mean ‘I described Mary to someone’ or ‘I described someone to Mary’. It should be banned.

Thanuir, July 3, 2019 at 1:04:21 PM UTC

I agree as well.

CK, July 3, 2019 at 1:17:31 AM UTC, edited July 3, 2019 at 1:18:50 AM UTC

My feeling is that if a moderately-competent language learner knows how to make the appropriate changes, then we don't really need a lot of near-duplicate sentences with the only difference being the person's name.

Being a collection of sentences, we're not trying to take the place of textbooks.

Here are a few of the advantages of using a set of wildcards as described on http://bit.ly/tatoebawildcards.

* We get more translations grouped together. That is, instead of "My name is George" being translated into German, "My name is Fred" translated into French, "My name is Dan" translated into Spanish, etc., we can get many languages grouped together with "My name is Tom." This also means that we can see indirectly-translated sentences that have the potential of being directly linked.

* If a Russian contributor contributes a sentence with the equivalent of "Tom", there is a chance that the same basic sentence already exists in another language and these will eventually be linked as people notice indirect translations.

* Using the wildcards also means that the same contributor doesn't accidentally submit basically the same sentence that he/she did earlier with a different name. Tatoeba.org doesn't let the exact same sentence be submitted again.

* For the same reason, if you add a translation to an existing sentence with the "Tom" equivalent already in the database, the sentence you are translating gets linked directly to that sentence, also showing any indirectly-linked translations that may already exist.
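The grouping effect described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence/language pairs are made up to mirror the "My name is George/Fred/Dan" example, the three-letter language codes are used purely as labels, and `group_by_sentence` is a hypothetical helper, not part of Tatoeba.

```python
from collections import defaultdict

# Hypothetical data: (English sentence, language of an existing translation).
with_varied_names = [
    ("My name is George.", "deu"),
    ("My name is Fred.", "fra"),
    ("My name is Dan.", "spa"),
]

# The same translations if every contributor had used the wildcard name.
with_wildcard = [("My name is Tom.", lang) for _, lang in with_varied_names]


def group_by_sentence(pairs):
    """Map each distinct sentence to the set of languages it is translated into."""
    groups = defaultdict(set)
    for sentence, lang in pairs:
        groups[sentence].add(lang)
    return dict(groups)


scattered = group_by_sentence(with_varied_names)
grouped = group_by_sentence(with_wildcard)

print(len(scattered), "sentences with one translation each")
print(len(grouped), "sentence with", len(grouped["My name is Tom."]), "translations")
```

With varied names the three translations hang off three separate sentences; with the wildcard they all attach to a single one.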

User55521, July 3, 2019 at 8:24:13 AM UTC, edited July 3, 2019 at 4:02:44 PM UTC

> My feeling is that if a moderately-competent
> language learner knows how to make the
> appropriate changes

My feeling is that names are often the *sole* examples of some linguistic phenomena.

For example, in English, the digraph ff at the beginning of a word may be written ff (not Ff) even where a capital letter is expected: https://en.wikipedia.org/wiki/R...roness_ffrench

I doubt a moderately-competent language learner knows this. Heck, I doubt many native speakers know this! This is very useful as sample data for all kinds of automatic capitalisation algorithms. But too bad for their authors, because Castle ffrench will be replaced with Boston, and Rose ffrench will be replaced with Mary.

Another example. What's the possessive case of Jesus, Jesus' or Jesus's? I assume I'd use Jesus' about a biblical Jesus, but what about people named Jesus from Spain? I consider myself a relatively competent English speaker, but I have no idea! Having examples with the possessive cases for names ending in -s would certainly help (especially if they had audio, because 's might have a vowel inserted sometimes).

More examples from other languages:

In Russian, there are two ways to say someone's: the genitive case (комната Саши room Sasha's.GEN) and a possessive adjective (Сашина комната Sasha's.ADJ.FEM.SG room). However, adjectives are only formed from common local names, and sound strange with Tom and Mary. We have only 1 example of an adjective from Tom in the whole corpus, #922679. This makes it look like such adjectives are no longer used, so a corpus built around Tom misrepresents Russian grammar.

In Belarusian, some people don't know how to decline many uncommon names. For example, in Belarusian, the genitive case of Зміцер/Źmicier is Змітра/Źmitra, but many people would say Зміцера/Źmiciera. Having diverse names would help people understand the declension of these words.

___________________

This is even worse with languages and cities. Often, some languages have special words for some cities and languages, while others don't. For example,

- Russian has an illogical way of forming words like 'people of <city>' (like Boston > Bostonians), e.g.: Boston - bostonTSY, Moskva - moskvICHI, Minsk - minCHANIE, Varshava - varshavIANIE, Kiev - KievLIANIE, Odessa - odessITY, Arkhangelsk - arkhangelOGORODTSY, etc. If you only keep Boston, you'll miss most of these forms.

- Portuguese treats 'wine of Boston' (vinho de Boston) and 'wine of Porto' (vinho do Porto) as the same construction, while other languages might have a separate word for Porto wine (English Port wine, Russian портвейн), but don't have a word for Boston wine.

- Russian language names can be placed either before the word язык 'language' (французский язык 'French language') or after it (язык хинди 'Hindi language'). Usually you can guess the position from the word form, but not always (коми язык 'Komi language' looks like Hindi, but is placed like French).

- Many languages have special words for '<language>-speaking', e.g. 'Francophone', 'Lusophone', but don't have words for other languages (Hindiphone??? Komiphone???)

___________________

All those things are not something we can expect from a 'moderately-competent language learner'.

Aiji, July 3, 2019 at 2:19:13 PM UTC

Long story short (although the long story is important), some people want to bend the source to fit their tools, instead of adapting their tools. That's not new.

Placeholders inside the source are not good for corpora, whereas placeholders inside tools are awesome and beautiful (because seeing inflections and the like change live is cool).

In many cases, redundancy does not equal waste. At a few percent of the corpus, it is quite negligible in my opinion.
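One way to read "placeholders inside tools" is that the corpus keeps its varied names and a downstream tool substitutes a placeholder only when it needs to compare sentences. The Python sketch below assumes that reading; the name list, the `{NAME}` marker, and the `template_key` helper are all hypothetical, and the approach only covers uninflected names such as those in English.

```python
import re

# Hypothetical name list; a real tool would need per-language, inflected forms.
NAMES = ["George", "Fred", "Dan", "Tom"]
NAME_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, NAMES)) + r")\b")


def template_key(sentence):
    """Replace any known name with a placeholder, leaving the corpus untouched."""
    return NAME_PATTERN.sub("{NAME}", sentence)


# The source sentences stay exactly as contributors wrote them.
corpus = ["My name is George.", "My name is Fred.", "Dan likes apples."]

# The tool groups near-duplicates by their placeholder form.
groups = {}
for sentence in corpus:
    groups.setdefault(template_key(sentence), []).append(sentence)

for key, members in groups.items():
    print(key, "<-", members)
```

The near-duplicates end up in one group for the tool's purposes while every original name remains in the data. Languages with case endings would need more than a flat name list, which is the kind of information the thread argues should not be discarded from the source.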

ssvb, March 30, 2023 at 1:13:04 PM UTC

The name "Том" (Tom) may also be ambiguous in Belarusian in some contexts. The English word "volume" (a book forming part of a work or series) is translated into Belarusian as "том". If it is capitalized at the beginning of a sentence, we may get "Том зваліўся з кніжнай паліцы", which translates into "Tom (or possibly a volume) fell off a bookshelf".

The plural form of "mayor" (the head of a town) is translated into Belarusian as "мэры". This may also be confused with the name "Мэры" (Mary) in a sentence like "Мэры ды Томы ў нашых гарадах адсутнічаюць", which translates into "There are no Marys (or possibly mayors) and Toms in our cities".

Would it be wrong to use the words "Thomas" and "Maria" instead of "Tom" and "Mary" on Tatoeba?

shekitten, March 30, 2023 at 5:17:51 PM UTC

soweli_Elepanto already uses Tomaso in Esperanto, and Latin contributors generally use Thomas or Didymus, so I think it's no problem to translate "Tom" as "Thomas" (or some other diminutive of Thomas, if one exists in Belarusian).

Pfirsichbaeumchen, March 30, 2023 at 8:51:40 PM UTC

You can also use Фама and Марыя, or whatever would seem good Belarusian variants thereof.

ssvb, March 30, 2023 at 9:30:08 PM UTC

It's not a big problem either way and we can survive with Tom and Mary just fine. Similar to how having Victor as one of the Tatoeba character names wouldn't be a problem for English either. It's almost never actually ambiguous in reality and only language learners may be confused sometimes (if the sentence is intentionally constructed to be tricky).

Thanuir, March 31, 2023 at 8:20:20 AM UTC

Hopefully, in the case of these sentences, the many different interpretations will show up in the translations into other languages.

soridsolid, April 29, 2023 at 1:39:36 AM UTC

I disagree with the sentiment that we should move away from having these wildcards. The examples brought up by @impersonator and @ssvb are not exclusive to those languages. Most, if not all, languages have many irregularities, too many to list all of them here.

My point is: you can't spoon-feed every single irregularity to language learners or to programmers using this as a database for their language-learning applications. At some point the language learner will have read or practiced thousands of Tatoeba sentences and should be ready, if they weren't already doing so, to get out there and familiarize themselves with their target language in the wild. Those irregular forms and peculiarities (e.g. Russian demonyms) will be stumbled across by the learner and internalized eventually. There's absolutely no need to stuff this website with all kinds of irregular variations and declensions of mere names. Hence @ssvb's example about Tom meaning "volume" in Belarusian is a bad one. Even in languages closely related to English, such as German and Dutch, you will find English names, place names, etc. that coincide in spelling with an existing word of a different meaning. "Tom" in Danish means "empty", and even in English "tom" (not capitalized) has several meanings as a noun. For some languages the problematic names are indeed "Tom", "Mary" and "John", so the only solution would be to find popular names that don't have similar-sounding words in any language, which wouldn't be an easy task, but be my guest.

Cangarejo, April 29, 2023 at 3:34:34 PM UTC, edited April 29, 2023 at 5:06:41 PM UTC

What would you do with currently existing sentences?

Also, when a sentence or translation is added to Tatoeba, is anyone keeping track of how much it contributes to the corpus in terms of words, expressions, meanings, concepts, or messages? Not that I’m implying someone should.

What is the official purpose of Tatoeba?
