Hallo!
I noticed that a lot of Tatoeba sentences contain numerals instead of words to express quantity, time, length and so on.
Why is this practice so widespread? It has some really obvious disadvantages for language learners as well as for linguistic research.
What is your opinion about it?
>Why is the practice so widespread?
In many cases, it's the natural way that native speakers write.
Yes, but some Tatoeba policies don't really follow how native speakers write. We always tried to be on the learners' side.
(p.s. I will mass-disadopt all my English sentences soon.)
>> Yes, but some Tatoeba policies don't really follow how
>> native speakers write. We always tried to be on the
>> learners' side.
Really? :o
Well, I was away for a couple of years, but I always had this impression. I can be wrong, though.
By the way, here is an example. Many Italian sentences in Tatoeba use personal pronouns as subjects. Italian is, however, a pro-drop language. This means it's more natural to say "Am a boy" rather than "I am a boy". You can state the subject explicitly and still get a natural sentence only in particular contexts. This context is often absent in Tatoeba.
However, it's obviously easier for a learner to read sentences with explicit personal pronouns, so these sentences should be kept, even if they are NOT (in my opinion) good examples for the Italian language.
> We always tried to be on the learners' side.
I'm afraid this is not true. Tatoeba never was and never will be intended for beginners; however, it can be used as a complementary resource for teachers and students, beginners included.
Tatoeba was never meant for beginners: true. But it was born and meant for learners. But I never said I agree with this direction...
Of course I agree that the reading of numbers, symbols or whatever (see my comment and examples below) should be indicated, be it in a comment, in an audio recording... but I don't agree that one should replace the usual, correct representation of symbols by their phonetic equivalent. As I said, notation is an integral part of a language and should be treated as such.
I support this idea, Pharamp. That's a detail, but an important one for language learners. I think inside Tatoeba it's worth writing numbers as words even where native speakers write otherwise.
The ideas I had to solve this problem retroactively were:
1) Mass-edit the database. (Applicable only for a couple of languages, like Italian, French or English) This violates some rules stated by Trang, because we should always ask the owner first, but it's quicker.
2) Mass-comment sentences containing numbers, inviting their owners to change them. It's technically possible, but it should be done slowly in order to preserve the database. It also requires people to edit their sentences manually, which can be rather dull.
But first, it's essential to gather more opinions. If the community doesn't agree, we obviously won't touch anything at all.
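For either option, the first step would be finding the affected sentences. A minimal sketch of that step, assuming sentences are available as (id, text) pairs; the function name, the pattern and the sample IDs are hypothetical and not part of any Tatoeba tooling:

```python
import re

# A run of decimal digits, optionally continued by ".", "," or ":" plus
# more digits, so that "19.52", "1,952" and "7:50" each count as one match.
DIGIT_RUN = re.compile(r"\d+(?:[.,:]\d+)*")

def sentences_with_numerals(sentences):
    """Yield (sentence_id, text, matches) for every sentence containing digits."""
    for sentence_id, text in sentences:
        matches = DIGIT_RUN.findall(text)
        if matches:
            yield sentence_id, text, matches

# Hypothetical sample data; the IDs are made up for illustration.
sample = [
    (1, "I was born in 1952."),
    (2, "She killed two birds with one stone."),
    (3, "The book costs $19.52."),
]
flagged = list(sentences_with_numerals(sample))
```

This only flags candidates. Whether a flagged sentence should be edited, commented on or left alone would still be up to its owner.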
I doubt that this project would trigger enthusiasm if handled in an administrative way. It may be better if, in future too, the owner of the sentence is given the freedom to decide whether a number in the example is written in words or not. If not, he or she should be encouraged to indicate the spelling of the number in a comment beneath the example sentence. Some of us already do this, choosing between these two possibilities in each case depending on the length of the numeral.
I understand your view, but I'm looking at the problem in a more computational way. Some languages decline numbers. If you have a number, it's more difficult for a machine to get the right case. If you have a word, it's easier for a machine to get the cardinal or ordinal number.
In a future (and maybe only in our dreams) Tatoeba, each sentence should get part-of-speech tagging, and then people should be able to choose whether to display numbers as digits or as words.
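To make the idea concrete, here is a rough sketch of such a display switch. It assumes each sentence is stored as annotated tokens in which a number carries both its digit form and its spelled-out form; the `Token` class and `render` function are hypothetical names, not an existing Tatoeba feature. The spelled-out form is stored per token because, in languages that decline numbers, it depends on the surrounding sentence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    text: str                     # the form as the sentence owner wrote it
    digits: Optional[str] = None  # digit form, set only for number tokens
    words: Optional[str] = None   # spelled-out form, set only for number tokens

def render(tokens, numbers_as="original"):
    """Join tokens, showing numbers as 'digits', 'words' or as 'original'."""
    parts = []
    for tok in tokens:
        if numbers_as == "digits" and tok.digits is not None:
            parts.append(tok.digits)
        elif numbers_as == "words" and tok.words is not None:
            parts.append(tok.words)
        else:
            parts.append(tok.text)
    return "".join(parts)

# Hypothetical annotated sentence; spaces and punctuation live inside tokens.
sentence = [
    Token("He is "),
    Token("20", digits="20", words="twenty"),
    Token(" years old."),
]
as_words = render(sentence, numbers_as="words")
as_digits = render(sentence, numbers_as="digits")
```

With this layout the owner's original spelling is never lost; the reader only picks a view.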
By the way, this project *is* treated in an administrative way. We have guidelines for everything now and a lot of things are restricted, even if we don't like it. I don't personally think of "Please try to express numbers in words" as a heavy imposition :)
Yes, we do have guidelines. But so far, the guidelines have been consistent with each other. The rule you're suggesting is clearly in conflict with other guidelines, like «We want sentences to remain as "raw" as possible».
I believe it's important to keep the entry barrier low, and keeping the rules to a minimum is part of that. The 'raw sentence' rule is the easiest to understand and explain (just write natural sentences, like you would anywhere else).
We already had this discussion months ago.
Some find it funny to translate 5 into 5. I find it completely silly on a service that is supposed to help people acquire linguistic skills.
I chose my side. I use words.
Sorry, I was absent. But thanks for your feedback!
p.s. Could you link it to me?
I don't know how to search the wall's history. After the service crash, I don't know whether it was even kept. Sorry...
I've uploaded the pre-crash dump of the wall messages here: http://tatoeba.org/files/downlo...e-crash.csv.xz
Such things should be avoided.
http://tatoeba.org/eng/sentences/show/3433
http://tatoeba.org/eng/sentences/show/504401
Why? I think they should be encouraged if both ways of writing are natural.
The only problem is with Tatoeba's display, not with the sentences themselves. Tatoeba should group similar sentences, and display only one when there are several alternatives.
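A rough sketch of what such grouping could look like, assuming a tiny hand-written table of English number words; the table and the function names are hypothetical, and a real implementation would need per-language rules:

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny lookup table (English only).
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}
WORD = re.compile(r"[A-Za-z]+")

def normalise(text):
    """Replace known number words with digits to build a grouping key."""
    return WORD.sub(
        lambda m: NUMBER_WORDS.get(m.group(0).lower(), m.group(0)), text
    )

def group_variants(sentences):
    """Group sentences that differ only in how their numbers are written."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[normalise(sentence)].append(sentence)
    return list(groups.values())

groups = group_variants([
    "I have 5 apples.",
    "I have five apples.",
    "I have two dogs.",
])
```

A plain lookup like this would also rewrite the pronoun in "One should know that", so it can only serve as a grouping key for display, never as an automatic edit.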
Do we collect sentences native speakers would say or what they would write? Because if the latter is the case, half of the (Italian) corpus is nonsense.
Well, at least *I* thought we collect such sentences. I can't say anything about the Italian corpus since I don't know a single Italian word, but at least the Russian, Ukrainian and Belarusian subcorpora generally conform to such a definition.
Does "the latter" refer to "what they would write"? I always thought that we collect examples of "written" language in the first place.
Some languages represent only written text (e.g. Literary Chinese, Latin), some represent mostly spoken language (Iraqi Arabic, Cantonese). Most languages are in between, with some sentences representing spoken language and some representing written language. I don't see a problem here.
I thought the way of Tatoeba is to collect anything that could be of any actual use? This seems most reasonable, since language is a pretty complex thing... preventing input, however specific its use, doesn't do the learner well.
In fact, spoken language examples are perhaps of more use to an advanced learner, since they are more difficult to come across and pick up; so neither language register should be discriminated against.
Concerning numbers, I think the practice some people adopted - write the number as it is, but post its wording in a comment below - is usually the best one. Longer numbers, like years, would distract from the rest of the sentence...
> I think the practice some people adopted - write the number as it is, but post its wording in a comment below - is usually the best one.
No it isn't, because there are several ways to write numerals in words in the same language. So there should be a sentence for each way, with a corresponding audio.
Example :
80 in French is :
Quatre-vingt in France and Belgium
Octante or Huitante in Switzerland
...
Translating "80$" from English to "80$" in French is, at best, very silly and is of no help to anybody...
Hello, the master of absolutes.
Evidently you have wonderfully ignored the adverb 'usually', which is not negated by a special example (it is rather uncommon for a number to have more than one word associated with it, isn't it so?).
Mind that we do not translate words, we translate sentences. Those have numbers in most cases written numerically, as it is, though they can also be written with words. The former can be said to represent the tradition in writing, while the latter would be more demonstrative and also possible. Therefore, either is good, since either is well possible and used.
It is not of such grave importance as to rationalise forcing contributors to double their sentences in both fashions, though, since the rules of number creation are quite finite, generally, and so can quickly be learned by a fresh beginner.
It would certainly enrich the project to have all possible variations of a given sentence, but this tedious doubling is a lot of work which people can use to make other useful sentences instead. We hardly have enough contributors at the moment that assigning mundane tasks to people en masse would be the optimal way.
What's the big problem with comments as a partial supplementary measure, anyway?
I do agree on the subject of audio, but lack of it is not due to people writing things the way they do. (Sentences that do have audio are, I'm inclined to agree, better written in words, so that one can see the relationship between the writing and the pronunciation.)
In fact, even writing the number out properly won't always help one get accustomed to the common pronunciation, since numbers, being among the commonest words in speech, are often pronounced very 'carelessly' with a lot of consonant and vowel reductions.
Concluding, I'm not actually against wordifying numbers myself - I think I do that more often than using math notation - but I don't think it's worth putting the strain of necessity on everyone. Posting the clarification in comments is an okay measure for now until some automated way can be introduced.
Thank you for having read this far!
PS. I understand you're trying to be concise, but it might be more polite to formulate your opinions as opinions and not facts even if you consider them factually true. It would give one the feeling that you are actually interested in hearing others' points...
A very fine nuance indeed between "factually true opinions" and facts. It obviously sounds essential to the debate...
The fact that Tatoeba holds but one audio per sentence is not an opinion at all, even a very factual one. It's just a fact.
And it is also a fact that hearing "Huitante" while learning French in Switzerland won't help you at all to understand what "quatre-vingt" means.
It might be that there are few instances of languages where the same numerals have different writings, although you can't know that because there are 6,000+ languages in the world and some are very surprising when it comes to numerals...
As a matter of fact, any word, including a numeral, can modify or be modified by the preceding or the following word in a sentence. I know that this phenomenon occurs in many languages. When a numeral is modified in such a way, writing it in digits doesn't help at all in learning its pronunciation.
Illustration in French :
20 ans
20 mètres
You would think that 20, in both cases, is pronounced the same way if you're a native English speaker.
But you would be very wrong ! And in some cases, you wouldn't be understood at all if you insisted on pronouncing it the same way.
French is a kind of "meta-language", i.e. it is a language that you cannot pronounce if you can't write its words properly, because it was derived from other languages by learned people. Hence the emphasis on orthography in French.
So knowing that 20 is written as "vingt" with a final t is ESSENTIAL to its pronunciation.
When you pronounce "20 ans" without knowing that 20, as a word, ends with a t, and so without making the necessary liaison with "ans", I won't understand what you say at all... even if I try very hard and although I'm very much used to hearing broken French from various foreigners, because I will believe you're trying to say something else that vaguely sounds like "vin-an" and which relates to nothing in my mind...
BTW, what about years? Is it better to write them with digits or in words?
As for Latin, I usually add both, because I think both structures (ie, using the Roman numerals and words) are worth adding.
http://tatoeba.org/eng/sentences/show/1338465
At least in Portuguese, writing them in words would look weird and unusual. I'd rather add a comment saying how to read them.
Hello!
Frankly, Pharamp, this is quite shocking. I am strongly against “solving” this non-problem.
As long as natives do write numbers using digits and words in real life, I see no reason not to allow both. Languages are not that consistent, and Tatoeba should not be either. Researchers and learners should embrace languages as they actually are, and not change them in order to match their own specific needs.
Furthermore, there are plenty of contexts where numbers are almost always written with digits, or almost always written in words. You just can’t mass-edit one way or the other for the sake of consistency. Of course every writing is possible, but conventions exist and should be reflected in Tatoeba’s content. Just to name a few: I was born in 1952. There are 1,952 sentences. The book costs $19.52. I got a 404 error. My phone number is 0123456789. I’ve got a 32-bit processor, and a Nintendo 64. The next train arrives at 7:50 p.m. She killed two birds with one stone. One should know that. Remember that two wrongs don’t make a right. Give me five! He can talk French twenty to the dozen. Le weekend du 15 août. Je me suis mis sur mon trente-et-un. Appliquons la règle de trois. Vingt-deux, voilà les flics. C’est trois fois rien. Mille mercis. Les mille et une merveilles du monde. (Now I’ve got to add all these to Tatoeba.)
To me, the reading problem that others mentioned is a different problem that should be solved with a different solution. Like adding audio, or adding readings (like Japanese already has, though it’s broken at the moment).
Just put yourself for a minute in the shoes of a learner of French:
If he is in Switzerland, he has been told that 80 is said "Huitante" (or "Octante", depending on the canton...)
He decides to check the pronunciation of "Huitante" on Tatoeba ("huit" is, in itself, already a pronunciation challenge for all speakers of English, Spanish, Portuguese, Russian, Japanese and so on!)... and there, SURPRISE, he hears "katrevin"!?!
What will he get out of it?
I'll tell you plainly: crap.
Not writing numbers out in words for learners is utter stupidity.
Welcome to the real world.
You are not answering the objection.
How does hearing "katrevin" help a learner of French who reads "Huitante"?!?
+100
Besides, it would be nearly impossible to write things like these in words and yet make them look natural:
http://tatoeba.org/eng/sentences/show/2169927
http://tatoeba.org/eng/sentences/show/2945976
http://tatoeba.org/eng/sentences/show/2169906
http://tatoeba.org/eng/sentences/show/2583226
http://tatoeba.org/eng/sentences/show/2583268
http://tatoeba.org/eng/sentences/show/2169864
http://tatoeba.org/eng/sentences/show/2176019
http://tatoeba.org/eng/sentences/show/2176014
http://tatoeba.org/eng/sentences/show/2176011
http://tatoeba.org/eng/sentences/show/2176018
Terminology is a fundamental part of human language.
Hallo Gillux,
I think a lot of people here missed something I wrote that didn't get noticed at all because of the structure of this wall. But well, there's no point in going on with this issue.
Cheers.