An issue I predicted a while back has come true: an issue caused by allowing Tatoeba to include both macro-languages and their individual languages, as long as there was an ISO 639-3 code representing them.
Specifically, we have Arabic and, at the same time, Arabic (Gulf), Egyptian Arabic, etc., and, a source of much debate these days, we have Berber while at the same time having Kabyle, Tarifit, etc. Given the way we set up the platform, 'Arabic' sentences are allowed to remain as they are rather than being forced into one of the individual languages (Gulf, Egyptian, etc.), in the same way that Amastan's 'Berber' sentences should be allowed to remain as they are and not be forced to be labeled as 'Kabyle' (or any other language).
I believe that this issue may be resolved with a little bit of programming, where looking up the 'Berber' language, for example, would yield not just sentences with the code BER, but also KAB (Kabyle), RIF (Tarifit), and any other Berber language we may add in the future. This would maintain what user Amastan has wanted all along: a unified Berber language. At the same time, our new Kabyle users could continue to develop their own Kabyle corpus as they see fit.
And the same could be implemented with Arabic: when one searches 'Arabic', they should be able to see all sentences with the codes ARA as well as ACM, AFB, APC, ARQ, ARY, ARZ.
I wonder how difficult it would be to implement something like this.
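To make the idea concrete, here is a rough sketch in Python of what such a lookup could look like. It is not how Tatoeba actually works: the `MACROLANGUAGE_MEMBERS` table, the function names, and the dictionary shape of a sentence are all made up for illustration, and only the language codes come from the post above.

```python
# Hypothetical table: macro-language code -> codes of its individual languages.
# The codes are the ones mentioned in this thread; the table itself is an assumption.
MACROLANGUAGE_MEMBERS = {
    "ara": ["acm", "afb", "apc", "arq", "ary", "arz"],
    "ber": ["kab", "rif"],
}


def expand_language(code):
    """Return the code itself plus any individual languages grouped under it."""
    code = code.lower()
    return [code] + MACROLANGUAGE_MEMBERS.get(code, [])


def search_by_language(sentences, code):
    """Keep the sentences whose language is the searched code or one of its members.

    `sentences` is assumed to be a list of dicts with "lang" and "text" keys.
    """
    wanted = set(expand_language(code))
    return [s for s in sentences if s["lang"] in wanted]
```

With this, searching 'Berber' would call `search_by_language(sentences, "ber")` and return BER, KAB, and RIF sentences together, while searching 'Kabyle' would still return only KAB sentences. Adding a new Berber language later would just mean adding its code to the table.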
Horus could do the following (tentatively, of course): if a sentence in ARA (Arabic) matches any sentence in its sub-languages (ACM, AFB, APC, etc., as above), it could remove the sentence in ARA and keep the sentence in the sub-language (otherwise ARA could in theory contain EVERY sentence in EVERY one of its sub-languages).
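The clean-up step could be sketched roughly like this in Python. Again, this is only an illustration under my own assumptions: I don't know how Horus is actually written, the function name and the sentence format (dicts with "lang" and "text" keys) are hypothetical, and "matches" is taken here to mean an exact text match.

```python
def drop_duplicate_macro_sentences(
    sentences,
    macro="ara",
    members=("acm", "afb", "apc", "arq", "ary", "arz"),
):
    """Remove macro-language sentences whose text also exists in a sub-language.

    Assumes each sentence is a dict with "lang" and "text" keys, and that
    "matching" means the texts are exactly equal (an assumption, not a spec).
    """
    # Collect every text that already exists under one of the sub-languages.
    member_texts = {s["text"] for s in sentences if s["lang"] in members}
    # Keep everything except macro-language sentences duplicated in a sub-language.
    return [
        s
        for s in sentences
        if not (s["lang"] == macro and s["text"] in member_texts)
    ]
```

The same function could be run for Berber with `macro="ber"` and `members=("kab", "rif")`. A real version would probably want to reattach translation links to the surviving sentence before deleting anything.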

I'm not very familiar with the situation with Berber languages, but as a learner of Modern Standard Arabic I think it's not a good solution. The dialects are quite different: they have different sentence structures, totally different grammar, and somewhat different pronunciation of letters and words. They may share the same vocabulary, but you may find the same words used with slightly different meanings, as well as different words for the same thing. For example, if I'm looking for Arabic translations of "tomorrow" I don't want to get "بكرة" instead of "غداً".

'Arabic' and 'Modern Standard Arabic' would be two different things. I actually recommended adding 'Modern Standard Arabic' (ISO 639-3 ARB) some time ago, but the situation hadn't yet gotten to the point where it is now, and most Arabic users seemed to be against it. I still believe it is important to have 'Modern Standard Arabic' added as a separate language, to change the language icon of all sentences in 'Arabic' (ARA) to 'Modern Standard Arabic' (ARB), and to reserve the classification 'Arabic' (ARA) for Arabic sentences that are neither in Modern Standard nor in any other variant that we currently have in Tatoeba (which WOULD mean manually changing back the language icon on some sentences that would have been changed to 'Modern Standard Arabic' in the process).
On the other hand, if you want to study Modern Standard Arabic, then you would look up 'Modern Standard Arabic' and get ONLY sentences under the 'Modern Standard Arabic' label. But looking up 'Arabic' would bring back results of not only 'Modern Standard Arabic', but also 'Egyptian Arabic', 'Iraqi Arabic', etc.

Well, this way I think it wouldn't do any harm, except that it might be confusing for the newbies. Right now, most "Arabic" sentences are indeed MSA sentences. But I'm not sure what the benefits would be. The varieties of Arabic are different enough to qualify as separate languages, unlike the varieties of Spanish.