Thread #29404 - Tatoeba

Can Tatoeba's search engine just ignore Arabic diacritics? Because the search engine takes Arabic diacritics into account, countless sentences with diacritics are left out of the search results when a word is searched without diacritics, and vice versa: when a word is searched with diacritics, countless sentences that contain the same word without diacritics are left out of the search results.

For example, when I search the word "الشعب" (without the 'shadda' diacritic that indicates gemination), this sentence is ignored in the search results:

https://tatoeba.org/eng/sentences/show/668466

This is a serious flaw on Tatoeba as far as the Arabic language is concerned. Because I will soon be adding literally tens of thousands of Arabic sentences to this website (almost exclusively translations of my own English sentences), I would love to see this issue resolved as soon as possible.

Thank you

Guybrush88, July 3, 2018 at 10:09:38 AM UTC

I reported this on the bug tracker: https://github.com/Tatoeba/tatoeba2/issues/1595

OsoHombre, July 3, 2018 at 10:37:48 AM UTC

@Guybrush88

Thank you very much for doing this. Indeed, Arabic is one of the most important languages in the world. It is the official language of more than 20 countries, and one of the official languages of the UN. Yet it is very neglected on the Internet. It is absolutely necessary to solve this problem at least on this website that could, some day, become a valuable resource for Arabic-language learners and professionals.

gillux, July 3, 2018 at 2:53:11 PM UTC

Thank you for reporting this, OsoHombre. Yes, it’s possible to tweak the search engine to ignore Arabic diacritics. But we need your help because we don’t know Arabic, so we don’t know exactly what we’re doing when we try to make such changes.

First of all, you need to be aware that if we make the search engine ignore diacritics, visitors won’t be able to look up only "الشعب" or only "الشّعْبُ", because these words would be treated as if they were the same word. In other words, searching for either of them would produce the same results, and these results would include sentences with both spellings. So we need to make sure that such a change doesn’t get in the way of the many people who would like to look up only "الشّعْبُ", for example. Now, if you’re sure that ignoring diacritics is fine, we need a complete list of all the diacritics that need to be ignored, and we’ll tweak the search engine accordingly.
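
To make the trade-off concrete, here is a minimal Python sketch of what ignoring every diacritic would do. It is not Tatoeba's actual search code. The helper name `strip_marks` is hypothetical, and using the Unicode category "Mn" (nonspacing mark) as the definition of a diacritic is an assumption, since the thread has not yet agreed on a list.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Remove every nonspacing mark (Unicode category Mn) from text.

    Assumption: "diacritic" means "category Mn". That is broader than any
    list the thread has agreed on, and it also removes the shadda.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The two spellings from the post, written as escapes to avoid copy errors.
plain = "\u0627\u0644\u0634\u0639\u0628"                      # no diacritics
marked = "\u0627\u0644\u0634\u0651\u0639\u0652\u0628\u064F"   # shadda, sukun, damma

# Both spellings collapse to the same search key.
print(strip_marks(plain) == strip_marks(marked))  # True
```

Once both spellings share a key, nobody can search for only one of them, which is exactly the loss described above.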

odexed, July 3, 2018 at 3:06:34 PM UTC, edited July 3, 2018 at 3:09:00 PM UTC

@OsoHombre @gillux
As a learner of Arabic, I would like the search engine to distinguish at least the shadda diacritic ّ, because the meaning is quite different: خرج means "to leave", while خرّج means "to make someone go out". The best solution to this problem would be to make the search engine adjustable, so that anyone could decide whether to distinguish it or not.
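
A rough Python sketch of such an adjustable folding follows. It is only an illustration: the function name `search_key`, the `keep_shadda` switch, and the choice of U+064B..U+0652 as "the diacritics" are all assumptions, not anything the search engine provides today.

```python
SHADDA = "\u0651"

# Assumption: the marks to fold are U+064B (fathatan) through U+0652 (sukun).
MARKS = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}

def search_key(word: str, keep_shadda: bool = True) -> str:
    """Drop the short-vowel marks; keep the shadda unless told otherwise."""
    drop = MARKS - {SHADDA} if keep_shadda else MARKS
    return "".join(ch for ch in word if ch not in drop)

kharaja = "\u062E\u0631\u062C"         # three letters, no shadda
kharraja = "\u062E\u0631\u0651\u062C"  # same letters with a shadda on the second

print(search_key(kharaja) == search_key(kharraja))         # False: still distinct
print(search_key(kharaja) == search_key(kharraja, False))  # True: shadda folded away
```

With the default setting the two verbs stay distinct, while fully vocalized and unvocalized spellings of the same verb still match.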

gillux, July 3, 2018 at 8:28:49 PM UTC

Thank you odexed. Your comment suggests that we shouldn’t blindly ignore all the diacritics.

> The best solution to this problem would be to make the search engine adjustable

I agree, but it’s not possible at the moment.

AlanF_US, July 4, 2018 at 1:56:30 AM UTC

I think handling Arabic better could be rather complicated, and might need to be done in several stages. It would be helpful if Arabic speakers could familiarize themselves with this Wikipedia page:

https://en.wikipedia.org/wiki/A...ipt_in_Unicode

and think about which items they'd like to (a) ignore or (b) treat the same.

For instance, the character shadda was mentioned above. If you search for "shadda" on that page, you'll see that its Unicode code point is U+0651. Apparently, at least some people do not think that it should be ignored in searches.

What about U+060C (Arabic comma): ،‬ ?
Or U+061F (Arabic question mark): ؟

Should they be ignored?

Also, should different forms of a letter that appear in the same row under "Contextual forms" be treated the same? For example, one row has different forms of bāʾ (0628, FE8F, FE90, FE92, FE91):

ب
ﺏ
ـب
ـبـ
بـ

Should those be treated the same when it comes to searches?
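
For readers who want to see what "treated the same" could look like, here is a small Python sketch. It relies on Unicode NFKC normalization, which maps the presentation-form code points (FE8F, FE90, FE92, FE91) back to the base letter U+0628. Whether Tatoeba's search engine would use normalization or an explicit mapping table is not stated in this thread, so treat the helper `fold_presentation_forms` as hypothetical.

```python
import unicodedata

def fold_presentation_forms(text: str) -> str:
    """Map compatibility characters (including Arabic presentation forms)
    to their base letters using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)

# The five code points listed above for the letter ba'.
forms = ["\u0628", "\uFE8F", "\uFE90", "\uFE92", "\uFE91"]

for ch in forms:
    folded = fold_presentation_forms(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")
```

Text typed on an ordinary keyboard normally uses only U+0628 and leaves the shaping to the font, so a folding like this mostly matters for text copied from older documents or PDFs.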

OsoHombre, July 4, 2018 at 6:45:40 AM UTC

@ all

Thank you very much for your interest in the issue.

@gillux & @odexed

I think that the shadda (the diacritic sign for gemination) shouldn't be ignored.

@AlanF_US

I had a look at the Wikipedia page but as an old-fashioned translator who knows virtually nothing about programming, I am afraid that there is very little I can do to help you technically.

I think that the punctuation marks (especially the question mark) shouldn't be ignored.

I think that the contextual forms shouldn't be ignored either.

I don't know whether you guys can try to handle this problem with so little information and let testing tell us whether it works well, or whether you need somebody with experience in this field. If you do, I wonder where I could find such a person for you.

gillux, July 4, 2018 at 10:16:10 PM UTC

> I think that the punctuation marks (especially the question mark) shouldn't be ignored.

I’m surprised. If the meaning of the Arabic question mark is the same as in English, it should be ignored. Let me explain in more detail the effect of ignoring a character in the search.

For each sentence, the search engine makes a list of all the words it contains. This way, when you look up a word, it can quickly find all the sentences that include it. This process of extracting all the words contained in a sentence is done by using two lists of characters. One list (let’s call it I) consists of all the characters a word can include. Another list (let’s call it E) consists of all the characters that are not part of any word. List E contains the characters that separate words (like spaces and punctuation), while list I contains the ones that make up words (like letters). Now let’s have a look at a concrete example:

List I = abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
List E = .:;,?!'()[]- (plus the space character)
Sentence = I'm old-fashioned?

In that sentence, the following 4 characters are in list E: '-? (and space).
All the rest is in list I.
Thus, the search engine finds 4 words: I, m, old, fashioned.
It works as expected because people can find the sentence using one of these 4 words.

Now, let’s take the same sentence and move the question mark from list E to list I. The search engine still finds 4 words, but the last one changes: I, m, old, fashioned?
Here, the question mark is treated as a normal character, just like a letter. It’s part of the word. Consequently, if you look up "fashioned", you won’t find the sentence. You have to look up "fashioned?" instead.

When we say "character X should be ignored", it means "character X should be in list E". If you (or anybody else) can use that Wikipedia article as a reference list to sort out which characters should be in list E or list I, we can improve the search of Arabic sentences.
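
The word-extraction process described above can be sketched in a few lines of Python. This is an illustration of the idea, not the search engine's real code. The function name `extract_words` is made up, and the sketch assumes that every character missing from list I behaves as if it were in list E.

```python
import string

def extract_words(sentence: str, list_i: set) -> list:
    """Split a sentence into words made only of characters from list I.

    Assumption: any character that is not in list I acts as a separator,
    i.e. it is implicitly in list E.
    """
    words, current = [], []
    for ch in sentence:
        if ch in list_i:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

LIST_I = set(string.ascii_letters)
sentence = "I'm old-fashioned?"

print(extract_words(sentence, LIST_I))          # ['I', 'm', 'old', 'fashioned']
print(extract_words(sentence, LIST_I | {"?"}))  # ['I', 'm', 'old', 'fashioned?']
```

The second call shows why a question mark that is treated as part of words hides the sentence from anyone searching for "fashioned".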

AlanF_US, July 4, 2018 at 11:50:54 PM UTC

This is an excellent description. However, at the risk of making things more complicated, I want to add two notes:

(1) Not all characters that are ignored will split up a word into multiple parts when they occur in the middle of that word. For instance, Hebrew vowels that occur in the middle of a word are ignored (for the purposes of search) without splitting up the word.

(2) We can specify which characters can be treated the same for the purposes of search. For instance, because we do case-insensitive search, we treat "A" and "a" the same, and treat "Á" and "á" the same. For Turkish, we treat "I" and "ı" the same; for Russian, we treat letter pairs such as "Ж" and "ж" the same; and so on. For Hebrew, there are five letters that have a special form at the end of a word, but our search treats them the same way as the non-final form. I know that Arabic has multiple forms of consonants (initial, medial, final, isolated), which seems like a more complicated but analogous situation to the consonants in Hebrew. That is why I suggested that perhaps they should be treated the same for the purposes of search.
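
Both notes can be illustrated with a short Python sketch. It is only a model of the behavior described, not the actual configuration. The function `index_words` is hypothetical, "ignored without splitting" is approximated by dropping Unicode nonspacing marks (category Mn), and the Turkish rule is left out because Python's default lowercasing does not implement it.

```python
import unicodedata

# Hebrew final letters mapped to their regular forms (kaf, mem, nun, pe, tsadi).
HEBREW_FINALS = {
    "\u05DA": "\u05DB", "\u05DD": "\u05DE", "\u05DF": "\u05E0",
    "\u05E3": "\u05E4", "\u05E5": "\u05E6",
}

def index_words(sentence: str) -> list:
    """Return the search keys for a sentence.

    Three kinds of characters are handled differently:
    - nonspacing marks are dropped without splitting the word (note 1),
    - letters are mapped to a canonical form (note 2),
    - everything else that is not alphanumeric separates words.
    """
    out = []
    for ch in sentence.lower():
        if unicodedata.category(ch) == "Mn":
            continue                      # ignored, but the word stays whole
        ch = HEBREW_FINALS.get(ch, ch)    # treat final and regular forms alike
        out.append(ch if ch.isalnum() else " ")
    return "".join(out).split()

# "shalom" written with vowel points and a final mem.
shalom = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
print(index_words(shalom) == ["\u05E9\u05DC\u05D5\u05DE"])  # True
print(index_words("\u0416-\u0436"))  # two identical one-letter keys
```

The same three-way split (drop, map, separate) is the decision that would have to be made for each Arabic character.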

gillux, July 4, 2018 at 10:25:20 PM UTC, edited July 4, 2018 at 10:29:01 PM UTC

Note that we currently ignore the Arabic question mark as well as other punctuation characters. The complete list of code points currently treated as part of words (not ignored) can be found here: https://github.com/Tatoeba/tato...ll.php#L80-L82

The meaning of a line like:

'U+621..U+63a', 'U+640..U+64a'

is: "treat characters from code point U+621 to U+63A as part of words, then ignore characters from U+63B to U+63F, and then treat characters from U+640 to U+64A as part of words".
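
To check by hand which characters such a line covers, a small Python sketch can expand the ranges. The parser `parse_ranges` is hypothetical and handles only the `U+XXX..U+YYY` and single `U+XXX` shapes shown above. It looks at these two ranges alone, not at the full list in the linked file.

```python
def parse_ranges(specs):
    """Expand entries like 'U+621..U+63a' into a set of code points.

    Assumption: each entry is either 'U+XXX..U+YYY' or a single 'U+XXX'.
    """
    allowed = set()
    for spec in specs:
        start, _, end = spec.partition("..")
        lo = int(start[2:], 16)               # strip the leading "U+"
        hi = int(end[2:], 16) if end else lo
        allowed.update(range(lo, hi + 1))
    return allowed

word_points = parse_ranges(["U+621..U+63a", "U+640..U+64a"])

print(0x0628 in word_points)  # True: the letter ba' is part of words
print(0x063B in word_points)  # False: falls in the gap U+63B..U+63F
print(0x061F in word_points)  # False: the Arabic question mark
print(0x0651 in word_points)  # False: the shadda is outside both ranges
```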

OsoHombre, July 5, 2018 at 10:42:17 AM UTC

@gillux @AlanF_US

Can you try to select what is part of list I (all the letters, regardless of their shape or position in the word) and what is part of list E?

I think that the table under the part "Contextual forms" is the one gillux needs for list I. The table only contains letters and no punctuation marks or diacritics of any sort.

Sorry if I am not being very helpful technically. gillux's explanation was crystal clear, but I don't know if I can help you efficiently. If you need me to find somebody who can help you, please let me know. I'll try and do my best to find you one.
