Thread #28818 - Tatoeba

When I search for "of the experience" (without quotes), here are some of the results:

The result of the experiment was inconclusive.
Tom did many of the experiments himself.
This is Professor Oak, the director of this experiment.

While there were results containing the word "experience," I don't think the results containing "experiment" should have been included. Is the search algorithm trying to blindly match the first 6 characters of a word?

brauchinet, January 7, 2018 at 8:34:08 AM UTC

Tatoeba uses a stem-based search algorithm. Before the comparison starts, search words are reduced to their "stems" by cutting off common suffixes, in this case -ment and -ence (leaving just "experi").
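
To make the suffix-cutting idea concrete, here is a minimal sketch in Python. It is a toy illustration only, not the stemmer Tatoeba's search engine actually uses; the function name `toy_stem` and the tiny suffix list are assumptions made up for this example.

```python
# Toy suffix-stripping stemmer (illustration only; NOT the real search engine's code).
# The suffix list is a hypothetical, deliberately tiny sample; longer suffixes come first.
SUFFIXES = ("ments", "ences", "ment", "ence", "s")


def toy_stem(word: str) -> str:
    """Lowercase the word and cut off the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "is" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Both words collapse to the same stem, so a stem-based search treats them as a match.
print(toy_stem("experience"))
print(toy_stem("experiment"))
print(toy_stem("experiments"))
```

Because "experience", "experiment" and "experiments" all end up with the same stem in this sketch, a search that compares stems rather than whole words cannot tell them apart.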

Selena777, January 7, 2018 at 4:14:12 PM UTC

Why doesn't this approach work for Serbian? If I search for a certain word, the search displays only exact matches, without any declined or conjugated forms.

AlanF_US, January 7, 2018 at 6:17:01 PM UTC, edited January 7, 2018 at 6:22:04 PM UTC

While the stemming algorithm for English is more sophisticated than just matching the first 6 characters of a word, no algorithm is perfect, and one is likely to find cases where any of the stemmers that our search engine provides fails to behave as one might expect.

As the wiki page "How to Search for Text" says ( https://en.wiki.tatoeba.org/art...w/text-search# ), our search engine supports stemming for the following languages: German, English, Finnish, French, Italian, Dutch, Portuguese, Russian, Spanish, Swedish and Turkish. It does not support stemming for other languages, because writing a stemmer for a search engine is a nontrivial task. However, you can approximate stemming by using wildcards. For instance, "experim*" would find both "experiment" and "experiments".
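
As a rough illustration of how a trailing wildcard behaves, the sketch below uses Python's standard `fnmatch` module to match a pattern against single words. This is an assumption-laden stand-in: the search engine's own wildcard handling may differ in its details, and the helper name `wildcard_hits` and the word list are invented for the example.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match a shell-style wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


# Hypothetical word list, chosen to mirror the example in this thread.
words = ["experiment", "experiments", "experience", "experimental"]

# "experim*" keeps every word that starts with "experim" and drops "experience".
hits = wildcard_hits("experim*", words)
print(hits)
```

Unlike a stemmer, the wildcard leaves the choice of prefix to the searcher, so it also works for languages without a stemmer, at the cost of sometimes matching unrelated words that share the prefix.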

Selena777, January 7, 2018 at 6:50:10 PM UTC

I see. Thanks a lot for the suggestion!
Btw, Serbian declension and conjugation are rather similar to Russian. Can the existing algorithm for Russian be modified for Serbian?

AlanF_US, January 8, 2018 at 10:47:09 PM UTC

Theoretically, yes, someone could do that. Note, however, that we get our search engine and its stemmers from a project called Sphinx. For someone to make this change, they'd have to:

- be part of the Sphinx community
- be familiar with both Russian and Serbian
- understand how to do all the configuration work necessary to create a new stemmer
- have a substantial chunk of time available to work on it, including testing

I also note that the list of stemmers offered by the site doesn't seem to change much over time.

Selena777, January 9, 2018 at 8:26:48 AM UTC

I see. Is it necessary to be a programmer to work on it?

AlanF_US, January 13, 2018 at 7:34:47 PM UTC

Well, you would need to be someone who is comfortable with configuration (writing files with a specific format, and so on).
