DostKaplan 2018-01-07 06:34

When I search for "of the experience" (without quotes), here are some of the results:

The result of the experiment was inconclusive.
Tom did many of the experiments himself.
This is Professor Oak, the director of this experiment.

While there were results containing the word "experience," I don't think the results containing "experiment" should have been included. Is the search algorithm trying to blindly match the first 6 characters of a word?

brauchinet 2018-01-07 08:34

Tatoeba uses a stem-based search algorithm. Before the comparison starts, search words are reduced to their "stems" by cutting off common suffixes, in this case -ment and -ence, which leaves just "experi" for both words.
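To illustrate the idea, here is a minimal Python sketch of suffix stripping. It is a toy, not the stemmer the site actually runs: the suffix list, the minimum stem length, and the names `toy_stem` and `matches` are all assumptions made up for this example.

```python
import re

# Hypothetical suffix list, chosen only to reproduce the example in this thread.
# Longer suffixes come first so "ments" is tried before "ment" and "s".
SUFFIXES = ("ments", "ment", "ences", "ence", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def matches(query: str, sentence: str) -> bool:
    """True if every stemmed query word appears among the stemmed sentence words."""
    tokenize = lambda text: re.findall(r"[a-z']+", text.lower())
    sentence_stems = {toy_stem(w) for w in tokenize(sentence)}
    return all(toy_stem(w) in sentence_stems for w in tokenize(query))

print(toy_stem("experiment"), toy_stem("experience"))
print(matches("of the experience", "The result of the experiment was inconclusive."))
```

Because both words collapse to the same stem, a query for "experience" also hits sentences that contain "experiment", which is the behavior described in the original question.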

Selena777 2018-01-07 16:14

Why doesn't this approach work for Serbian? If I search for a certain word, the search displays only exact matches and does not include declined or conjugated forms.

AlanF_US 2018-01-07 18:17, edited 2018-01-07 18:22

While the stemming algorithm for English is more sophisticated than just matching the first 6 characters of a word, no algorithm is perfect, and one is likely to find cases where any of the stemmers that our search engine provides fails to behave as one might expect.

As the wiki page "How to Search for Text" says ( https://en.wiki.tatoeba.org/art...w/text-search# ), our search engine supports stemming for the following languages: German, English, Finnish, French, Italian, Dutch, Portuguese, Russian, Spanish, Swedish and Turkish. Stemming is not available for other languages. Writing a stemmer for a search engine is a nontrivial task. However, you can approximate stemming by using wildcards. For instance, "experim*" would find both "experiment" and "experiments".
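As a rough illustration of what such a wildcard does, the Python sketch below filters a word list with a shell-style pattern. It only mimics the effect on individual words; it says nothing about how the search engine implements wildcards, and the helper name `wildcard_hits` and the word list are made up for this example.

```python
from fnmatch import fnmatchcase

def wildcard_hits(pattern, words):
    """Return the words that match a shell-style pattern, ignoring case."""
    pattern = pattern.lower()
    return [w for w in words if fnmatchcase(w.lower(), pattern)]

words = ["experiment", "experiments", "experience", "expert"]
print(wildcard_hits("experim*", words))  # ['experiment', 'experiments']
```

Unlike stemming, the wildcard keeps "experience" out of the results, because the pattern requires the letters "experim" at the start of the word.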

Selena777 2018-01-07 18:50

I see. Thanks a lot for the suggestion!
Btw, Serbian declension and conjugation are rather similar to Russian. Can the existing algorithm for Russian be modified for Serbian?

AlanF_US 2018-01-08 22:47

Theoretically, yes, someone could do that. Note, however, that we get our search engines and stemmers from a project called Sphinx. For someone to make this change, they'd have to:

- be part of the Sphinx community
- be familiar with both Russian and Serbian
- understand how to do all the configuration work necessary to create a new stemmer
- have a substantial chunk of time available to work on it, including testing

I also note that the list of stemmers offered by the site doesn't seem to change much over time.

Selena777 2018-01-09 08:26

I see. Is it necessary to be a programmer to work on it?

AlanF_US 2018-01-13 19:34

Well, you would need to be someone who is comfortable with configuration (writing files with a specific format, and so on).