Tatoeba

DostKaplan, January 7, 2018 at 6:34:07 AM UTC

When I search for "of the experience" (without quotes), here are some of the results:

The result of the experiment was inconclusive.
Tom did many of the experiments himself.
This is Professor Oak, the director of this experiment.

While there were results containing the word "experience," I don't think the results containing "experiment" should have been included. Is the search algorithm trying to blindly match the first 6 characters of a word?

brauchinet, January 7, 2018 at 8:34:08 AM UTC

Tatoeba uses a stem-based search algorithm. Before the comparison starts, search words are reduced to their "stems" by cutting off common suffixes, in this case -ment and -ence (leaving just "experi").
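
To make the idea concrete, here is a minimal sketch of suffix stripping in Python. It is only a toy: the suffix list, the `toy_stem` and `matches` names, and the "every query word must match" rule are assumptions made for illustration, not the rules the real search engine applies.

```python
import re

# Hypothetical, deliberately tiny suffix list. Real stemmers use far more rules.
SUFFIXES = ("ments", "ment", "ences", "ence")


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first listed suffix that fits."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(text: str) -> set[str]:
    """Split text into words and return the set of their toy stems."""
    return {toy_stem(w) for w in re.findall(r"[a-z']+", text.lower())}


def matches(query: str, sentence: str) -> bool:
    """True if every stemmed query word also occurs in the stemmed sentence."""
    return stems(query) <= stems(sentence)


print(toy_stem("experiment"))  # experi
print(toy_stem("experience"))  # experi
print(matches("of the experience", "The result of the experiment was inconclusive."))  # True
```

Because "experiment" and "experience" collapse to the same stem here, a query containing one also matches sentences containing the other, which is the behavior described in the original question.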

Selena777, January 7, 2018 at 4:14:12 PM UTC

Why does this approach not work for Serbian? If I search for a certain word, the search displays only exact matches and does not include declined or conjugated forms.

AlanF_US, January 7, 2018 at 6:17:01 PM UTC, edited January 7, 2018 at 6:22:04 PM UTC

While the stemming algorithm for English is more sophisticated than just matching the first 6 characters of a word, no algorithm is perfect, and any of the stemmers that our search engine provides will sometimes behave in ways one might not expect.

As the wiki page "How to Search for Text" says ( https://en.wiki.tatoeba.org/art...w/text-search# ), our search engine supports stemming for the following languages: German, English, Finnish, French, Italian, Dutch, Portuguese, Russian, Spanish, Swedish and Turkish. Other languages are not supported. Writing a stemmer for a search engine is a nontrivial task. However, you can approximate stemming by using wildcards. For instance, "experim*" would find both "experiment" and "experiments".
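
As a rough illustration of what a trailing wildcard does, here is a small Python sketch that treats `*` as "any run of word characters". The `wildcard_to_regex` and `search` helpers are hypothetical names for this example, and the real search engine may handle wildcards differently.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a query such as 'experim*' into a whole-word regular expression."""
    # Escape the literal characters, then let each '*' stand for any word characters.
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)


def search(pattern: str, sentences: list[str]) -> list[str]:
    """Return the sentences containing a word that matches the wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [s for s in sentences if rx.search(s)]


sentences = [
    "The result of the experiment was inconclusive.",
    "Tom did many of the experiments himself.",
    "It was a great experience.",
]
for hit in search("experim*", sentences):
    print(hit)
# Prints the first two sentences; the "experience" sentence is not matched.
```

The same trick works for a language without a stemmer: pick the part of the word that stays constant across its forms and put `*` after it.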

Selena777, January 7, 2018 at 6:50:10 PM UTC

I see. Thanks a lot for the suggestion!
Btw, Serbian declension and conjugation are rather similar to Russian. Can the existing algorithm for Russian be modified for Serbian?

AlanF_US, January 8, 2018 at 10:47:09 PM UTC

Theoretically, yes, someone could do that. Note, however, that we get our search engines and stemmers from a project called Sphinx. For someone to make this change, they'd have to:

- be part of the Sphinx community
- be familiar with both Russian and Serbian
- understand how to do all the configuration work necessary to create a new stemmer
- have a substantial chunk of time available to work on it, including testing

I also note that the list of stemmers offered by the site doesn't seem to change much over time.

Selena777, January 9, 2018 at 8:26:48 AM UTC

I see. Is it necessary to be a programmer to work on it?

AlanF_US, January 13, 2018 at 7:34:47 PM UTC

Well, you would need to be someone who is comfortable with configuration (writing files with a specific format, and so on).