When I search for "of the experience" (without quotes), here are some of the results:
The result of the experiment was inconclusive.
Tom did many of the experiments himself.
This is Professor Oak, the director of this experiment.
While there were results containing the word "experience," I don't think the results containing "experiment" should have been included. Is the search algorithm trying to blindly match the first 6 characters of a word?
Tatoeba uses a stem-based search algorithm. Before the comparison starts, search words are reduced to their "stems" by cutting off common suffixes, in this case -ment and -ence (leaving just "experi").
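To make the idea concrete, here is a minimal Python sketch of suffix stripping. It is a toy, not the stemmer Tatoeba actually uses: the suffix list, the minimum stem length, the function names `toy_stem` and `matches`, and the rule that every query word must appear in the sentence are all assumptions made for illustration.

```python
# Toy suffix-stripping stemmer (illustrative only; real stemmers use many more rules).
# Longer suffixes come first so "ments" is tried before "ment" and "s".
SUFFIXES = ("ments", "ment", "ences", "ence", "s")

def toy_stem(word):
    """Cut off the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def matches(query, sentence):
    """Assumed matching rule: every stemmed query word occurs in the stemmed sentence."""
    query_stems = {toy_stem(w) for w in query.split()}
    sentence_stems = {toy_stem(w.strip('.,!?"')) for w in sentence.split()}
    return query_stems <= sentence_stems

print(toy_stem("experience"), toy_stem("experiment"))
print(matches("of the experience", "Tom did many of the experiments himself."))
```

In this sketch "experience", "experiment" and "experiments" all collapse to "experi", which is why a search for one can return sentences containing the others.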
Why does this approach not work for Serbian? If I search for a certain word, the search displays only exact matches and does not include declined or conjugated forms.
While the stemming algorithm for English is more sophisticated than just matching the first 6 characters of a word, no algorithm is perfect, and one is likely to find cases where any of the stemmers that our search engine provides fails to behave as one might expect.
As the wiki page "How to Search for Text" says ( https://en.wiki.tatoeba.org/art...w/text-search# ), our search engine supports stemming for the following languages: German, English, Finnish, French, Italian, Dutch, Portuguese, Russian, Spanish, Swedish and Turkish. It does not support stemming for other languages. Writing a stemmer for a search engine is a nontrivial task. However, you can approximate stemming by using wildcards. For instance, "experim*" would find both "experiment" and "experiments".
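As a rough illustration of how a trailing wildcard narrows the match, here is a small Python sketch using the standard library's `fnmatch` module. It only mimics the idea of the "*" wildcard. It is not how the search engine implements it, and the helper name `wildcard_hits` and the word list are made up for the example.

```python
from fnmatch import fnmatchcase

def wildcard_hits(pattern, words):
    """Return the words that match a shell-style wildcard pattern, ignoring case."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# Hypothetical word list for illustration.
words = ["experiment", "experiments", "experience", "experimental"]
hits = wildcard_hits("experim*", words)
print(hits)
```

Here "experim*" keeps "experiment", "experiments" and "experimental" but leaves out "experience", because the pattern fixes the first seven letters rather than relying on a shared stem.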
I see. Thanks a lot for the suggestion!
Btw, Serbian declension and conjugation are rather similar to Russian. Can the existing algorithm for Russian be modified for Serbian?
Theoretically, yes, someone could do that. Note, however, that we get our search engines and stemmers from a project called Sphinx. For someone to make this change, they'd have to:
- be part of the Sphinx community
- be familiar with both Russian and Serbian
- understand how to do all the configuration work necessary to create a new stemmer
- have a substantial chunk of time available to work on it, including testing
I also note that the list of stemmers offered by the site doesn't seem to change much over time.
I see. Is it necessary to be a programmer to work on it?
Well, you would need to be someone who is comfortable with configuration (writing files with a specific format, and so on).