I’ve been playing around with our default search ranking algorithm. I insist on the "default" part because that’s what the vast majority of visitors use. I also focus on searches that do not use double quotes or any special trick. Just plain words. Again because that’s what the vast majority of visitors use.
Our current way of ranking results is pretty basic: it searches for sentences that include all the words (possibly stemmed) and sorts them by total number of words in the sentence.
A problem with this approach is that the order of the words is ignored. The top result of searching for "you go there" is "There you go!" because it’s a shorter sentence than "You may go there."
Ignoring word order is especially catastrophic on languages without word boundaries, like Chinese, because the searched characters are randomly reordered into something totally unrelated. For example, the results for "可不可" in Chinese are cluttered by irrelevant "不可something". Same for kana words in Japanese.
In order to address this problem, I tentatively tweaked the default ranking algorithm on https://dev.tatoeba.org/ into something that prioritizes, in the following order:
1. sentences that contain an exact match (as if searching for ="you go there")
2. sentences having the "longest common subsequence" (LCS)
3. sentences having the least number of words
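To make the three criteria concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the function names are made up, there is no stemming, the tokenizer is naive, and the LCS is a textbook word-level longest common subsequence, which may differ from what the search engine on dev.tatoeba.org actually computes.

```python
import re

def tokenize(text):
    """Lowercase and split into words (simplification: no stemming)."""
    return re.findall(r"\w+", text.lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def contains_phrase(words, query):
    """True if the query words appear contiguously and in order."""
    n = len(query)
    return any(words[i:i + n] == query for i in range(len(words) - n + 1))

def rank(query, sentences):
    """Sort by: exact phrase first, then longer LCS, then fewer words."""
    q = tokenize(query)

    def key(sentence):
        w = tokenize(sentence)
        return (not contains_phrase(w, q), -lcs_length(q, w), len(w))

    return sorted(sentences, key=key)

results = rank("you go there",
               ["There you go!", "You may go there.", "Will you go there?"])
```

With this ordering, "There you go!" no longer wins just because it is short: it only shares a two-word subsequence with the query, while "You may go there." shares all three words in order.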
However, I don’t know if this new ranking suits everyone out there. What do you think?
You can compare the search results on https://tatoeba.org/ (old ranking) and https://dev.tatoeba.org/ (new ranking). You can run a search on tatoeba.org, and then add "dev." in the URL bar and press alt+return to open a new tab.
I do prefer a ranking that favors exact matches over stemmed matches. Longest common subsequence also sounds good. But sentences having the least number of words are often not the ones I want to see most. I prefer slightly longer ones that give me more context. For that reason, I always choose random ordering. It doesn't always put the sentences that I want at the very top, but at least I have a good chance of finding them without having to go through pages and pages of very short sentences. Also, providing a mix of sentences that is random with regard to sentence length lets people see more diversity. I think that's a good thing.
[not needed anymore- removed by CK]
Yes, I can imagine that a "minimum length" and a "maximum length" option would be useful. However, the nice thing about favoring sentences that meet a certain criterion rather than limiting them to that criterion is that if there are not enough sentences that meet it, you will automatically see the other ones without having to remove the criterion and do another search. I imagine that if I set a minimum and/or maximum length, I would often eliminate some of the fallback sentences I'd like to see, and then I'd have to do a follow-up search.
Indeed, I wouldn't want the default search to be optimized for my particular needs, which would be something like "favor sentences from five to ten words in length". However, I worry that optimizing it for choosing the shortest sentences would pessimize it for people like me, whereas leaving out a criterion of length would allow people to see a variety of sentences, short and long.
If I could choose and there were no computational costs:
1. An exact match of the sentence.
2. Sentences with exact match of the query as part of it.
3. Sentences with all the exact words, but possibly in different order or with other words between them.
4. Sentences with all the words, but with stemming and the order might be different etc.
5. Sentences with all but one of the words (with stemming and could be in any order).
6. Sentences with all but two of the words (with stemming and could be in any order).
7. And so on.
8. Sentences with even a single searched word (with stemming).
Random order within the categories. (Some of the categories could be sorted into even finer subcategories, but probably not worth it.)
For example search: haluan kalastaa tänään
1. Haluan kalastaa tänään.
2. Minä haluan kalastaa tänään, niin kuin eilenkin.
3. Tänään minä haluan kalastaa. Haluan kuitenkin kalastaa tänään.
4. Haluankin kalastaa tänään. Haluatteko te tänään tai huomenna kalastamaan?
5. Haluatteko elokuviin tänään?
6-8. Karhut kalastavat lohia.
The idea would be to first have the precise phrase and then to have increasingly distantly related phrases, which hopefully would still give some understanding of the involved words.
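The categories above can be sketched roughly as follows. This is only an illustration: `toy_stem` (keep the first five letters) is a made-up stand-in that happens to work for this example and is not a real Finnish stemmer, and all function names are hypothetical.

```python
import random
import re

def words_of(text):
    return re.findall(r"\w+", text.lower())

def toy_stem(word):
    """Toy stand-in for a stemmer: keep the first five letters."""
    return word[:5]

def category(query, sentence, stem=toy_stem):
    """Return the category number of a sentence, or None if nothing matches."""
    q, s = words_of(query), words_of(sentence)
    n = len(q)
    if s == q:
        return 1  # the sentence is exactly the query
    if any(s[i:i + n] == q for i in range(len(s) - n + 1)):
        return 2  # the query appears as a contiguous phrase
    if set(q) <= set(s):
        return 3  # all exact words, in any order
    stems = {stem(w) for w in s}
    missing = sum(1 for w in set(q) if stem(w) not in stems)
    if missing == len(set(q)):
        return None  # not a single searched word matches
    return 4 + missing  # 4: all stems match, 5: all but one, and so on

def order(query, sentences, seed=0):
    """Random order within each category, categories in ascending order."""
    matched = [s for s in sentences if category(query, s) is not None]
    random.Random(seed).shuffle(matched)
    return sorted(matched, key=lambda s: category(query, s))  # stable sort

query = "haluan kalastaa tänään"
```

Because Python's sort is stable, shuffling first and sorting by category afterwards keeps the shuffled order inside each category.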
@Thanuir @AlanF_US @CK
Thank you for your feedback. I agree about the relative uselessness of having very short sentences shown first. The idea of randomizing the results within a category, like Thanuir said, is appealing (provided the order is deterministic), but I’m afraid it could be a little bit confusing. I temporarily set up https://dev.tatoeba.org/ like that, please let me know what you think.
Or, if we are to rank using the number of words, what would be the ideal number? Not too long and not too short. It depends on the language of course. Here are some stats about the average number of words per sentence in every language on Tatoeba: https://gist.github.com/jiru/81...5917dc18325fc2
I wonder if we could use these numbers to boost the ranking of sentences having a number of words close to the average, with a formula like rank = –abs(average – words)
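As a rough illustration of that formula, here is a small Python sketch. Counting words by splitting on whitespace and using 5.0 as the average are assumptions made for the example, not values taken from the stats.

```python
def length_score(sentence, average_words):
    """rank = -abs(average - words): 0 is best, more negative is worse."""
    words = len(sentence.split())
    return -abs(average_words - words)

sentences = [
    "There you go!",
    "You may go there whenever you like.",
    "You may go there if you want.",
    "Go!",
]
# Highest score first: sentences closest to the average length come on top.
ranked = sorted(sentences, key=lambda s: length_score(s, 5.0), reverse=True)
```

Sentences that are too short and too long are penalized symmetrically, so a 3-word and a 7-word sentence get the same score.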
> Or, if we are to rank using the number of words, what would be the ideal number?
When I look at the lists of sentences that I've compiled for my own learning, the length of the sentences does tend to be pretty close to the average for the language (5+ for both Hebrew and Russian).
One issue with randomness is that reproducing problems or strange behaviour would be more difficult. Maybe displaying the sentences newest first would be an alternative that adds some amount of randomness while retaining reproducibility?
Using the average number of words, as suggested by @gillux, would create an alternating pattern that @CK suggests, but without the initial emphasis on slightly longer sentences. It would not be terribly difficult to create a function that would give the type of behaviour that @CK wants if it replaced the absolute value in gillux's formula. The function would have to be a piecewise defined function with three linear pieces, or a more complicated one. I do not know how computationally expensive it is to deal with a piecewise defined function.
One thing I believe would be good would be to show sentences that contain most of the right words, but not all. For example, I tried searching for the English idiom "to talk through one's hat" using two queries: "talk through one's hat" and "He is talking through his hat." with no results. Presumably the sentences are not there. But if the sentence with "Tom" rather than "he" were there, or the sentences with she/her were there, I would not find them. Or the sentence with you or I.
The use case for searching for the idiom might be that I am trying to understand it or that I would like to translate it. Both would be helped by the search finding sentences which match only some of the words in the search query, but presenting them after the sentences with all the words.
Yes, we definitely need randomness to be reproducible (and unpredictable, to avoid "rank boost threats") if this is the direction we’re taking. If I give you a search result URL, I expect that you see the same results as me, and that it stays more or less the same for a little time. I believe it is technically possible to produce a random, deterministic and unpredictable order.
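One possible way to get such an order, sketched in Python under assumptions: each sentence gets a sort key computed as a keyed hash (HMAC) of the query and the sentence ID, using a secret known only to the server. The names `shuffle_key`, `stable_shuffle` and the secret value are hypothetical, and this is not necessarily how it would be done on Tatoeba.

```python
import hashlib
import hmac

def shuffle_key(secret, query, sentence_id):
    """Keyed hash of (query, sentence id), used only as a sort key."""
    message = f"{query}\x00{sentence_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).digest()

def stable_shuffle(secret, query, sentence_ids):
    """Same secret and query always give the same pseudo-random order."""
    return sorted(sentence_ids, key=lambda sid: shuffle_key(secret, query, sid))

SECRET = b"server-side secret (placeholder value)"
ids = [101, 102, 103, 104, 105]
first = stable_shuffle(SECRET, "you go there", ids)
second = stable_shuffle(SECRET, "you go there", ids)
```

The order is deterministic (the same URL gives the same results) and, without the secret, a contributor cannot predict where a sentence will land. Mixing something like the current week into the message would make the order change from time to time.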
I am concerned that boosting sentences having a number of words close to average is going to be detrimental to diversity, because it’s an incentive for contributors to produce standard-sized sentences. Isn’t there a risk of uniformization of the corpus? Or do we actually want more example sentences that are "efficient" and "standard"? I’d like to know @TRANG's opinion on that matter.
The idea of showing newest sentences first (after sorting by exact matches and LCS) is interesting. It surely adds some randomness, but it’s also an incentive to produce new sentences, and it gives more exposure to new or active users.
Since there is no consensus on an alternative to sorting by number of words, for the moment I’m going to change the default search ranking the way I described on the first post of this thread. We may further improve it later on.
I did not completely follow the conversation, but I'd like to answer the following
> I am concerned that boosting sentences having a number of words close to average is going to be detrimental to diversity, because it’s an incentive for contributors to produce standard-sized sentences. Isn’t there a risk of uniformization of the corpus? Or do we actually want more example sentences that are "efficient" and "standard"? I’d like to know @TRANG's opinion on that matter.
From my personal experience, that would definitely uniformize the corpus in its shape (not necessarily in its content). As you said, it's not so difficult to imagine that favoring sentences with a number of words close to the average of a language would produce a huge amount of "standard-size" contributions.
For Latin languages for example, we would probably have something close to
Subject + Verb + Article + Adjective + Complement
and still from a personal point of view, that would provide SO many similar and boring sentences to translate, I'm pretty convinced that would hinder my contributions. The most interesting sentences are CLEARLY NOT the ones around the average. Well of course, you may encounter some nice expressions, or interesting words, but the vast majority would be "I borrowed a pen to Tom." instead of more elaborate, interesting sentences.
I'm always translating from the English corpus, and when I do a lot of translating at the same time, I always end up skipping several sentences because I feel like "AGAIN this sentence?!" I can only imagine my feeling if the search were biased to serve me more similar sentences... (I know English is special, but I guess the problem would be similar for the TOP 10 languages at least).
> I am concerned that boosting sentences having a number of words close to
> average is going to be detrimental to diversity, because it’s an incentive
> for contributors to produce standard-sized sentences.
The main factor for a diverse corpus is to have a diverse group of contributors, in my opinion. Next to that, the search ranking probably has very little impact on the kind of sentences that people create.
If contributors were paid every time their sentences are displayed in a search result, then I guess that would be a high enough incentive to produce sentences based on the ranking. But even then, unless they earn a living out of it, I think they will still naturally produce standard-sized sentences no matter the ranking because it's just easier to produce such sentences.
So no worries about influencing diversity here.
What we have to consider is: what is the default usage of the search that we're trying to cover?
My personal usage:
- I'm trying to figure out how to say something in a foreign language and I'm missing vocabulary or grammar knowledge.
- I saw a new word/phrase in a foreign language and I want to understand its meaning or see examples of how it's used.
For these use cases, shorter sentences are in general easier to analyze. So it makes sense to order by number of words.
But if the sentence is too short, it may be lacking context and may not be as useful as a longer sentence. So prioritizing average-sized sentences could make sense.
But average-sized sentences might not always be the most useful for everyone either. Randomizing the results also makes sense: it simply means we don't want to make assumptions about what size is "best".
Random order sounds appealing actually, but I wouldn't change to that until we gather more specific information about the issues of ordering by number of words.
I looked at the pageviews for the search in Google Analytics for the month of April.
- Pageviews with order=words: 12,990
- Pageviews with order=random: 8,339
- Pageviews with order=created: 1,198
- Pageviews with order=modified: 259
(Total pageviews for /sentences/search: 223,174)
It seems that when given the choice, people choose in majority to order by words.
Thank you for the numbers, that’s valuable information.
This shows that 90% of the visitors making a search are using the "simple search" (top bar or front page), and 10% the advanced search (advanced search page or "more search criteria" block).
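For anyone who wants to check the arithmetic, here is a quick Python sketch using the numbers posted above. It assumes, as in the reasoning above, that pageviews carrying an explicit order= parameter come from the advanced search.

```python
# Pageviews for /sentences/search in April, as posted above.
pageviews_by_order = {"words": 12990, "random": 8339, "created": 1198, "modified": 259}
total_pageviews = 223174

# Assumption: an explicit order= parameter means the advanced search was used.
advanced = sum(pageviews_by_order.values())
advanced_share = advanced / total_pageviews

print(f"advanced search: {advanced} pageviews ({advanced_share:.1%})")
print(f"simple search:   {total_pageviews - advanced} pageviews ({1 - advanced_share:.1%})")
```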
> It seems that when given the choice, people choose in majority to order by words.
However it’s not a fair choice because you can’t tell the visitors who made a choice (clicking on the dropdown, examining the choices and choosing) from the ones who didn’t (glancing over the dropdown or not even seeing it, and using the default value). order=words being the default, I believe it is overrepresented.
I find the number for order=random surprisingly high.
Personally, I use order=random in the advanced search because I generally see more diversity that way (whether I use it for translating sentences or for tagging existing ones). Sorting by number of words tends to surface similar patterns (which I use whenever I want to translate or tag the same pattern).
I use the "fewer words first" mode when I use tatoeba to actually find a way to translate something, and the "random" mode when searching for sentences to translate.
If I need exact matches, I still use "fewer words first", but also the syntax like ="gesundheit"
I've extracted the stats since January 2018 to have a broader view on advanced search usage:
I would have thought that the order=random was high because it's the default option on the "Translate sentences" page, but it was high even before.
November 2018 is when we changed the default option on the "Translate sentences" from "created" to "modified" to "random" (cf. https://github.com/Tatoeba/tatoeba2/issues/1351). It's actually interesting to see how that influenced the order=created and order=modified.
I'm not sure why the "random" option spiked so much in March...
But in any case, there have been a few months where the advanced search was used more often with order=random than with order=words, even though order=words is the default.
Very interesting! Based on this data, I suggest that we ditch order=created and modify the "relevance" algorithm to randomize results within exact matches and LCS matches.
Yes, we can definitely introduce randomness into the "relevance" algorithm. At the very least we can experiment with it for a month or two, see if anyone complains and then can check against the analytics to see if we do keep it like that or not.
I'm not entirely sure about removing order=created, but I cannot think of a use case where ordering by created does something more useful than ordering by modified. I can only think of one inconvenience: if someone has bookmarked or has shared links to search results with order=created, those links would not work as initially intended anymore. I don't think that's a blocking issue for ditching order=created though, considering how little it is used.
[not needed anymore- removed by CK]
May I ask what is your use case?
Reproducible randomness is certainly possible from a mathematical/cryptographic point of view. You do need to think carefully about whether it's possible to reverse-engineer personal information from generated pseudo-random (permalink) seeds, though.
[not needed anymore- removed by CK]
Thanks. I reverted it back to normal.
An alternative sorting algorithm would be sorting by "vote". I remember reading a suggestion in the discussions on the wall to implement a voting system (positive or negative votes) that lets users vote on sentences. Here are some advantages:
1- Good sentences or high quality sentences (in any sense you consider) are likely to have more positive votes
2- Bad sentences are likely to have more negative votes
In this way, we can have an alternative sorting algorithm (not necessarily for the default sorting, but just an alternative sorting).
This would be more relevant when there are more people rating sentences. If implementation is not difficult, then one could, of course, do it already now, even though most sentences have between zero and two ratings of any kind.
Also, sometimes one might want to see bad sentences (to fix them or neutralize them). This is not really a concern with the default search, but might be with custom searching.