Tatoeba
MisterTrouser, August 6, 2020 at 1:40:19 PM UTC

Short Enhancement request:
The search "^Tom * * * Mary$" finds sentences with 5 words: 1 word per star.

However, "^Tom * * *" or "* * *" does not match 1 word per star.

Suggestion:
Make it possible to search for sentences by word count, using the number of stars.

Reason:
Consistency

Disclaimer:
I did not check if this would interfere with other search options.

AlanF_US, August 6, 2020 at 5:00:13 PM UTC (edited August 6, 2020 at 5:00:51 PM UTC)

Rather than give a general reason like "consistency", could you please give a broader picture as to what you're trying to achieve, and why the current behavior doesn't provide it for you? By "what you're trying to achieve", I don't just mean "I'm trying to find sentences that start with 'Tom' and contain exactly five words, without my having to specify the last one". I mean why, in the bigger picture, do you want to find sentences that start with 'Tom' and contain exactly five words, and given this reason, why is it a problem to specify the last word?

Note also that the basic syntax of searches is determined by our search engine (Manticore), not us. There are occasions when we do some preprocessing on our own side (for instance, when the search query ends with a question mark that is intended as punctuation, not a wildcard), but those are very rare.

MisterTrouser, August 6, 2020 at 11:25:44 PM UTC

I personally would use it to search for sentences with a specific number of words. So I would use "* * * * *" to find a five-word sentence.
Why? Because in many cases searches hit too many single words ("ouch!") or two-word sentences ("go away"). If their count exceeds 1000, the search result is useless.

> why is it a problem to specify the last word?
Because there are many possible last words?
Ok, maybe something similar to
^(a*|b*|c*|d*|e*|f*|g*|h*|i*|j*|k*|l*|m*|n*|o*|p*|q*|r*|s*|t*|u*|v*|w*|x*|y*|z*) * * * (a*|b*|c*|d*|e*|f*|g*|h*|i*|j*|k*|l*|m*|n*|o*|p*|q*|r*|s*|t*|u*|v*|w*|x*|y*|z*)$

might work. But that kind of search is too heavy.

Imagine you want to study the word "go". You might search for "* * go * *" to find "usable" example sentences you can use to study with. You cannot get more than three words:
https://tatoeba.org/eng/sentenc...o=und&page=100
Or fewer than far too many:
https://tatoeba.org/eng/sentenc...sort=relevance

Regarding the pure word count search, I had a conversation with CK before (where I thought I might solve the problem myself).

The main reason for posting again is the added reason of search syntax consistency (I know, I'm repeating myself, but I added clearer examples this time):

From Wiki:
> This example finds English sentences that have "Tom", then two words, then "Mary", then one word, and then "John."

> "Tom * * Mary * John"

> This example finds English sentences that start with "Tom", then have three words, then end with "Mary".

> "^Tom * * * Mary$"

So why would "^Tom * * *" not work?
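
To make the requested behaviour concrete, here is a minimal Python sketch of "one word per star" matching applied to a local list of sentences. It is only an illustration of the semantics asked for above, not how Manticore or Tatoeba evaluate queries; the function names `tokenize` and `matches` are made up for this example, and the tokenizer is deliberately crude (it splits on non-word characters, so "don't" counts as two words).

```python
import re

def tokenize(sentence):
    """Split a sentence into lowercase words, dropping punctuation (crude)."""
    return re.findall(r"\w+", sentence.lower())

def matches(pattern, sentence):
    """Return True if `sentence` fits `pattern`, with each star standing for one word.

    A leading ^ pins the pattern to the start of the sentence,
    a trailing $ pins it to the end. Hypothetical helper, for illustration only.
    """
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    terms = pattern.strip("^$").lower().split()
    words = tokenize(sentence)
    n = len(terms)
    if n == 0 or n > len(words):
        return False
    # Candidate start positions for a window of n consecutive words.
    starts = [0] if anchored_start else range(len(words) - n + 1)
    for i in starts:
        if anchored_end and i + n != len(words):
            continue
        window = words[i:i + n]
        if all(t == "*" or t == w for t, w in zip(terms, window)):
            return True
    return False

sentences = ["Ouch!", "Go away.", "Tom likes green tea.", "Tom went home with Mary."]
four_words_from_tom = [s for s in sentences if matches("^Tom * * *$", s)]
```

With both anchors present, the number of terms fixes the sentence length, so "^* * *$" would select exactly the three-word sentences.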

Thanuir, August 7, 2020 at 7:39:04 AM UTC

By restricting the language to English you also get four-word sentences, but this doesn't help much. If you exclude some words, there are fewer sentences and you also see longer ones. For example, when a few names are excluded, more than half of the sentences have four words: https://tatoeba.org/spa/sentenc...sort=relevance

Random ordering, on the other hand, gives a lot of medium-length sentences. Maybe that would be useful to you?

MisterTrouser, August 7, 2020 at 11:40:19 PM UTC

Thanuir: That's a workaround that might 'accidentally' work. It's nothing to rely on, especially because you don't know which words to exclude from the search (remember, we're probably talking about a language the user is not familiar with).

CK, August 8, 2020 at 12:42:40 AM UTC

I don't know if it is possible, but would something like the following work for you if this could be done?

In the advanced search, offer one or both of these options.

1. When the results are sorted by the number of words, have the option to only display sentences with X or more words. For example, 5 or more words, 6 or more words, 7 or more words, ...

2. No matter how results are sorted, have the option to skip the first so many results. For example, 500, 1000, 1500, 2000, 2500, ...
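
The two options can be sketched in a few lines of Python, operating on a plain list of sentences rather than on the real search backend. The function names `at_least` and `skip_first` are hypothetical, and counting words by splitting on whitespace is an assumption that only suits languages written with spaces.

```python
def word_count(sentence):
    """Count whitespace-separated words (an assumption; unsuitable for e.g. Japanese)."""
    return len(sentence.split())

def at_least(sentences, min_words):
    """Option 1: shortest-first ordering, keeping only sentences with min_words or more."""
    kept = [s for s in sentences if word_count(s) >= min_words]
    return sorted(kept, key=word_count)

def skip_first(results, offset):
    """Option 2: drop the first `offset` results, whatever order they are in."""
    return results[offset:]

sample = ["Ouch!", "Go away.", "I have to go now.", "Let's go home."]
three_or_more = at_least(sample, 3)
after_two = skip_first(sample, 2)
```

Option 1 depends on a word count being available for each sentence, while option 2 is a plain offset into whatever result list already exists.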

MisterTrouser, August 10, 2020 at 5:47:41 AM UTC

@CK
1. Would be the perfect solution for this particular problem.

2. Sounds like it would work, but I think such an option is a bad idea from a UI standpoint.
Maybe allowing more than 100 pages in the search results would be easier to understand. One UI question here would be: how to (usefully) present (e.g.) 580 pages to the user ( https://tatoeba.org/eng/sentenc...rom=eng&to=und )

I'd guess (2) is easier to implement..?

CK, August 10, 2020 at 6:48:06 AM UTC

I know it's not exactly what you want, but I quickly threw together some pages with 4-word, 5-word, 6-word, 7-word & 8-word English sentences with audio.

http://study.aitech.ac.jp/sente...audio/x-words/

This is based on some programming that I'd already done, so some of the text on the pages may not be fine-tuned for this purpose.

MisterTrouser, August 11, 2020 at 4:16:03 AM UTC

Thank you for the work. But it looks like you were operating directly on the database?

Looking at gillux's link, I guess the discussion is finished :D

Thanuir, August 8, 2020 at 4:58:21 AM UTC

I offered it as a suggestion for the problem you described there. It is by no means ideal, but it should work as a stopgap until something better is devised.

It also works in general, in the sense that adding constraints essentially always reduces the number of sentences. So whenever you find only sentences that are too short or too long, you can simply add a constraint. If you constrained too much, remove the constraint.

gillux, August 10, 2020 at 4:01:39 PM UTC (edited August 10, 2020 at 4:05:47 PM UTC)

It looks like filtering by word count is the most straightforward way to solve your problem. This enhancement has been suggested before and the progress is tracked here: https://github.com/Tatoeba/tatoeba2/issues/1954

Edit: it was you who actually first suggested this. Just to make it clear, we didn’t forget, but we have other priorities for now. :-)

MisterTrouser, August 11, 2020 at 4:17:01 AM UTC

Thank you for providing the link, gillux. At that time, it didn't sound like the bug was "written down" (on GitHub) by anyone. So I guessed it had been dismissed as "not important".

I should apologize, then, for making a fuss.