Tatoeba

lbdx, April 11, 2020 at 11:54:10 AM UTC (edited April 11, 2020 at 11:54:50 AM UTC)

As promised, you can now find at https://tatominer.imfast.io a list of words that rarely appear in Tatoeba's sentences even though the site's users search for them often.
I think it will help those of you who would like to enrich the vocabulary available on Tatoeba more efficiently.
The data on this page will be updated every Saturday.

Thanuir, April 11, 2020 at 2:19:55 PM UTC

Nice tool. I would, naturally, appreciate Finnish there, and I am sure many others would enjoy having the languages they contribute in available.

lbdx, April 11, 2020 at 6:12:20 PM UTC

Thanks for your feedback! I just extended the support to 20 languages, including Finnish.

Thanuir, April 11, 2020 at 6:15:38 PM UTC

Thank you very much.

Thanuir, April 14, 2020 at 1:32:28 PM UTC

How does it handle stemming? As is, almost all the Finnish words are in their basic form and the search looks for the exact form.

That is, if someone has searched for both "aikovat" and "aikoa", does it count as two searches for "aikoa"? Or maybe people only look for Finnish words in the basic form on Tatoeba.

Compare the number of found sentences here:
https://tatoeba.org/spa/sentenc...%3D%22aikoa%22
https://tatoeba.org/spa/sentenc...rom=fin&to=und

lbdx, April 14, 2020 at 3:27:49 PM UTC

All the words in the tables were searched on Tatoeba in exactly this form. My code does not perform any stemming on the queries, and all searched words are treated in the same way. If lemmas dominate the tables, it is only because these forms are the most searched for.

Furthermore, to favour queries that have not been made by power users, only one word, one equal sign and two quotation marks are accepted in the content of a search. All queries containing extra words or special characters such as ^, |, $ are rejected from the counts.

I have also implemented a bot detection system to reduce the noise that bots generate in the search statistics. In any case, a word is counted as searched only once per day. This counting method is especially necessary for the English corpus, which seems to be the one most targeted by DDoS attacks and web crawlers.
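
The counting rules described above can be sketched roughly as follows. This is only an illustration, not the actual Tatominer code: the function names, the regular expression, the log format of (date, language, query) tuples, and the reading of "once per day" as "once per day and per language" are all assumptions. Bot detection is left out entirely.

```python
import re
from collections import Counter

# Hypothetical pattern: an optional "=", then a single word made of letters,
# optionally wrapped in one pair of double quotes. Queries with spaces or
# special characters such as ^, | or $ do not match and are rejected.
SINGLE_WORD = re.compile(r'=?(?:"([^\W\d_]+)"|([^\W\d_]+))')


def extract_word(query):
    """Return the searched word as typed, or None if the query is rejected."""
    match = SINGLE_WORD.fullmatch(query.strip())
    if match is None:
        return None
    return match.group(1) or match.group(2)


def count_searches(log):
    """Count each (language, word) pair at most once per day.

    `log` is an iterable of (date, language, query) tuples (assumed format).
    No stemming is applied: "aikoa" and "aikovat" are counted separately.
    """
    seen = set()
    counts = Counter()
    for day, lang, query in log:
        word = extract_word(query)
        if word is None:
            continue
        key = (day, lang, word)
        if key in seen:
            continue  # already counted today
        seen.add(key)
        counts[(lang, word)] += 1
    return counts
```

With such a scheme, a word searched a thousand times in one day by a crawler weighs no more than a word searched once by a single visitor.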

Thanuir, April 14, 2020 at 5:01:41 PM UTC

Thanks. It seems that people simply search for the basic forms more than for other forms, then. Thanks for the thorough answer.

Yorwba, April 14, 2020 at 9:04:12 PM UTC

I'm not sure whether queries containing multiple words are really indicative of power users. I'd expect inexperienced users to try searching for phrases or complete sentences they want to translate, and only narrow down their search if they don't get any results.

Some statistics:
- Of 10126751 total queries, 2684835 contained at least one space (26.5%).
- Of 2068632 unique queries, 745875 contained at least one space (36.1%).
Okay, most queries are indeed only for single words.
The most common queries with spaces appear to be bot activity: 28701 times "A BATTERY", 28085 times "A BIT RATHER THICK", ...
But there were also some innocent-looking ones (i.e. not in all caps). Here are the top 25 of these (I haven't checked how often they already occur on Tatoeba):
2569 air conditioner
2427 передай мне пожалуйста масло
2046 Au revoir
1784 emergency exit
1713 under -inflation
1638 driver seat
1613 power unit
1565 passenger seat
1511 lateral run-out
1470 parking lot
1314 to code
1282 diesel fuel
1271 lighting switch
1269 Per piacere
1233 right-hand drive
1228 service standard
1226 fuel gauge
1220 rear view mirror
1217 round-trip ticket
1179 door lock
1159 direct drive
1127 warming up
1119 ceiling lamp
1118 drop press
1114 card of goods
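
Statistics of this kind can be computed with a few lines of Python. The sketch below is an assumption about the method, not the script that produced the figures above: it takes the queries as a plain list of strings and treats "not in all caps" as `not query.isupper()`.

```python
from collections import Counter


def space_statistics(queries):
    """Count total and unique queries, and how many contain a space."""
    counts = Counter(queries)
    total = sum(counts.values())
    total_with_space = sum(c for q, c in counts.items() if " " in q)
    unique_with_space = sum(1 for q in counts if " " in q)
    return {
        "total": total,
        "total_with_space": total_with_space,
        "unique": len(counts),
        "unique_with_space": unique_with_space,
    }


def top_non_caps(queries, n=25):
    """Return the n most common queries that contain a space and are not all caps."""
    counts = Counter(queries)
    ranked = [
        (query, count)
        for query, count in counts.most_common()
        if " " in query and not query.isupper()
    ]
    return ranked[:n]
```

The percentages are then simply `total_with_space / total` and `unique_with_space / unique`.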

lbdx, April 15, 2020 at 8:46:27 PM UTC

Thanks for your remark. Unfortunately, the number of requests generated by bots far exceeds the number of requests that are in uppercase. None of the queries in your ranking remain in the top 25 once the most obvious bot behavior has been filtered out. The top 25 then becomes:

thank you
what does
i agree
how are you
of course
look forward
i speak
or else
fogd be
good morning
i love you
excuse me
bless you
add up
where is
and you
pick up
went past
i hope
i am
take off
do you understand
looking forward
according to
go away

With the exception of one (I'll let you guess which one), all these queries seem relevant and are already well served by the corpus. Further down the ranking, we can identify other multi-word expressions that are not yet well represented, but they sit alongside queries that would bring more redundancy than value to the list. So I will only add those that contain at least one word that appears fewer than five times in the corpus. For the current English list of underused phrases, these are the following expressions:

3322,to evaporate,1
3730,a cappella,3
4550,a priori,1
4969,a la carte,2
4976,a posteriori,1
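
The selection rule for multi-word expressions can be sketched as below. This is a rough illustration under assumptions, not the actual implementation: `word_counts` is a hypothetical mapping from each word to its number of occurrences in the corpus, and splitting on whitespace after lowercasing is a simplification.

```python
def underused_phrases(phrases, word_counts, threshold=5):
    """Keep multi-word queries containing at least one rare word.

    A word is "rare" when it appears fewer than `threshold` times in the
    corpus; words missing from `word_counts` are treated as appearing 0 times.
    """
    selected = []
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) < 2:
            continue  # single words are handled by the main list
        if any(word_counts.get(word, 0) < threshold for word in words):
            selected.append(phrase)
    return selected


# Made-up counts, for illustration only.
word_counts = {"thank": 900, "you": 5000, "a": 100000, "cappella": 3}
print(underused_phrases(["thank you", "a cappella"], word_counts))
```

Here "thank you" is dropped because both of its words are already common in the made-up counts, while "a cappella" is kept because "cappella" falls below the threshold.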

Pfirsichbaeumchen, April 15, 2020 at 10:45:08 AM UTC

It says the page is known as a dangerous page: https://safeweb.norton.com/repo....io/&ulang=de.

Thanuir, April 15, 2020 at 11:31:20 AM UTC

The same goes for imfast.io: https://safeweb.norton.com/repo...s://imfast.io/

lbdx, April 15, 2020 at 8:49:18 PM UTC

This may be because the imfast.io domain is shared by many static websites, including malicious ones. But don't worry, you should be OK if you browse this web page!

gillux, April 15, 2020 at 11:55:18 PM UTC

Or maybe it’s because your page includes mixed content (HTTP resources loaded from an HTTPS page): https://support.mozilla.org/en-...ocking-firefox
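
One way to check a page for this is to scan its HTML for resources referenced over plain HTTP. The sketch below uses Python's standard `html.parser`; the class and function names are made up for the example, and it only looks at `src` attributes and at `href` on `<link>` tags, so it is a rough check rather than a complete mixed-content audit.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collect resources that a page loads over plain HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            # Images, scripts and frames use "src"; stylesheets use <link href>.
            # Ordinary <a href> links are navigation, not mixed content.
            is_resource = name == "src" or (tag == "link" and name == "href")
            if is_resource and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html):
    """Return (tag, url) pairs for resources referenced over http://."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure
```

Running `find_mixed_content` on the source of a page served over HTTPS lists the resources a browser may block or warn about.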

lbdx, April 16, 2020 at 7:05:34 AM UTC

You are right, the flags used to be loaded over plain HTTP. This issue was fixed in yesterday's update.

deniko, April 15, 2020 at 11:24:17 AM UTC

That's really cool! Thanks for that.