As promised, you can now find at https://tatominer.imfast.io a list of words that rarely appear in Tatoeba's sentences even though users of the site search for them often.
I think it will help those of you who would like to enrich the vocabulary available on Tatoeba more efficiently.
The data on this page will be updated every Saturday.
Nice tool. I would, naturally, appreciate Finnish there, and I am sure many others would enjoy having the languages they contribute in available.
Thanks for your feedback! I just extended the support to 20 languages, including Finnish.
Thank you very much.
How does it handle stemming? As is, almost all the Finnish words are in their basic form and the search looks for the exact form.
That is, if someone has searched for both "aikovat" and "aikoa", does it count as two searches of "aikoa"? Or maybe people only look for Finnish words in the basic form on Tatoeba.
Compare the number of found sentences here:
https://tatoeba.org/spa/sentenc...%3D%22aikoa%22
https://tatoeba.org/spa/sentenc...rom=fin&to=und
All the words in the tables were searched on Tatoeba in exactly this form. My code does not perform any stemming on the queries, and all searched words are treated in the same way. If lemmas dominate the lists, it is only because these forms are the most searched for.
Furthermore, to favour queries that have not been made by power users, a search is accepted only if it contains at most one word, one equals sign and two quotation marks. All queries containing extra words or special characters such as ^, |, $ are excluded from the counts.
I have also implemented a bot detection system to reduce the noise that bots generate in the search statistics. In any case, a word is counted as searched only once per day. This counting method is especially necessary for the English corpus, which seems to be the one most targeted by DDoS attacks and web crawlers.
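For readers who want to picture the two rules above (single-word queries only, one count per word per day), here is a minimal sketch. It is not the actual code behind the page: the pattern `SINGLE_WORD`, the functions `extract_word` and `count_search_days`, and the log format of `(day, query)` pairs are all assumptions made for illustration, and bot detection is left out entirely.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical pattern: an optional "=", then one word, optionally wrapped
# in a single pair of double quotes. Anything else is rejected.
SINGLE_WORD = re.compile(r'^=?(?:"(\w+)"|(\w+))$')


def extract_word(query):
    """Return the searched word if the query is accepted, else None."""
    match = SINGLE_WORD.match(query.strip())
    if match is None:
        return None
    return match.group(1) or match.group(2)


def count_search_days(log):
    """For each accepted word, count the distinct days on which it was searched."""
    days_by_word = defaultdict(set)
    for day, query in log:
        word = extract_word(query)
        if word is not None:
            days_by_word[word].add(day)
    return {word: len(days) for word, days in days_by_word.items()}


# Illustrative log entries (made up for this example).
log = [
    (date(2020, 1, 4), '="aikoa"'),
    (date(2020, 1, 4), "aikoa"),      # same day: not counted again
    (date(2020, 1, 5), "=aikoa"),
    (date(2020, 1, 5), "a priori"),   # two words: rejected
    (date(2020, 1, 5), "^aikoa$"),    # special characters: rejected
]
print(count_search_days(log))
```

Storing a set of days per word makes the "once per day" rule fall out naturally, since repeated searches on the same day collapse into one entry.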
Thanks. It seems that people simply search for the basic forms more than others, then. Thanks for the thorough answer.
I'm not sure whether queries containing multiple words are really indicative of power users. I'd expect inexperienced users to try searching for phrases or complete sentences they want to translate, and only narrow down their search if they don't get any results.
Some statistics:
- Of 10126751 total queries, 2684835 contained at least one space (26.5%).
- Of 2068632 unique queries, 745875 contained at least one space (36.1%).
Okay, most queries are indeed only for single words.
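The two percentages above can be reproduced from the raw counts, and the same measurement can be sketched for any list of queries. The function name `space_share` and the sample list are assumptions made for illustration only.

```python
def space_share(queries):
    """Return the share of queries containing a space: (among all, among unique)."""
    unique = set(queries)
    total_share = sum(" " in q for q in queries) / len(queries)
    unique_share = sum(" " in q for q in unique) / len(unique)
    return total_share, unique_share


# Made-up sample, only to show the call.
sample = ["aikoa", "air conditioner", "air conditioner", "to code"]
print(space_share(sample))

# The figures quoted above, recomputed from the raw counts.
print(f"{2684835 / 10126751:.1%}")  # 26.5%
print(f"{745875 / 2068632:.1%}")    # 36.1%
```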
The most common queries with spaces appear to be bot activity: 28701 times "A BATTERY", 28085 times "A BIT RATHER THICK", ...
But there were also some innocent-looking ones (i.e. not in all caps). Here are the top 25 of these (I haven't checked how often they already occur on Tatoeba):
2569 air conditioner
2427 передай мне пожалуйста масло
2046 Au revoir
1784 emergency exit
1713 under -inflation
1638 driver seat
1613 power unit
1565 passenger seat
1511 lateral run-out
1470 parking lot
1314 to code
1282 diesel fuel
1271 lighting switch
1269 Per piacere
1233 right-hand drive
1228 service standard
1226 fuel gauge
1220 rear view mirror
1217 round-trip ticket
1179 door lock
1159 direct drive
1127 warming up
1119 ceiling lamp
1118 drop press
1114 card of goods
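A ranking like the one above could be produced roughly as follows. This is only a sketch: the function name `top_multiword` is hypothetical, and "innocent-looking" is approximated here as "contains a space and is not written entirely in capitals".

```python
from collections import Counter


def top_multiword(queries, n=25):
    """Rank queries that contain a space and are not entirely in capitals."""
    counts = Counter(q for q in queries if " " in q and not q.isupper())
    return counts.most_common(n)


# Made-up sample, only to show the call.
queries = ["A BATTERY"] * 3 + ["air conditioner"] * 2 + ["Au revoir", "aikoa"]
print(top_multiword(queries))
```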
Thanks for your remark. Unfortunately, the number of requests generated by bots far exceeds the number of requests that are in uppercase. None of the queries in your ranking remains in the top 25 once the most obvious bot behavior has been filtered out:
thank you
what does
i agree
how are you
of course
look forward
i speak
or else
fogd be
good morning
i love you
excuse me
bless you
add up
where is
and you
pick up
went past
i hope
i am
take off
do you understand
looking forward
according to
go away
With the exception of one (I'll let you guess which one), all these queries seem relevant and are already well served by the corpus. If we go further down the ranking, we can identify other multi-word expressions that are not yet well represented. But they sit alongside other queries that would bring more redundancy than value to the list. So I will only add those that contain at least one word that appears fewer than five times in the corpus. For the current English list of underused phrases, these are the following expressions:
3322,to evaporate,1
3730,a cappella,3
4550,a priori,1
4969,a la carte,2
4976,a posteriori,1
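The selection rule described above (keep a phrase only if at least one of its words appears fewer than five times in the corpus) might be sketched like this. The function name `is_underused`, the whitespace tokenization, the lowercasing and the word counts are all assumptions made for illustration, not the actual implementation or real corpus figures.

```python
from collections import Counter


def is_underused(phrase, corpus_counts, threshold=5):
    """True if at least one word of the phrase occurs fewer than `threshold` times."""
    # A Counter returns 0 for words that never occur in the corpus.
    return any(corpus_counts[word] < threshold for word in phrase.lower().split())


# Made-up word frequencies, only to show the call.
corpus_counts = Counter({"a": 1000, "thank": 800, "you": 5000, "cappella": 3})
print(is_underused("a cappella", corpus_counts))  # True
print(is_underused("thank you", corpus_counts))   # False
```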
Norton Safe Web says the page is known to be dangerous: https://safeweb.norton.com/repo....io/&ulang=de.
Likewise for imfast.io: https://safeweb.norton.com/repo...s://imfast.io/
This may be because the imfast.io domain is shared by many static websites, including malicious ones. But don't worry, you should be OK if you browse this web page!
Or maybe it’s because your page includes mixed content (http resources loaded from an https page): https://support.mozilla.org/en-...ocking-firefox
You are right, the flags used to be loaded through simple http. This issue has been fixed by yesterday's update.
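For anyone who wants to check a static page for this kind of issue, here is a minimal sketch using Python's standard `html.parser`. The class `MixedContentFinder` and the function `find_mixed_content` are hypothetical names, and the check is deliberately simplistic: it only looks at `src` attributes and at `href` on `<link>` tags, so it will miss resources loaded from CSS or scripts.

```python
from html.parser import HTMLParser


class MixedContentFinder(HTMLParser):
    """Collect sub-resources that are referenced over plain http."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Ordinary <a href> links are navigation, not mixed content.
            is_subresource = name == "src" or (tag == "link" and name == "href")
            if is_subresource and value and value.lower().startswith("http://"):
                self.insecure.append((tag, value))


def find_mixed_content(html):
    """Return (tag, url) pairs for sub-resources loaded over http."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure


# Made-up page fragment, only to show the call.
page = (
    '<img src="http://example.com/flags/fi.png">'
    '<a href="http://example.com/">link</a>'
    '<img src="https://example.com/flags/fr.png">'
)
print(find_mixed_content(page))
```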
That's really cool! Thanks for that.