lbdx, 11 April 2020, 11:54:10 AM UTC (edited 11 April 2020, 11:54:50 AM UTC)

As promised, you can now find at https://tatominer.imfast.io a list of words that rarely appear in Tatoeba's sentences even though the site's users search for them often.
I think it will help those of you who would like to enrich the vocabulary available on Tatoeba more efficiently.
The data on this page will be updated every Saturday.

Thanuir, 11 April 2020, 2:19:55 PM UTC

Nice tool. I would, naturally, appreciate Finnish there, and I am sure many others would enjoy having the languages they contribute in available.

lbdx, 11 April 2020, 6:12:20 PM UTC

Thanks for your feedback! I just extended the support to 20 languages, including Finnish.

Thanuir, 11 April 2020, 6:15:38 PM UTC

Thank you very much.

Thanuir, 14 April 2020, 1:32:28 PM UTC

How does it handle stemming? As is, almost all the Finnish words are in their basic form and the search looks for the exact form.

That is, if someone has searched for both "aikovat" and "aikoa", does it count as two searches for "aikoa"? Or maybe people only look for Finnish words in the basic form on Tatoeba.

Compare the number of found sentences here:
https://tatoeba.org/spa/sentenc...%3D%22aikoa%22
https://tatoeba.org/spa/sentenc...rom=fin&to=und

lbdx, 14 April 2020, 3:27:49 PM UTC

All the words in the tables were searched on Tatoeba in exactly this form. My code does not perform any stemming on the queries, and all searched words are treated in the same way. If lemmas dominate the tables, it is only because these forms are the most searched for.

Furthermore, to favour queries that have not been made by power users, only one word, one equal sign and two quotation marks are accepted in the content of a search. All queries containing extra words or special characters such as ^, |, $ are rejected from the counts.
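
The acceptance rule described above could be sketched roughly as follows. This is only an illustration, not lbdx's actual code: the function name `extract_word` is hypothetical, and the exact accepted shape (an optional leading `=`, an optional pair of double quotes, and one word made of letters with inner hyphens or apostrophes allowed) is an assumption about what "one word, one equal sign and two quotation marks" means.

```python
import re
from typing import Optional

# Assumed shape: optional "=", optional pair of double quotes, one word.
# A "word" here is a run of Unicode letters, optionally joined by inner
# hyphens or apostrophes (an assumption, not stated in the post).
_SIMPLE_QUERY = re.compile(r"""=?("?)([^\W\d_]+(?:[-'][^\W\d_]+)*)\1""")


def extract_word(query: str) -> Optional[str]:
    """Return the searched word if the query is a plain single-word search.

    Queries with extra words, unbalanced quotes, or operators such as
    ^, | and $ do not match the pattern and yield None.
    """
    match = _SIMPLE_QUERY.fullmatch(query.strip())
    return match.group(2) if match else None
```

With this sketch, `="aikoa"` and `aikoa` are both accepted as the word `aikoa`, while `air conditioner` and `^aikoa$` are rejected.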

I have also implemented a bot detection system to reduce the noise that bots generate in the search statistics. In any case, a word is counted as searched only once per day. This counting method is especially necessary for the English corpus, which seems to be the one most targeted by DDoS attacks and web crawlers.
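
The once-per-day counting could look something like the sketch below. It assumes the search log has already been reduced to `(day, word)` pairs and that "once per day" means once per calendar day across all users; the function name `count_search_days` is hypothetical. The bot detection itself is not shown because the post does not describe how it works.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def count_search_days(log: Iterable[Tuple[date, str]]) -> Counter:
    """Count, for each word, the number of distinct days on which it was searched."""
    # Building a set collapses repeated (day, word) pairs, so a burst of
    # identical searches on one day contributes a single count.
    distinct_pairs = set(log)
    return Counter(word for _day, word in distinct_pairs)
```

A crawler that repeats the same query thousands of times in one day then adds at most 1 to that word's score.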

Thanuir, 14 April 2020, 5:01:41 PM UTC

Thanks. It seems that people simply search for the basic forms more than the others, then. Thanks for the thorough answer.

Yorwba, 14 April 2020, 9:04:12 PM UTC

I'm not sure whether queries containing multiple words are really indicative of power users. I'd expect inexperienced users to try searching for phrases or complete sentences they want to translate, and only narrow down their search if they don't get any results.

Some statistics:
- Of 10126751 total queries, 2684835 contained at least one space (26.5%).
- Of 2068632 unique queries, 745875 contained at least one space (36.1%).
Okay, most queries are indeed only for single words.
The most common queries with spaces appear to be bot activity: 28701 times "A BATTERY", 28085 times "A BIT RATHER THICK", ...
But there were also some innocent-looking ones (i.e. not in all caps). Here are the top 25 of these (I haven't checked how often they already occur on Tatoeba):
2569 air conditioner
2427 передай мне пожалуйста масло
2046 Au revoir
1784 emergency exit
1713 under -inflation
1638 driver seat
1613 power unit
1565 passenger seat
1511 lateral run-out
1470 parking lot
1314 to code
1282 diesel fuel
1271 lighting switch
1269 Per piacere
1233 right-hand drive
1228 service standard
1226 fuel gauge
1220 rear view mirror
1217 round-trip ticket
1179 door lock
1159 direct drive
1127 warming up
1119 ceiling lamp
1118 drop press
1114 card of goods
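
Statistics of the kind quoted above (share of queries containing a space, and the most frequent multi-word queries that are not in all caps) could be computed along these lines. This is a sketch under assumptions: the query log is taken to be a plain iterable of query strings, and the function names `space_share` and `top_multiword` are hypothetical, not taken from any Tatoeba tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def space_share(queries: Iterable[str]) -> Tuple[float, float]:
    """Return the percentage of total and of unique queries containing a space."""
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        return (0.0, 0.0)
    spaced_total = sum(n for q, n in counts.items() if " " in q)
    spaced_unique = sum(1 for q in counts if " " in q)
    return (100 * spaced_total / total, 100 * spaced_unique / len(counts))


def top_multiword(queries: Iterable[str], limit: int = 25) -> List[Tuple[str, int]]:
    """Return the most frequent queries that contain a space and are not all caps."""
    counts = Counter(q for q in queries if " " in q and not q.isupper())
    return counts.most_common(limit)
```

Filtering on `str.isupper()` only removes the all-caps queries; it is a crude stand-in for bot detection and will miss bots that send lowercase queries.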

lbdx, 15 April 2020, 8:46:27 PM UTC

Thanks for your remark. Unfortunately, the number of requests generated by bots far exceeds the number of requests that are in uppercase. None of the queries in your ranking remains in the top 25 once the most obvious bot behaviors are filtered out. The top 25 then looks like this:

thank you
what does
i agree
how are you
of course
look forward
i speak
or else
fogd be
good morning
i love you
excuse me
bless you
add up
where is
and you
pick up
went past
i hope
i am
take off
do you understand
looking forward
according to
go away

With the exception of one (I'll let you guess which one), all these queries seem relevant and are already well served by the corpus. Further down the ranking we can identify other multi-word expressions that are not yet well represented, but they sit alongside queries that would bring more redundancy than value to the list. So I will only add those that contain at least one word that appears fewer than five times in the corpus. For the current English list of underused phrases, these are the following expressions:

3322,to evaporate,1
3730,a cappella,3
4550,a priori,1
4969,a la carte,2
4976,a posteriori,1
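
The selection rule stated above (keep a multi-word query only if at least one of its words appears fewer than five times in the corpus) could be sketched as below. The function name `underused_phrases`, the whitespace tokenization, and the lowercase-keyed `word_counts` mapping are all assumptions made for illustration; the real tokenization may differ.

```python
from typing import Iterable, List, Mapping


def underused_phrases(
    phrases: Iterable[str],
    word_counts: Mapping[str, int],
    threshold: int = 5,
) -> List[str]:
    """Keep phrases containing at least one word seen fewer than `threshold` times.

    `word_counts` maps lowercase words to their number of occurrences in the
    corpus; words missing from the mapping are treated as occurring 0 times.
    """
    return [
        phrase
        for phrase in phrases
        if any(word_counts.get(word, 0) < threshold for word in phrase.lower().split())
    ]
```

Under this rule a phrase such as "thank you" is dropped because both of its words are common, while "a priori" is kept as long as "priori" is rare in the corpus.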

Pfirsichbaeumchen, 15 April 2020, 10:45:08 AM UTC

It says the page is known as a dangerous page: https://safeweb.norton.com/repo....io/&ulang=de.

Thanuir, 15 April 2020, 11:31:20 AM UTC

Likewise for imfast.io: https://safeweb.norton.com/repo...s://imfast.io/

lbdx, 15 April 2020, 8:49:18 PM UTC

This may be because the imfast.io domain is shared by many static websites, including malicious ones. But don't worry, you should be OK if you browse this web page!

gillux, 15 April 2020, 11:55:18 PM UTC

Or maybe it’s because your page includes mixed content (HTTP resources loaded from an HTTPS page): https://support.mozilla.org/en-...ocking-firefox
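
One way to check a page for this kind of mixed content is to scan its HTML for subresources referenced over plain `http://`. The sketch below uses only Python's standard `html.parser`; the class name `MixedContentFinder` is hypothetical, and the list of tag/attribute pairs is a small, non-exhaustive assumption (it ignores CSS `url(...)` references and resources loaded by scripts, for instance).

```python
from html.parser import HTMLParser
from typing import List, Tuple


class MixedContentFinder(HTMLParser):
    """Collect subresources that an HTTPS page would load over plain HTTP."""

    # Tag/attribute pairs that trigger a fetch (not exhaustive).
    RESOURCE_ATTRS = {
        ("img", "src"),
        ("script", "src"),
        ("iframe", "src"),
        ("link", "href"),
    }

    def __init__(self) -> None:
        super().__init__()
        self.insecure: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            if (tag, name) in self.RESOURCE_ATTRS and value:
                if value.lower().startswith("http://"):
                    self.insecure.append((tag, value))
```

Usage is `finder = MixedContentFinder(); finder.feed(html_text)`, after which `finder.insecure` lists the offending tags and URLs. Ordinary `<a href="http://...">` links are deliberately ignored, since following a link is not a subresource load.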

lbdx, 16 April 2020, 7:05:34 AM UTC

You are right, the flags used to be loaded over plain HTTP. This issue was fixed in yesterday's update.

deniko, 15 April 2020, 11:24:17 AM UTC

That's really cool! Thanks for that.