If you're looking for words to write example sentences for Tatoeba, then you should look at the arrows '<-- increase'. They indicate the words that should appear more in the Tatoeba Corpus. The second column of numbers indicates how many times that word should appear in Tatoeba in order to have the same frequency (proportionally) it has in the OpenSubtitles.com corpus.
I downloaded a file from http://opus.nlpl.eu/OpenSubtitles-v2018.php with 441.5M sentences in English and wrote a script to create a frequency list of words (the list is unformatted, i.e., don't -> don + t, etc). I did the same thing with all the sentences in English owned by native speakers here in the Tatoeba corpus. After comparing the two frequency lists, I compiled a file that gives you an idea of how the frequency of the words in Tatoeba/English compares to that of the OpenSubtitles/English corpus:
A sample of the file:
word \t occurrences in Tatoeba \t how many there should be (proportionally)
tom 359494 723.9471614662318
i 303688 284673.5980557653
to 294034 174335.06447556426
that 234251 105369.07199955775
the 199956 231733.21163537377 <-- increase
t 197499 101493.03356767741
you 176213 305392.1646249775 <-- increase
mary 142565 625.8323390462842
a 130684 149839.41303818647 <-- increase
do 120631 46119.76310532704
he 106705 57203.45769471371
is 106175 74813.6430558621
and 89045 104790.98523257893 <-- increase
was 76168 43791.06207458618
s 75846 152917.0533331018 <-- increase
in 73907 75330.66885961362 <-- increase
it 70138 140934.43701484663 <-- increase
of 68495 90226.5530729227 <-- increase
she 64731 27836.557998979915
be 60860 43264.01316312008
they 59631 32147.915102217754
me 56818 68010.83329256669 <-- increase
have 53539 48189.83034799129
said 53431 8299.380929768524
we 49383 72168.9096779763 <-- increase
know 49071 41497.24193626721
don 48290 43850.854478915724
for 47362 52364.353226404884 <-- increase
what 46221 73483.70680135512 <-- increase
didn 44793 11179.517051257248
think 42518 19066.90288432038
this 41598 60412.53263081548 <-- increase
are 37926 42973.06544886179 <-- increase
can 37084 39846.78412853941 <-- increase
with 36257 38338.81981074213 <-- increase
his 35673 17064.111051060263
on 35404 51944.97959418119 <-- increase
not 34972 43740.29882476427 <-- increase
her 33423 21676.844915118265
m 32329 48312.26568311476 <-- increase
my 31459 51048.86660395457 <-- increase
want 29463 20022.39637034032
told 28727 5296.513973554256
like 27819 30102.40119012533 <-- increase
at 27399 24364.425735303917
did 27199 17909.338412841887
has 25722 10076.924460296785
going 25540 15317.881523817805
as 24937 17656.581783517522
go 23405 28750.83973746701 <-- increase
the file: https://github.com/sidfc/Langua...ng_ordered.txt
There are also files for: deu, fra, spa, por, ita, pol, rus
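For anyone who wants to reproduce this kind of comparison, here is a minimal Python sketch. It is not the script used for the files above. The tokenizer (runs of letters, so don't becomes don + t), the function names, and the rule for the '<-- increase' flag (Tatoeba count below the proportional count, as the sample suggests) are assumptions.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of letters, so "don't" -> "don" + "t".
TOKEN_RE = re.compile(r"[^\W\d_]+")


def word_frequencies(sentences):
    """Count lowercase word tokens across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(TOKEN_RE.findall(sentence.lower()))
    return counts


def expected_counts(reference, target):
    """Scale the reference counts down (or up) to the size of the target corpus."""
    scale = sum(target.values()) / sum(reference.values())
    return {word: n * scale for word, n in reference.items()}


def report(reference, target):
    """Yield (word, target count, expected count, flag), ordered by target count."""
    expected = expected_counts(reference, target)
    for word, have in target.most_common():
        want = expected.get(word, 0.0)
        # Assumption: flag a word when it occurs less often than its proportional count.
        flag = "<-- increase" if have < want else ""
        yield word, have, want, flag
```

Here `reference` would be the counts from the OpenSubtitles sentences and `target` the counts from the Tatoeba sentences, both built with `word_frequencies`.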
These are all very common words that already have huge numbers of sentences. Is there a particular reason for actively adding many more to match the frequencies in the OpenSubtitles database?
[not needed anymore- removed by CK]
CK wrote: “Likely it would be more useful to find words that are high on word frequency lists that are missing from the Tatoeba Corpus. Perhaps you could generate such lists, putting the words in frequency order.”
I strongly agree with this suggestion. There are various frequency dictionaries for individual languages published by Routledge or by the Leipziger Universitätsverlag. Typically, these dictionaries list a large number of words (with definitions to prevent confusion with respect to homographs that have distinct meanings) — 5,000 or 10,000 or so. It would be nice to find words on the Routledge or Leipzig lists that are not yet found in the Tatoeba database for that particular language. These missing words could then be arranged in frequency order based on one of these published frequency dictionaries.
As CK notes, the order of the first 100 words or so is not very significant; the frequency depends to a great degree on the particular database selected — whether words are taken from newspaper text, from fiction works, from scientific papers, or transcribed from oral conversations. Indeed, when compiling concordances to works like the Bible or Shakespeare, etc., the most frequent 100 or so words in that corpus are placed on a “stop list” to be ignored by the computer preparing the concordance.
When learning a language, however, it can be very helpful to prioritize the most common words. Thus, even knowing 1,000 or 2,000 words can dramatically boost one's ability to understand that language and to speak fluently. Because Esperanto has extremely regular word-formation rules, a vocabulary of as few as 600 or 700 Esperanto words can be the equivalent of knowing 2,000 words in German, French, Russian or Spanish, etc.
An interesting article (at https://glanier.wordpress.com/2...arning-greek/) points out that introductory Greek courses often focus on the 310 most frequent words encountered in the New Testament, which enables a student “to read 80% of the NT without using a dictionary.”
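The same kind of coverage figure can be estimated for any corpus once a frequency list exists. The following is only a rough Python sketch of the idea, not the method used in the article; the function name and the sample counts are made up for illustration.

```python
from collections import Counter


def words_needed_for_coverage(counts, target=0.8):
    """Return how many of the most frequent words account for `target` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(counts)


# Made-up counts, only to show the call.
sample = Counter({"the": 50, "and": 30, "of": 15, "proxy": 5})
needed = words_needed_for_coverage(sample, target=0.75)
```

With real counts, `needed` tells a learner how many of the top words must be known to recognize that share of the running text.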
If our user @MacGyver were able to generate, say, lists of the most common words (frequency order from 100 to 1,000) for English, Italian, Russian, Turkish and Esperanto (the five languages at Tatoeba which currently have the most sentences each) and compare the lists with the Tatoeba database, we would learn which particular high-frequency words are underrepresented in the Tatoeba database. Then contributors motivated to create sentences could try to focus on sentences utilizing those words. I think this might greatly improve the utility of Tatoeba to language learners using the strategy of first learning the most frequently spoken words.
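A comparison like this could be sketched in Python roughly as follows. This is only an illustration: `ranked_words` stands for a published frequency list (most frequent word first), `corpus_counts` for word counts taken from Tatoeba, and both names, like the `min_count` threshold, are hypothetical.

```python
def missing_words(ranked_words, corpus_counts, first_rank=100, last_rank=1000, min_count=1):
    """Return (rank, word, count) for words in the rank window seen fewer than min_count times.

    The result keeps the order of the frequency list, so the most frequent
    missing words come first.
    """
    result = []
    for rank, word in enumerate(ranked_words, start=1):
        if rank < first_rank:
            continue
        if rank > last_rank:
            break
        count = corpus_counts.get(word, 0)
        if count < min_count:
            result.append((rank, word, count))
    return result
```

Raising `min_count` (say, to 10) would also catch words that are present but rare, not only the ones that are missing entirely.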
Quite unrelated, but I had to check what a concordance or compiling one means. Could you add a sentence or two to this effect to Tatoeba? These would be precisely the kind of material that an advanced learner finds useful.
I also added "compile" and "concordance" to my vocabulary.
I got a few lists of words online and compiled the following files:
A list with the top ~3k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt
A list with the top ~10k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt
They are organized as follows:
column 1 = the word
column 2 = occurrences of the word in OpenSubtitles.com (the file is ordered by this column)
column 3 = occurrences of the word in Tatoeba (only sentences by native users were considered)
column 4 = indicates how many times the word 'should' appear in Tatoeba
column 5 = only filled in for words that have less than 50% of the occurrences they 'should have' in Tatoeba
As an example, the words 'indistinct', 'limitation', 'restriction', 'annulment', 'inaudible', 'flare', 'abduction', 'depot', 'decoy', 'deposition', 'cheater', 'retainer', 'hypothetically', 'caress', 'rebound', 'sleepover', 'riddance', 'relive', 'proxy', 'onward', 'visitation', 'envoy', 'reptile', 'viewer', 'proclaim', 'retrieval', 'canvass', 'caterer', 'abduct' and 'withhold' have ZERO occurrences on this site (considering only sentences by natives).
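If you would rather filter the files with a script than read them by eye, something like the Python sketch below might work. It assumes the columns are separated by whitespace and that column 5 may be absent; I have not checked this against the actual files, and the sample lines are made up.

```python
def parse_row(line):
    """Split one line into (word, reference count, Tatoeba count, expected count)."""
    fields = line.split()
    word = fields[0]
    reference_count = int(fields[1])
    tatoeba_count = int(fields[2])
    expected = float(fields[3])
    return word, reference_count, tatoeba_count, expected


def underrepresented(lines, ratio=0.5):
    """Return words whose Tatoeba count is below `ratio` of the expected count."""
    words = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, _, have, want = parse_row(line)
        if have < ratio * want:
            words.append(word)
    return words


# Made-up lines in the five-column layout described above.
sample_lines = [
    "proxy 120 0 3.5 <-- increase",
    "tom 900 5000 26.0",
]
low = underrepresented(sample_lines)
```

Setting `ratio=0.5` mirrors the 50% rule for column 5; a stricter or looser cutoff only changes that one argument.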
I only use OpSub to measure the frequency of words, not as a source of words (there are too many wrong words in there). So, there isn't much I can do in order to generate a good/useful list of most frequent words (in any language).
* UPDATE *
Now using data from the British National Corpus.
A list with the top 30k words ordered by frequency of occurrence in the BNC corpus: https://github.com/sidfc/Langua...atoeba_v01.txt
You need to look at the second column of numbers (third column from left to right) to find words that have a low number of occurrences in Tatoeba.
MacGyver - Would it be possible to generate a list of words in Portuguese that have not been "posted" on Tatoeba?