To be relevant, a corpus of example sentences should cover the words that are of most interest in each language.
However, even in English, Tatoeba is still far from achieving this. For example, the word 'compliance' is one of the top 50 most searched for words on the Linguee online dictionary but only has 6 sentences on Tatoeba. These sentences are only translated into seven languages and none of them into languages as important as Spanish, Portuguese or French. At the same time, 'dog' is only the 9357th most searched word on Linguee but has more than 5000 sentences on Tatoeba.
So it seems that we contributors spend too much time creating and translating sentences that will never be read by non-contributing Tatoeba users. On the other hand, we probably don't pay enough attention to the most requested words, such as 'relevant', 'scope', 'enhance' or 'furthermore'.
Perhaps a system that better balances the supply and demand of sentences could be set up. Starting from the most searched words on Linguee or Tatoeba, we could suggest the most underused words to contributors who would like to maximize their impact on the corpus enhancement. For example, for a given language, we could highlight the 100 most requested words that do not have at least 20 sentences translated in the ten most important languages of the corpus.
Has a similar proposal ever been debated? Do you think it would be worth implementing?
Data source: https://www.linguee.com/english...ish/1-200.html
I think that it would be quite worthwhile to find a way to cover frequently requested but underused words. However, I believe that it would be best to do this through coordinated communication rather than by trying to change the site's software. It takes a long time to agree upon and implement software changes. Furthermore, if this initiative is successful, it will remove some of its own basis for existence, another reason we wouldn't want it "baked into" our software.
By scraping tatoeba.org, I identified the 20 most-searched words on Linguee that are underused on Tatoeba (fewer than 20 sentences containing the given word).
linguee_rank  word         nb_sentences
17            scope        19
23            compliance   6
44            leverage     12
54            furthermore  11
85            default      14
103           retention    2
107           facilitate   10
141           procurement  16
150           pending      13
205           stakeholder  2
207           vendor       16
259           invoice      17
266           alignment    11
268           incentive    15
277           framework    17
281           align        11
282           equity       7
290           gauge        19
294           venue        7
295           mitigate     12
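For anyone who wants to reproduce this kind of table, here is a minimal sketch of the selection step in Python. It is not the script used for the table above. The function name `underused_words` and the two input dictionaries are hypothetical, and it assumes you already have a Linguee rank and a Tatoeba sentence count for each word.

```python
from typing import Dict, List, Tuple


def underused_words(
    linguee_rank: Dict[str, int],
    sentence_counts: Dict[str, int],
    min_sentences: int = 20,
    limit: int = 20,
) -> List[Tuple[int, str, int]]:
    """Return (rank, word, count) rows for the best-ranked words under the threshold."""
    rows = []
    for word, rank in linguee_rank.items():
        count = sentence_counts.get(word, 0)  # a word with no sentences counts as 0
        if count < min_sentences:
            rows.append((rank, word, count))
    rows.sort()  # lowest rank number (most searched) first
    return rows[:limit]


# Illustrative input. The ranks and the counts for "scope" and "compliance"
# reuse figures quoted in this thread; the count for "dog" is a placeholder
# standing in for "more than 5000".
linguee_rank = {"scope": 17, "compliance": 23, "dog": 9357}
sentence_counts = {"scope": 19, "compliance": 6, "dog": 5000}

for rank, word, count in underused_words(linguee_rank, sentence_counts):
    print(rank, word, count)
```

Lowering `min_sentences` or raising `limit` changes which words are suggested without touching the rest of the logic.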
Maybe CK could build a table from this data so that users can easily add new sentences and translations?
If it gains traction, we could update this list on a weekly basis.
Note that it's possible to get all sentences as a single file from the downloads page (https://tatoeba.org/eng/downloads), and thanks to rumpelstilzchen's work it's now also easy to download only the sentences in a certain language.
"Scraping tatoeba.org" sounds like you wrote a script, so using that file might be a more comfortable way to collect frequency statistics.
Good point. On the other hand, by scraping the data, you benefit from the Tatoeba advanced search based on text stemming right away.
You'll notice that "scope" actually has more than 19 sentences.
You forgot to include the orphans.
Orphans are excluded automatically on the assumption that they are less trustworthy than owned sentences, even though some are likely better than some non-native owned sentences.
Orphans are included in the exported data.
(Translated from Finnish.) Would some English speaker like to adopt the rest of these orphans too, perhaps after editing them?
A few things to consider:
(1) While words with many different senses, like "set", need a wide variety of sentences, even ten sentences is pretty good for getting an idea of how to use a word like "mitigate", as long as the sentences are not near-duplicates of each other. Of course, there are a lot of near-duplicates in our sentences, and the number keeps increasing.
(2) The stemmer misses some correspondences. Thus, a search for "retention" misses "retain", and a search for "retain" finds 36 hits.
(1) It is true that it is difficult to define a threshold suitable for both monosemic and polysemic words. It may be preferable to start with a low threshold at 10 and then raise it to 20 in a second round.
(2) If the stemmer is imperfect, it is probably better to use the exact match. This is the option chosen on this page: https://tatoeba.org/eng/Vocabul..._sentences/eng . Another advantage is that it makes it easier to process the data in the script.
With these parameters, we get the following list for English:
linguee_rank  word         nb_sentences
23            compliance   7
44            leverage     9
103           retention    1
107           facilitate   7
141           procurement  1
205           stakeholder  1
206           comprise     9
228           evaluation   9
266           alignment    4
268           incentive    9
281           align        4
282           equity       8
294           venue        5
295           mitigate     2
299           liability    9
314           preliminary  6
321           hub          9
337           offset       2
343           amend        4
344           retrieve     9
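For those who prefer working from the exported sentences rather than scraping, exact-match counting can be sketched roughly as follows. This is only an illustration, not the script behind the list above. It assumes the export is tab-separated with three fields per line (sentence id, language code, text); the sample sentences and the function name `count_sentences_per_word` are made up for the example.

```python
import re
from collections import Counter
from typing import Iterable


def count_sentences_per_word(lines: Iterable[str], lang: str = "eng") -> Counter:
    """Count, for each word form, how many sentences in `lang` contain it exactly."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3 or parts[1] != lang:
            continue  # skip malformed lines and other languages
        # A set ensures each sentence is counted once per word, even if the
        # word occurs several times in it. No stemming is applied.
        counts.update(set(re.findall(r"\w+", parts[2].lower())))
    return counts


# Made-up sample in the assumed "id<TAB>lang<TAB>text" layout.
sample = [
    "1\teng\tThe retention rate improved this year.",
    "2\teng\tWe retain every invoice for ten years.",
    "3\tfra\tIl faut engendrer de la confiance.",
    "4\teng\tRetention, retention, retention!",
]

counts = count_sentences_per_word(sample)
threshold = 10  # the lower first-round threshold suggested above
wanted = [w for w in ("retention", "retain") if counts[w] < threshold]
print(counts["retention"], counts["retain"], wanted)
```

Because matching is exact, "retention" and "retain" are counted separately, which is the behavior the stemmer-based search does not give you.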
I think Linguee and Tatoeba are too different as projects to be compared directly.
You are talking about "words that are of most interest". But to whom? It depends on the person, on your mother tongue and culture, on what you are looking for, etc. Does having more people querying a word really make that word more important? (open question)
You are talking about "languages as important as Spanish, Portuguese or French", but what makes them so important? Again, it depends on the person, on your mother tongue and culture, on what you are looking for.
Linguee is good at answering the demand of the market of translations, but that is a specific and restricted view on languages. For example, Linguee will help me a lot if I have to translate legal papers, but it won’t help me if my job is to create subtitles. Tatoeba has a broader approach that makes it, by design, less efficient on specific fields.
If I am a translator working with English, Spanish or French, I’d rather use Linguee than Tatoeba to help me with a translation job. In contrast, if I am a beginner learning Esperanto, or even Portuguese, I’d rather use Tatoeba. Or if I’m looking for sentence pairs I can legally re-use in my online courses or learning app, I’d rather use Tatoeba.
Note that I am not trying to denigrate your comment. To the contrary, I think you are raising important questions. I just want to show that you (as well as everyone, including me) have a bias about what is "relevant" or "important" or "should be on Tatoeba".
You are right, Linguee is different from Tatoeba. I used Linguee's rankings for lack of anything better. Of course, if Tatoeba were mainly used to translate Estonian subtitles into Esperanto, then more Estonian sentences would have to be created. ^^
By the way, it would also be interesting to know which words are the most searched for on Tatoeba. Is such data compiled somewhere?
In any case, vocabulary related to the business world is not well covered by contributors, and this is detrimental to the completeness of the corpus. I am pretty confident that for a developer "looking for sentence pairs I can legally re-use in my online courses or learning app", exhaustiveness, or at least broad diversity, is not a minor detail.
We have logs of all queries made to the search engine up to one year ago, but it’s just raw data and it’s not publicly available. I can anonymize this data and put it online if you want to play around with it.
Having access to this data would be great. It would enable me to update my table with even more pertinent data.
I compiled the search query logs into a CSV file: https://downloads.tatoeba.org/s...ueries.csv.bz2
Note that for technical reasons each query is reported twice, so you want to divide numbers by two.
I didn’t include empty searches. I also only included queries on page 1 (that is to say, browsing page 2, 3, 4 etc. of the search results is not counted as actual queries).
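As a rough illustration of the "divide by two" step, here is a small Python sketch. The column layout of the real file is not described in this thread, so the two-column layout used below (language code, query text, one row per logged query) is purely an assumption, as are the sample rows and the function name `top_queries`.

```python
import csv
import io
from collections import Counter
from typing import List, Tuple


def top_queries(csv_text: str, lang: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent queries for `lang`, with counts halved."""
    counts: Counter = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip rows that do not fit the assumed layout
        row_lang, query = row
        query = query.strip().lower()
        if row_lang == lang and query:
            counts[query] += 1
    # Each query is reported twice in the log, so halve every count.
    return [(query, count // 2) for query, count in counts.most_common(n)]


# Made-up log in the assumed "lang,query" layout, every search appearing twice.
sample_log = (
    "eng,scope\neng,scope\n"
    "eng,dog\neng,dog\neng,dog\neng,dog\n"
    "fra,chien\nfra,chien\n"
)

print(top_queries(sample_log, "eng"))
```

If the real file already stores one aggregated count per query, the same halving applies to that count column instead of to a row tally.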
Thank you, gillux, for responding so quickly. I have started to analyze the data and I can already confirm that the ranking of the English queries is very different from the Linguee one. Here is the top 10:
There is still a lot of data cleaning to be done to be able to find the queries that don't get sentences on Tatoeba, but I think we will come up with something usable. I'll keep you posted.
From an English learner's point of view, lists such as the NGSL are likely much more useful.
Here is an easy way to browse sentences from the Tatoeba Corpus in 2014.
At the time I built this set of pages, List 907 had a little over 300,000 sentences and 123,278 of these had audio.
There are now over 750,000 sentences on List 907 and over 490,000 English sentences with audio. My plan is to rebuild these pages after I get to 500,000 English sentences with audio.
(Translated from Finnish.) A different target audience, probably. The NGSL presumably starts with the most common words (which are easy to learn from anywhere), whereas the source mentioned in the opening post seems to comprise words whose meaning is unclear to people. Presumably the people interested in them have already moved past the problems of beginner vocabulary.
I'm not exactly sure what you're saying since machine translation doesn't always correctly translate things, but perhaps you missed several of the other lists on that site, and the fact that my 2014 set of pages listed some of the upper-level word families numbered-sequentially after the main general list word families.
NEW ACADEMIC WORD LIST
TOEIC WORD LIST
BUSINESS SERVICE LIST
Briefly, for those who don't want to read the whole NGSL website...
A person mastering the 2,368 "word families" (not just single words) should be able to understand up to 92% of what they read.
A word family is something like this.
friend, friends, friendly, unfriendly, friendless, friendship
If you also can master the additional 960 word families on the NAWL, you should be able to handle college textbooks, academic journals, etc. About 92% coverage, since you will also need to learn vocabulary specific to your field of study.
The Business Service List is similar to the NAWL, except it's aimed at business English. An additional 1,700 "word families" over the 2,368 on the NGSL should give you about 97% coverage.
The main purpose of such lists is to help people studying English focus on the basic "word families" that are most frequently used, so people can get to the point where they can understand a very high percentage as quickly as possible.
(This is just a brief explanation. Read the website for more details.)
(Translated from Finnish.) Got it. I glanced at your site and it mostly showed words from the general list. The other lists do seem to be there too, but their vocabulary also looks fairly general, which is presumably their purpose.
And of course, you are welcome to add more sentences based on your idea of what words should be better covered on Tatoeba. ^^
That's what I started doing.
With this suggestion, I'm just trying to contribute by providing pointers for those who lack inspiration and would like to make sure that they create or translate sentences that people want to read.
Just two points:
- You could add vocabulary items for words that you want more represented. Right now, the vocabulary feature is difficult to use efficiently (but it will change!) but that way you can directly follow the number of sentences for each of the words you're interested in and see what sentences exist. While your limit was 20, the limit for "sentences wanted" is 10, meaning that words appearing in more than ten sentences will not appear on the "wanted" page.
- Please do not add sentences from Linguee directly ^^ (except those whose source is compatible with Tatoeba)
I'm glad you mentioned this feature, and glad to learn that it will be improved soon. I'm already using it and I like it a lot. While browsing it for different languages, I noticed that it could perhaps be improved in two ways.
- Sometimes the requested expressions are formulated in such a way that it is impossible to fulfill the request: spelling mistakes, use of the equals sign to specify the meaning of the word... If requests had a limited lifetime, these problematic expressions would expire naturally.
- I also noticed that for some languages, this feature is not used much. Maybe the lists could be filled automatically with the underused expressions searched on Tatoeba at least x times during the last y days by different users. This option would have the advantage of better connecting ordinary users with the corpus contributors, and it would be applicable to all languages.
Thanks to your feedback, I have created a new tool that can be found at https://tatominer.imfast.io.
It will help contributors who want to diversify Tatoeba's vocabulary while paying special attention to expressions that are often searched for by users of bilingual dictionaries.
This is nice. Would it be possible to provide a drop-down to select the language? Most people will probably only be interested in one language at a time. While it's possible to sort by language and then page to the place where sentences in the desired language start, it's a cumbersome process.
There is a search box on the top-right corner of the table that enables you to filter content. For example, typing 'eng' will filter all English sentences.
That works, but it's not the behavior I would have expected. For one thing, searching for "eng" also brings up non-English words containing "eng".
> That's the default for the datatables jQuery plugin.
Whether or not it's the default for the underlying software, expected behavior for a UI is for a search field with typed text to search for text that can vary freely, while a drop-down is used to search for constrained text. Not only does using the same search field for both violate my expectation, but as I said, it mixes up the results. If I search for "eng", I find, in addition to the English words, the French word "engendrer". Similarly, a search for "fra" brings up a Portuguese and a German term that happen to contain the string "fra". It's only a matter of luck that this particular vocabulary list and list of language identifiers don't overlap more.
Naturally, the tool can be used in its current form, and I appreciate the fact that Ibdx set it up for us. I'm just suggesting that separating the specification of the language from the specification of the entry would make the tool more usable.
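To make the difference concrete, here is a small Python sketch contrasting the two behaviors. It does not use the DataTables plugin itself; it only mimics a free-text filter that matches any cell versus a filter on the language column alone. The rows are invented for the example (the Portuguese and German terms are placeholders, not the ones actually on the page), and both function names are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, str]  # (language code, expression)

# Invented rows; only "engendrer" is mentioned in this thread.
rows: List[Row] = [
    ("eng", "scope"),
    ("fra", "engendrer"),
    ("por", "frase"),
    ("deu", "frage"),
]


def substring_filter(rows: List[Row], text: str) -> List[Row]:
    """Mimic a free-text search box: keep rows where any cell contains the text."""
    needle = text.lower()
    return [row for row in rows if any(needle in cell.lower() for cell in row)]


def language_filter(rows: List[Row], lang: str) -> List[Row]:
    """Mimic a drop-down: keep rows whose language code matches exactly."""
    return [row for row in rows if row[0] == lang]


print(substring_filter(rows, "eng"))  # also returns the French "engendrer"
print(language_filter(rows, "eng"))   # returns English rows only
```

A drop-down backed by something like `language_filter` would leave the search box free for the expression itself.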
I may have a look at it from time to time. How often do you plan to update the data (if at all)?
That's a good tool to get some inspiration sometimes.
[EDIT] Also note that some of your entries are incorrect, but that's not an error on your side. For example, there are obvious ones like "linguee" or "so", and ones that are wrong because of Linguee, like "oeuvre", which should be "œuvre".
Updating this web page is not really an issue for me since the process is fully automated. As both the Tatoeba downloads page and the Linguee rankings are updated weekly, a new version every Saturday seems doable.
As for the few errors, you rightly noticed that they come from the Linguee data. Sadly, I don’t see any easy way to spot them programmatically for now.
I added sentences for 50 of the words.
Thank you, the quality of your sentences is of great benefit to Tatoeba.
Thanks! Your list is very helpful. I ended up writing a sentence for each phrase. I hope others will do the same.
One thing I note is that the phrase "portugues" (in Portuguese) is on the list. This is simply a misspelled version of "português". I would expect the correctly-spelled version of a language's name within that language to be fairly common, so I would be surprised if it would appear on the list. I didn't encounter any English misspellings, though I did run into the rare words "wether" and "therefor", which may be frequent in queries only as misspellings of the more common "whether" and "therefore".
I also noticed that there are too many misspelled words among the phrases with zero occurrences in the Tatoeba corpus, so I will remove all of them in the next version of the table.