*Why and how to add sentences with words others want to see*
This is a collection of my thoughts about this subject; maybe someone else is also interested.
1. Go to https://tatoeba.org/swe/Vocabulary/add_sentences/ and choose your native (or very strong) language.
2. You have basically two options: Start from words that are part of no or few sentences (the beginning of the list) or from those that are part of a bit less than ten sentences (the end of the list).
3. Write a sentence or several sentences that use the words.
a. You are adding content that someone wants to see, or at least wanted to see.
b. You are adding to the diversity of the corpus. Since these words are not present (in great quantities), adding more sentences that use them makes the corpus more diverse, by one reasonable measure of diversity.
Furthermore, you are probably adding sentences that you would not otherwise have added; this is a collaboration between two people, one requesting a word and another adding a sentence. This creates more diverse sentences.
If several people add sentences that use a particular word, they are likely to add different types of sentences, hence increasing the diversity even more.
c. You are adding original sentences in the language. In some smaller languages, many sentences are translations. Translated sentences typically have simpler or at least restricted grammar and other distinguishing features. (When searching for more on this, "translationese" is a good keyword to begin with.) Having original sentences in all languages is valuable for the corpus and contributes to the diversity and the quality of the corpus. Of course, having translations is valuable, too, for obvious reasons.
A. Starting from the beginning of the list
Pros: You will be adding sentences that use words which are almost non-existent in the corpus.
Cons: You will meet the same word again and again. If the word appears in no sentences and you add one, then that word now has one sentence and you will face it again when contributing sentences to words with a single sentence only. And again when adding sentences to words with two sentences, etc.
(The page is not dynamic, but if you refresh it or open the next page, the words might have changed positions. Unless you open all the pages at once, this is likely to happen.)
Some languages have poor words - even phrases in other languages - at the beginning of the list. One has to scroll past them.
B. Starting from the end of the list, from words that appear in nine, eight, seven, etc. sentences
Pros: When you add a sentence that uses a word used nine times, it will (later and after a refresh) vanish from the page, since it is used in ten or more sentences. This makes the list shorter, which feels nice. Measurable progress.
The words with some sentences are less likely to be in the wrong language, misspelled, very obscure, etc., so it is easier to contribute sentences that use them.
You will not meet the same words again and again when going through the list: each sentence you add raises the count of that word, and you then move on towards words that are used less and less.
Cons: You are not adding as much to the diversity as you could be, at first.
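The difference between the two strategies can be sketched with a toy model of the list. This is only an illustration under assumptions: the limit of ten follows the description above, while the function name, the word counts and the sort order are made up for the example and are not taken from Tatoeba's code.

```python
# Assumption from the post: a word leaves the list once ten or more sentences use it.
WANTED_BELOW = 10

def wanted_list(counts):
    """Return (word, count) pairs still on the list, fewest sentences first."""
    wanted = [(word, n) for word, n in counts.items() if n < WANTED_BELOW]
    return sorted(wanted, key=lambda pair: pair[1])

# Hypothetical sentence counts for three requested words.
counts = {"teratology": 0, "oddity": 4, "horse": 9}

counts["teratology"] += 1  # strategy A: the word stays on the list, now with one sentence
counts["horse"] += 1       # strategy B: the word reaches ten and drops off the list

print(wanted_list(counts))
```

After these two additions the toy list still contains "teratology" (now with one sentence) and "oddity", while "horse" has disappeared.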
Notes and tips:
All of the following are considered as different "words" for this feature: "horse", "horses", "can lead a horse to". I doubt capital letters matter, but I have not checked this with any rigour. Adding sentences with declined words will still contribute to the corpus and have all the other benefits explained above, but will not be counted by this feature of Tatoeba's user interface.
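A rough sketch of what exact-form counting means in practice. It assumes that matching is case-insensitive and works on whole words only; I have not verified either assumption, and `count_exact` is a hypothetical helper, not part of Tatoeba.

```python
import re

def count_exact(item, sentences):
    """Count the sentences that contain `item` as an exact word or phrase."""
    # Assumption: case does not matter and only whole words count.
    pattern = re.compile(r"(?<!\w)" + re.escape(item.lower()) + r"(?!\w)")
    return sum(1 for sentence in sentences if pattern.search(sentence.lower()))

sentences = [
    "I saw a horse.",
    "Horses eat hay.",
    "You can lead a horse to water.",
]

for item in ["horse", "horses", "can lead a horse to"]:
    print(item, count_exact(item, sentences))
```

In this sketch "Horses eat hay." counts towards "horses" but not towards "horse", which is the behaviour described above.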
Simply ignore words that you do not want to add sentences to, for whatever reason. Sometimes I put in a bit of effort to figure out unknown words in my native language, sometimes not.
It is not terribly useful to add phrases that do not give any indication about the meaning of the word. Thus, adding a sentence like "What is teratology?" is not very helpful, as it only suggests that "teratology" is a noun. Likewise, "Jacob likes teratology." is not terribly helpful in this context. "Jacob likes teratology and other oddities." is better.
Similarly, adding variations of a sentence where the word in question has the same role is not very useful in this context. E.g. "Teratology is related to deformities." and "Teratology has to do with deformities." would both be reasonable sentences, but adding both of them as sentences that use the word "teratology" does not reveal more about that word. Thus, it might not be the ideal thing to do. (Adding both of them as translations of a suitable sentence would, on the other hand, be completely fine and a useful thing to do.)
I think it is healthy for a single member to add a sentence or a few to any given word, but trying to clear the entire list by oneself might not be helpful. One is likely to start contributing similar sentences to any given word after the initial inspiration is exhausted. Having several members contribute a sentence or two each would be better for diversity. But if there is only a single member who adds sentences in this way for a given language, then so be it.
If it is difficult to come up with a sentence, I often take a look at a small group of words in the list and try to create a sentence that uses two or even three of them. Constraints breed creativity.
If there is a foreign word on the list, one can (but need not) add a sentence about its pronunciation, etymology, register, social position, meaning, language, script, etc. A single sentence like "Lentokonemekaanikko is a Finnish word." might be okay, but it is better to add more interesting sentences, such as "Lentokonemekaanikko is a Finnish word that combines three different words: "lento" means flight, "kone" machine and "mekaanikko" mechanic (as in a profession or person). The word means someone who repairs, maintains or builds aeroplanes."
Please excuse my English, but I hope the idea is clear: If there is a misspelled word, a foreign word, etc., it might be possible to contribute a meaningful sentence that uses it.
Another example: ""Thier" is a typical misspelling of "their"." would be a reasonable sentence, in my opinion. A sentence about why the misspelling is ubiquitous would make an even more interesting sentence, again in my opinion.
Thanks a lot for this.
Whether it was intentional or not, I think you've just made a very good case for why we should put more effort into the vocabulary feature :)
Not in a single sentence of your post have you suggested an improvement to the vocabulary feature; you have only described how to use it. And yet this post actually makes me really want to improve the feature. It convinces me a hundred times more than regular feature requests, which usually start with "It would be nice if..." or "I think it would be useful if...".
This is a beautiful post and I hope to see more of these.
You are welcome.
My main feature request is for someone to add sentences to the words I have added, but I know it is challenging to implement.
Is there a way to delete some "words" from the vocabulary list? Someone added a couple of sentences by mistake...
There is no way to do that right now. :-(
Thanks for the info though :)
A person can delete the words they have added via https://tatoeba.org/swe/vocabulary/of/user (replacing 'user' with their own username). I do not know if it is possible to see who has added the sentences there.
The stemming function of the vocabulary feature needs to be improved, I believe.
This is especially important for agglutinative languages like Turkish. I noticed it when I checked Ivanovb's vocabulary list (https://tatoeba.org/eng/vocabulary/of/Ivanovb). Some of the words he listed actually have more examples in the corpus than their counts on the vocabulary page, but the lack of stemming causes inconsistency.
Search links on the vocabulary page put an equals sign before each word, so sentences containing other conjugations of a verb, or other forms (singular or plural) of a noun, won't show up. This precision often doesn't work well and makes learning new vocabulary less efficient (at least for Turkish).
Same goes for Spanish.
What is your opinion on using the asterisk on Spanish vocabulary items? Would it be useful, or more likely cause confusion?
That would be helpful for nouns and adjectives (we have no cases or declensions in Spanish, only masculine, feminine, singular and plural forms), but not always for verbs, because we have many irregular verbs in which even the vowels in the stem might change, as in hacer (to do) - hago (I do) - hizo (he did).
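To illustrate, here is a rough sketch that treats a trailing asterisk as "any ending". The `matches` helper is hypothetical and only mimics the idea; it is not how Tatoeba's search is implemented.

```python
def matches(item, word):
    """Return True if `word` matches the vocabulary `item`.

    Assumption: a trailing asterisk means "any ending"; without it the
    match must be exact. Hypothetical helper, not Tatoeba's search.
    """
    if item.endswith("*"):
        return word.startswith(item[:-1])
    return word == item

# Regular noun: every gender and number form shares the prefix "gat".
for word in ["gato", "gata", "gatos", "gatas"]:
    print(word, matches("gat*", word))

# Irregular verb: the stem itself changes, so the prefix misses some forms.
for word in ["hacer", "hago", "hizo"]:
    print(word, matches("hac*", word))
```

In this sketch "gat*" covers all four noun forms, while "hac*" finds "hacer" but misses "hago" and "hizo".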
Thanks for your explanation.
As far as I know, this feature does not currently use stemming at all.
It is not clear if it should. On the one hand, stemming would prevent the same word from appearing multiple times in the sentences-wanted vocabulary, which might be nice, and it would allow for much more diverse sentences even when everyone adds words in the infinitive or another fixed form.
On the other hand, maybe someone wants examples of a particular form of a word. At least in Finnish, some words have taken a life of their own in phrases or otherwise and no longer have the same meaning as one would expect, and sometimes there are quite clear distinctions that stemming removes.
> Some of the words he listed have actually more examples in the corpus than their
> counts on the vocabulary page, but the lack of stemming causes inconsistency.
As Thanuir explained above, stemming can cause the reverse effect: in some cases, your vocabulary item would not show up in the "sentences wanted" because it would already have lots of sentences with other forms (but not the form you specifically requested).
We have not designed the vocabulary for all possible use cases yet. The only use case we designed for so far is that the user wants sentences that contain exactly what they have added as vocabulary and that having more than 10 sentences with exact match will be satisfying.
That is of course not the reality, but that is where we're starting from.
For stemming I don't fully picture the use case behind it. When you add a verb in the infinitive form, for instance, do you really want sentences with any possible conjugation of the verb? And even then, do you really want all sentences of Tatoeba that contain all the possible conjugations of the verb? You would probably just want a small set, a custom list with one or two examples for each form.
My guess is that your suggestion regarding stemming is not just about stemming in itself and simply enabling stemming would not be enough. Perhaps the combination of these issues would actually fulfill the needs behind your suggestion to improve stemming:
In Turkish, words usually end with suffixes and there are dozens of them.
Adding new vocabulary items in nominative or infinitive forms (as in dictionaries) would cause most examples to not appear with the default behavior. For example, if you added the word 'school' to your vocabulary items, you would see examples of 'my school', 'your school', 'to school', 'from school', 'at school' etc. but in Turkish, these are all shown with suffixes, so you would need to add a lot of different forms to see them. If your purpose was to learn new vocabulary rather than studying suffixes, it would create difficulty.
I'm quoting from @Thanuir's reply:
> On the other hand, maybe someone wants examples of a particular form of a word
That's right, but at least there could be some tips on the vocabulary page (similar to ones on the advanced search page), informing users about the precision (and hence limitation) of current design, and possible advantages of using an asterisk if they are not looking for only a particular form of a word. It could be used as a stemmer on many occasions. I had a look at the vocabulary items others added, but never saw one with an asterisk. Most users may not even be aware of it. I don't think they were all interested in only a particular form. It might be the case sometimes, but the other way around is more likely.
I encourage anyone who's interested in adding new vocabulary items with 4 or more letters in Turkish to use an asterisk: 'bare infinitive + *' for verbs, and 'nominative + *' for nouns and possibly for others. If it's a relatively long word ending with the letters p, ç, t, or k, even that last letter before the asterisk can be dropped to get more examples affected by consonant alternation, which is a common phenomenon in Turkish.
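The tip above can be sketched as a small helper. Everything named here is hypothetical: `wildcard_item` is not a Tatoeba function, and the cut-off for a "relatively long" word (5 letters) is only my assumption for the example, so adjust it to taste.

```python
ALTERNATING = ("p", "ç", "t", "k")  # final consonants that may alternate before a suffix

def wildcard_item(word, long_word_length=5):
    """Build a vocabulary item with a trailing asterisk, following the tip above.

    Assumption: `long_word_length` stands in for "relatively long";
    the value 5 is a guess, not something the tip specifies.
    """
    if len(word) < 4:
        return word  # the tip only covers words with 4 or more letters
    if len(word) >= long_word_length and word.endswith(ALTERNATING):
        return word[:-1] + "*"  # drop the alternating consonant
    return word + "*"

print(wildcard_item("okul"))   # school
print(wildcard_item("kitap"))  # book

# "kita*" also covers forms where p has become b, such as "kitabım" (my book).
prefix = wildcard_item("kitap").rstrip("*")
for form in ["kitap", "kitabım", "kitaplar"]:
    print(form, form.startswith(prefix))
```

With these assumptions "okul" becomes "okul*" and "kitap" becomes "kita*", which also covers "kitabım".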
To use the vocabulary feature more efficiently, other users can share similar tips about their languages on the Wall, too. What works for one language may not work with another.
I did not know that an asterisk could be used. Thanks. Similar tips seem to apply to Finnish as to Turkish.