Wall (5416 threads)

CK
2 days ago - 2 days ago
I'm looking for volunteers to help me tag English sentences.

I have a large list of sentences that need both the "imperative" tag and the "present simple" tag.

If you think it might be interesting to tag such sentences, and perhaps translate them at the same time, please send me a private message, and I'll send you a list of clickable links to sentences.

https://tatoeba.org/eng/private_messages/write/CK


Note that you need to be an "advanced contributor" (or higher) to tag sentences.
https://en.wiki.tatoeba.org/art...d-contributors
Smoky
yesterday
Do you provide free snacks and drinks to volunteers?
soliloquist
yesterday
I thought you were using some bot/script for the 'List 907' tag. It requires a lot of effort to tag that many sentences one by one. I, too, have thousands of sentences that need to be tagged, but it's discouraging having to visit each sentence's page.

Let's hope the mass-tagging feature will be implemented in the future.

https://github.com/Tatoeba/tatoeba2/issues/785
Guybrush88
yesterday
I agree with this. It would be a very useful feature for me too, since I tag Italian sentences from time to time.
soliloquist
yesterday
Yes, that would be a great time-saver for you, too. You're the second most active user after CK in terms of tagging. (+182,756)

https://tatoeba.j-langtools.com...art/chart2.php
Ricardo14
yesterday
Me too. I do want to tag many sentences.

For example, whenever a Portuguese sentence begins with "Eu", it could be tagged as "1st Person Singular".
Austrália, França, Canadá - country
amo, quero, vivo, estou - presente do indicativo (present indicative)

and so on...
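
The rules above could be prototyped as a small script that only suggests tags for a human to review. This is a minimal sketch: the thread says mass tagging is not available on Tatoeba, so the `suggest_tags` function and the word lists are hypothetical, built only from the examples given here.

```python
import re

# Hypothetical word lists, taken only from the examples in this post.
COUNTRIES = {"austrália", "frança", "canadá"}
PRESENT_INDICATIVE = {"amo", "quero", "vivo", "estou"}


def suggest_tags(sentence):
    """Return candidate tags for one Portuguese sentence."""
    # Split into lowercase words; \w matches accented letters in Python 3.
    words = re.findall(r"\w+", sentence.lower())
    tags = []
    # Compare the whole first word so that "Europa" does not match "Eu".
    if words and words[0] == "eu":
        tags.append("1st Person Singular")
    if COUNTRIES.intersection(words):
        tags.append("country")
    if PRESENT_INDICATIVE.intersection(words):
        tags.append("presente do indicativo")
    return tags
```

Word lists this crude will mislabel some sentences ("vivo" can also be an adjective), so the output is best treated as a list of candidates rather than final tags.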
CK
15 hours ago
Note that it is possible to search for Portuguese sentences starting with "Eu."

https://tatoeba.org/eng/sentenc...eng&sort=words
Ricardo14
yesterday
I have updated the request on GitHub - https://github.com/Tatoeba/tato...ment-512976729
soliloquist
yesterday - yesterday
Thanks!

Edit: I made a suggestion.

https://github.com/Tatoeba/tato...ment-512998686
MacGyver
5 days ago - 4 days ago
If you're looking for words to write example sentences for Tatoeba, then you should look at the arrows '<-- increase'. They indicate the words that should appear more in the Tatoeba Corpus. The second column of numbers indicates how many times that word should appear in Tatoeba in order to have the same frequency (proportionally) it has in the OpenSubtitles.com corpus.

I downloaded a file from http://opus.nlpl.eu/OpenSubtitles-v2018.php with 441.5M sentences in English and wrote a script to create a frequency list of words (the list is unformatted, so contractions are split: don't -> don + t, etc.). I did the same thing with all the English sentences owned by native speakers here in the Tatoeba corpus. After comparing the two frequency lists, I compiled a file that gives you an idea of how the frequency of the words in Tatoeba/English compares to that of the OpenSubtitles/English corpus:

A sample of the file:
word \t occurrences in Tatoeba \t how many there should be (proportionally)
tom 359494 723.9471614662318
i 303688 284673.5980557653
to 294034 174335.06447556426
that 234251 105369.07199955775
the 199956 231733.21163537377 <-- increase
t 197499 101493.03356767741
you 176213 305392.1646249775 <-- increase
mary 142565 625.8323390462842
a 130684 149839.41303818647 <-- increase
do 120631 46119.76310532704
he 106705 57203.45769471371
is 106175 74813.6430558621
and 89045 104790.98523257893 <-- increase
was 76168 43791.06207458618
s 75846 152917.0533331018 <-- increase
in 73907 75330.66885961362 <-- increase
it 70138 140934.43701484663 <-- increase
of 68495 90226.5530729227 <-- increase
she 64731 27836.557998979915
be 60860 43264.01316312008
they 59631 32147.915102217754
me 56818 68010.83329256669 <-- increase
have 53539 48189.83034799129
said 53431 8299.380929768524
we 49383 72168.9096779763 <-- increase
know 49071 41497.24193626721
don 48290 43850.854478915724
for 47362 52364.353226404884 <-- increase
what 46221 73483.70680135512 <-- increase
didn 44793 11179.517051257248
think 42518 19066.90288432038
this 41598 60412.53263081548 <-- increase
are 37926 42973.06544886179 <-- increase
can 37084 39846.78412853941 <-- increase
with 36257 38338.81981074213 <-- increase
his 35673 17064.111051060263
on 35404 51944.97959418119 <-- increase
not 34972 43740.29882476427 <-- increase
her 33423 21676.844915118265
m 32329 48312.26568311476 <-- increase
my 31459 51048.86660395457 <-- increase
want 29463 20022.39637034032
told 28727 5296.513973554256
like 27819 30102.40119012533 <-- increase
at 27399 24364.425735303917
did 27199 17909.338412841887
has 25722 10076.924460296785
going 25540 15317.881523817805
as 24937 17656.581783517522
go 23405 28750.83973746701 <-- increase


the file: https://github.com/sidfc/Langua...ng_ordered.txt

There are also files for: deu, fra, spa, por, ita, pol, rus
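
The comparison described above can be sketched roughly as follows. This is not the actual script, which isn't shown in the thread; the function names are made up, and the tokenizer simply keeps runs of ASCII letters, which reproduces the don't -> don + t splitting mentioned above. The "should be" figure is assumed to be each word's reference count scaled by the ratio of the two corpus sizes.

```python
import re
from collections import Counter


def word_counts(sentences):
    """Count lowercase word tokens; apostrophes split words (don't -> don, t)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def compare(tatoeba, reference):
    """Return (word, count in Tatoeba, expected count, flag) rows.

    The expected count is what the word would need in Tatoeba to have the
    same relative frequency as in the reference corpus.
    """
    tatoeba_total = sum(tatoeba.values())
    reference_total = sum(reference.values())
    if reference_total == 0:
        raise ValueError("reference corpus is empty")
    rows = []
    for word, have in tatoeba.most_common():
        expected = reference[word] * tatoeba_total / reference_total
        flag = "<-- increase" if have < expected else ""
        rows.append((word, have, expected, flag))
    return rows
```

Calling `compare(word_counts(tatoeba_sentences), word_counts(subtitle_sentences))` yields rows in the same order as the sample: most frequent Tatoeba words first.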
Thanuir
4 days ago
These are all very common words with huge numbers of sentences. Is there a particular reason for actively adding many more to get the same frequencies as the OpenSubtitles database?
CK
4 days ago
I wondered the same thing.

Here are the first 100 words with "<-- increase".

the, you, a, and, in, it, of, me, we, for, what, this, are, can, with, on, not, my, like, go, him, your, there, if, about, here, all, one, get, out, up, from, good, just, but, no, them, an, so, let, now, more, say, got, where, see, come, back, some, too, something, take, people, right, make, our, way, or, well, into, please, look, give, over, off, find, new, must, little, other, put, first, after, down, love, old, years, things, night, am, even, believe, man, two, life, away, being, nothing, came, wrong, these, father, understand, feel, looking, wait, stop, because, thing, call

Likely it would be more useful to find words that are high on word frequency lists that are missing from the Tatoeba Corpus. Perhaps you could generate such lists, putting the words in frequency order.
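
One way to generate such a list, as a rough sketch: take any frequency-ordered word list and keep the words that never occur in the Tatoeba sentences. The function name and the choice of tokenizer are assumptions made for illustration.

```python
import re


def missing_words(frequency_list, sentences):
    """Return the words from frequency_list that appear in no sentence.

    frequency_list is assumed to be ordered from most to least frequent,
    so the result keeps that order.
    """
    seen = set()
    for sentence in sentences:
        seen.update(re.findall(r"[a-z]+", sentence.lower()))
    return [word for word in frequency_list if word.lower() not in seen]
```

Because the input order is preserved, the most frequent missing words come first in the result.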


Objectivesea
4 days ago
CK wrote: “Likely it would be more useful to find words that are high on word frequency lists that are missing from the Tatoeba Corpus. Perhaps you could generate such lists, putting the words in frequency order.”

I strongly agree with this suggestion. There are various frequency dictionaries for individual languages published by Routledge or by the Leipziger Universitätsverlag. Typically, these dictionaries list a large number of words (with definitions to prevent confusion with respect to homographs that have distinct meanings) — 5,000 or 10,000 or so. It would be nice to find words on the Routledge or Leipzig lists that are not yet found in the Tatoeba database for that particular language. These missing words could then be arranged in frequency order based on one of these published frequency dictionaries.

As CK notes, the order of the first 100 words or so is not very significant; the frequency depends to a great degree on the particular database selected — whether words are taken from newspaper text, from fiction works, from scientific papers, or transcribed from oral conversations. Indeed, when compiling concordances to works like the Bible or Shakespeare, etc., the most frequent 100 or so words in that corpus are placed on a “stop list” to be ignored by the computer preparing the concordance.

When learning a language, however, it can be very helpful to prioritize the most common words. Thus, even knowing 1,000 or 2,000 words can dramatically boost one's ability to understand that language and to speak fluently. Because Esperanto has extremely regular word-formation rules, a vocabulary of as few as 600 or 700 Esperanto words can be the equivalent of knowing 2,000 words in German, French, Russian or Spanish, etc.

An interesting article (at https://glanier.wordpress.com/2...arning-greek/) points out that introductory Greek courses often focus on the 310 most frequent words encountered in the New Testament, which enables a student “to read 80% of the NT without using a dictionary.”

If our user @MacGyver were able to generate, say, lists of the most common words (frequency ranks 100 to 1,000) for English, Italian, Russian, Turkish and Esperanto (the five languages at Tatoeba that currently have the most sentences) and compare the lists with the Tatoeba database, we would learn which particular high-frequency words are underrepresented in the Tatoeba database. Then contributors motivated to create sentences could try to focus on sentences using those words. I think this might greatly improve the utility of Tatoeba to language learners using the strategy of first learning the most frequently spoken words.
Thanuir
4 days ago
Quite unrelated, but I had to check what a concordance or compiling one means. Could you add a sentence or two to this effect to Tatoeba? These would be precisely the kind of material that an advanced learner finds useful.

I also added "compile" and "concordance" to my vocabulary.
MacGyver
4 days ago - 4 days ago
I got a few lists of words online and compiled the following files:

A list with the top ~3k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

A list with the top ~10k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

They are organized as follows:
column 1 = the word
column 2 = occurrences of the word in OpenSubtitles.com (the file is ordered by this column)
column 3 = occurrences of the word in Tatoeba (only sentences by native users were considered)
column 4 = indicates how many times the word 'should' appear in Tatoeba
column 5 = shown only for words that have less than 50% of the occurrences they 'should have' in Tatoeba
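
A file laid out like this could be filtered with a few lines of Python. This is a sketch only: it assumes the columns are tab-separated (the thread shows that only for the earlier sample file) and recomputes the 50% rule from columns 3 and 4 rather than relying on column 5. The function name is made up.

```python
def underrepresented(lines, threshold=0.5):
    """Return (word, tatoeba_count, expected) for rows below the threshold.

    Each line is assumed to hold tab-separated columns: word, reference
    count, Tatoeba count, expected Tatoeba count, and an optional marker.
    """
    result = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        try:
            tatoeba_count = int(fields[2])
            expected = float(fields[3])
        except ValueError:
            continue  # skip header rows or non-numeric data
        if expected > 0 and tatoeba_count < threshold * expected:
            result.append((fields[0], tatoeba_count, expected))
    return result
```

Passing an open file object works as well as a list of strings, since both yield one line at a time.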

For example, the words 'indistinct', 'limitation', 'restriction', 'annulment', 'inaudible', 'flare', 'abduction', 'depot', 'decoy', 'deposition', 'cheater', 'retainer', 'hypothetically', 'caress', 'rebound', 'sleepover', 'riddance', 'relive', 'proxy', 'onward', 'visitation', 'envoy', 'reptile', 'viewer', 'proclaim', 'retrieval', 'canvass', 'caterer', 'abduct' and 'withhold' have ZERO occurrences on this site (considering only sentences by native speakers).

I only use OpSub to measure the frequency of words, not as a source of words (there are too many wrong words in there). So, there isn't much I can do in order to generate a good/useful list of most frequent words (in any language).
MacGyver
4 days ago - 4 days ago
* UPDATE *

Now using data from the British National Corpus.

A list with the top 30k words ordered by frequency of occurrence in the BNC corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

You need to look at the second column of numbers (third column from left to right) to find words that have a low number of occurrences in Tatoeba.
Ricardo14
yesterday
MacGyver - Would it be possible to generate a list of Portuguese words that have not yet been "posted" on Tatoeba?
sharptoothed
3 days ago
** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/
CK
3 days ago
With so many non-native submissions last week, the "Count only native contributions" option is worth clicking. https://tatoeba.j-langtools.com/userchart/?nonly=1
deniko
3 days ago
Is it just me, or does the site seem to be down?
sharptoothed
3 days ago
I see no problem.
deniko
3 days ago - 3 days ago
Weird, this is what I'm seeing trying to open it:

https://i.imgur.com/J9Q312c.png

Might be our firewall, of course, but everything else seems to be working fine though.

Obviously, I see the same when I go to the main page:

https://tatoeba.j-langtools.com/allstats/
sharptoothed
3 days ago - 3 days ago
Maybe your ISP is experiencing a connectivity problem. Try running the tracert / traceroute utility from your computer. You should see something like this:
https://2whois.ru/?t=traceroute...-langtools.com
deniko
3 days ago
tracert doesn't seem to work from my computer - probably, again, because of some proxy settings.

It turned out I can open it from my phone just fine, so it does seem like a problem with my proxy server.

Guybrush88
yesterday
Thanks
sharptoothed
yesterday
You're welcome :-)
maaster
14 days ago
I've stopped contributing to Tatoeba because of my colleague mraz, as others have done as well.
If he continues to unlink my Hungarian-Hungarian sentence pairs with the same meaning, I'll systematically delete all my translations and all my sentences.

I wrote about the problem months ago, and nothing happened. Since then, many Hungarian members have given up on Tatoeba.
Pfirsichbaeumchen
13 days ago
Sent a private message.
mraz
13 days ago - 13 days ago
Pandaa
12 days ago - 12 days ago
Pandaa
13 days ago
May I ask what this squabble between you two is about?
maaster
5 days ago - 4 days ago
This isn't just between the two of us. It's just that the others, apparently fed up with the whole thing, quietly left on the principle that "the wise one gives way"; I chose the role of the donkey.
Thanuir
13 days ago
It would be a great loss to the database if you deleted your sentences. I hope you two reach some kind of agreement, or, if you do end up leaving, closing your account and turning off the email notifications would be enough.

...

I would suggest a truce, meaning that you neither link nor unlink each other's sentences. I would assume the problems stem from differing interpretations of what linking means. Hopefully you have already tried to discuss the matter. Perhaps someone else could act as a mediator?
AlanF_US
13 days ago
Pfirsichbaeumchen said that she sent a private message, so I assume she's dealing with the situation. I hope it can be resolved to everyone's satisfaction.
Objectivesea
4 days ago
I echo the comment of AlanF_US. Sometimes two or more really bright people can accidentally "rub each other the wrong way," like bamboo stems rubbing against each other in the forest, and give rise to an unintended fire. Let's all do our best to reduce friction and also try to help make Tatoeba continue to grow as an innovative help for language learners all over the world. The contributions of many can overcome the limitations of a few. Please let's not allow a temporary irritation with one or two contributors to reduce the great utility of the overall project. Working together, I know that we can make Tatoeba better and better.
jegaevi
4 days ago
Please don't delete your sentences! It would be such a shame to lose them! This passive aggression leads nowhere. Couldn't this be resolved somehow? I would be very sorry if you left Tatoeba.
maaster
4 days ago - 4 days ago
I delete the ones that get unlinked, because one member of each pair is a hard-to-understand sentence written in an uncommon style, so that people who may be interested in Hungarian can get to know such expressions too (translations of "Tom lives in Boston"-type sentences don't make that possible), and it is paired with an easily digestible, explanatory sentence. Once they are unlinked, the whole thing loses its point.
The truth is, I'm sorry about them too, but unfortunately formality, not thinking, has won out.
That is roughly what I would have expected from others as well: e.g., if someone writes something like a saying, they explain it, so that using Tatoeba makes sense.
Because this way you have to spend more time searching with Google than you spend using Tatoeba.
(A solution could be chemotherapy, but I'm afraid it's too late.)

It's hard to quit, the way it must be for a smoker to give up cigarettes. I've already tried twice, but I keep relapsing. I don't see the point in carrying on; it just eats up my time.
Pandaa
4 days ago - 4 days ago
I don't understand why they had to be unlinked.
After all, I even gave examples showing that such pairs exist even in C*'s sentence collection.
#7472853
#6226100

* x, y, z... etc.
CK
4 days ago - 4 days ago
English Vocabulary Study (With Links to Tatoeba.org)

http://tatoeba.byethost3.com/vocab/

This is something I put together in October of 2016.

Older members may not remember this and new members may not have seen it yet.

Ricardo14
5 days ago
Would you guys like to have a group on Telegram?
Telegram is really great and easy to use. Besides, it prevents other users from seeing your phone number.
hide replies
odexed
4 days ago
Good idea, you could create it and put a link here.
CK
5 days ago - 5 days ago
There are over 18,000 English sentences with audio that have no translations.

Sort: Last Created
https://tatoeba.org/eng/sentenc...e&sort=created

Sort: Random
https://tatoeba.org/eng/sentenc...de&sort=random

Perhaps you would enjoy translating some of these into your own native language.

18,529 out of 433,558 (4.27%) had no translations on July 14, 2019 at 9:00 UTC.

If you want to see the sentences that have the most-recently uploaded English audio files, then you can browse my list at http://tatoeba.org/eng/sentence...direction=desc . The newest audio files are at the top.
CK
6 days ago
A Selection of English Sentences with 20 or More Alternative Translations in One Language

http://tatoeba.byethost3.com/al...019-07-13.html

I just did this for fun.
mmorelfmc
9 days ago - 9 days ago
Here is a situation where I would need the more experienced users to propose how to resolve it.

The problem is with Sentence #13214:

< Bin, aimes-tu le baseball ? >

The author used "Bin" which is an expression from Québec:

http://www.je-parle-quebecois.c...n/ben-bin.html

(Also, the pronunciation of "in" in French doesn't really have an equivalent in English or Esperanto [I don't know about other languages]. It is not "bin" as in "storage bin". See
https://french.stackexchange.co...-a-back-vowel)

Simply said, it means in English, "Well", like in:

< Well, do you like baseball? >

or in Esperanto, "Nu", like in:

< Nu, ĉu vi ŝatas la basbalon? >

or in more international French, "Alors", like in:

< Alors, aimes-tu le baseball ? >

However, everyone who translated the original sentence assumed that "Bin" was the name of a person. This resulted in some very silly sentences like:

Do you like baseball, Bin?
Bin, houd je van honkbal?
¿Te gusta el béisbol, Bin?
Ĉu vi ŝatas la basbalon, Bin?

and with the Galician sentence, "Bin" was even replaced by "Bill".

How do we go about fixing this mass hallucination?

soweli_Elepanto
9 days ago
Personally, I think there is no need to do anything about it. Those "wrong" translations may indeed be as good as the "right" ones. Your case is not unique. For example, the name "Tom", when translated _from_ Russian, can be understood as "volume", and "Tom's" can be understood not only as "of volume", but as "volumes" as well.
mmorelfmc
9 days ago
What happens with cases like the "Tom-confused-as-volume"? If an original English sentence is about Tom but the Russian translation makes it about volume, then the translation is false. Is there a mechanism (perhaps a tag? a marker?) to indicate that the meaning of a translated sentence differs significantly from the original sentence?
Impersonator
8 days ago
> If an original English sentence
> is about Tom but the Russian translation
> makes it about volume, then
> the translation is false.

Russian would be about *both* Tom and volume. The word 'tom' means 'volume' in Russian, and if 'volume' is placed at the beginning of the sentence, then it will be written with a capital letter.

If Russian is only about a volume, then it's a mistake and it should be unlinked from the translation.
Aiji
8 days ago
Considering "Bin" as a name or as the equivalent to "Ben" would both give correct translations. Of course, one could argue that "Bin" is a silly name, etc. but basically I don't think there is anything wrong (as long as the misunderstanding is not systematic)

We can also use tags, for example we have a "français du Canada" tag. Although nobody would read it when translating because it wouldn't be displayed on a results list.

And finally, one can leave a comment to say that Bin = Ben.
Objectivesea
8 days ago
These sorts of misapprehensions can often be prevented by supplying a usage note in a comment with an unusual word's actual meaning when creating the original sentence.
mmorelfmc
7 days ago
My friend, you speak the truth.
Thanuir
7 days ago
Contribute the translations you deem as correct. A sentence can mean several different things.
deniko
8 days ago
Tag auto-completion.

A while ago I made a mistake and tagged a sentence with the tag "matheamatics" instead of "mathematics". I immediately removed that tag, and added the correct one, but now every time I start typing "math" to add a tag, "matheamatics" is on the list:

https://i.imgur.com/g7Zkwq8.png

There are currently no sentences for the tag "matheamatics".

https://tatoeba.org/eng/tags/sh..._tag/10425/eng

Can auto-completion be modified not to suggest tags with zero sentences?
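
The requested behaviour amounts to checking the sentence count before suggesting a tag. A minimal sketch of that idea, not Tatoeba's actual code: the `autocomplete` function is hypothetical, and it assumes the site can look up how many sentences each tag has.

```python
def autocomplete(prefix, tag_counts, limit=10):
    """Suggest tag names starting with prefix, skipping unused tags.

    tag_counts maps each tag name to its number of tagged sentences.
    Suggestions are ordered by sentence count, then alphabetically.
    """
    prefix = prefix.lower()
    matches = [
        (name, count)
        for name, count in tag_counts.items()
        if count > 0 and name.lower().startswith(prefix)
    ]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in matches[:limit]]
```

With a tag table in which "matheamatics" has zero sentences, typing "math" would no longer bring up the misspelled tag.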
Guybrush88
8 days ago