
Wall (5459 threads)

gillux
29 days ago
Recently I’ve been working on getting rid of the 100-sentence limit for list downloads. Currently, only lists with fewer than 100 sentences can be downloaded as CSV.

On our dev website (https://dev.tatoeba.org/fra/sentences_lists/index), it’s possible to download any list. Click on any list, then click the "download this list" button at the bottom right of the page. While the download page looks the same, under the hood it works differently, and the download takes a bit more time to start up.

Feedback is welcome.
CK
29 days ago - 29 days ago
I was able to easily download a 500+ sentence list of German sentences with sentence numbers and English translations.

It took less than 2 minutes.

I also tried downloading List 907 with sentence numbers and Japanese matches.

This took about 25 minutes and resulted in a 128,557-line file (fewer than 20% of the sentences in this list of over 710,000 had Japanese translations).

I also downloaded List 4000 (sentences with my audio files) with no sentence numbers and no translations. It took less than a minute for this list of 398,631 lines.

gillux
29 days ago
I see, it looks like selecting a translation dramatically slows down the generation of the CSV file. I’ll try to optimize this.
AlanF_US
28 days ago
I tried downloading List 907 without translations, and it took less than a minute.
CK
29 days ago
** Random English Sentence with Audio **

http://www.manythings.org/tatoeba/eng/randomaudio/

Click "next random sentence" to get one of over 19,000 English sentences with audio. These were all the sentences with audio that had no translations on 2019-07-20.

Aiji
2019-07-20 23:23 - 2019-07-20 23:25
UX-ly speaking (yes, I'm a creative word-maker), I think it would be a good idea to add a question-mark pop-up to the right of the "Order" select box in the advanced search to explain what this mysterious "Relevance" is about. Right now, that's pretty much 100% unclear.
AlanF_US
2019-07-21 01:36
I agree.
TRANG
2019-07-23 18:11
CK
2019-07-21 05:41
** Stats - 2019-07-20 - The Number of English Sentences on List 907 That Have Translations by Native Speakers **

http://tatoeba.byethost3.com/stats-190720.html

The last time I generated these stats was in 2017.
CK
2019-07-17 08:44 - 2019-07-17 08:46
I'm looking for volunteers to help me tag English sentences.

I have a large list of sentences that need both the "imperative" tag and the "present simple" tag.

If you think it might be interesting to tag such sentences, and perhaps translate them at the same time, please send me a private message, and I'll send you a list of clickable links to sentences.

https://tatoeba.org/eng/private_messages/write/CK


Note that you need to be an "advanced contributor" (or higher) to tag sentences.
https://en.wiki.tatoeba.org/art...d-contributors
Smoky
2019-07-18 10:47
Do you provide free snacks and drinks to volunteers?
soliloquist
2019-07-18 14:02
I thought you were using some bot/script for the 'List 907' tag. It requires a lot of effort to tag that many sentences one by one. I, too, have thousands of sentences that need to be tagged, but it's discouraging having to visit each sentence's page.

Let's hope the mass-tagging feature will be implemented in the future.

https://github.com/Tatoeba/tatoeba2/issues/785
Guybrush88
2019-07-18 16:50
I do agree with this, it would be a very useful feature also for me, since from time to time I tag Italian sentences.
soliloquist
2019-07-18 18:29
Yes, that would be a great time-saver for you, too. You're the second most active user after CK in terms of tagging. (+182,756)

https://tatoeba.j-langtools.com...art/chart2.php
Ricardo14
2019-07-18 20:30
Me too. I do want to tag many sentences.

For example, whenever a sentence begins with "Eu" in Portuguese, it would be tagged as "1st Person Singular".
Austrália, França, Canadá - country
amo, quero, vivo, estou - presente do indicativo

and so on....
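
A rule set like the one described above could be sketched as follows. This is only an illustration that produces tag suggestions locally: the `RULES` table and the `suggest_tags` function are hypothetical names, nothing here talks to Tatoeba, and the word lists are just the examples given in this post.

```python
import re

# Each rule pairs a pattern with the tag it suggests.
# The patterns and word lists are illustrative assumptions taken from the examples above.
RULES = [
    (re.compile(r"^Eu\b"), "1st Person Singular"),
    (re.compile(r"\b(Austrália|França|Canadá)\b"), "country"),
    (re.compile(r"\b(amo|quero|vivo|estou)\b", re.IGNORECASE), "presente do indicativo"),
]

def suggest_tags(sentence):
    """Return the tags whose pattern matches the sentence, in rule order."""
    return [tag for pattern, tag in RULES if pattern.search(sentence)]

print(suggest_tags("Eu vivo no Canadá."))
# ['1st Person Singular', 'country', 'presente do indicativo']
```

Plain word matching like this will produce false positives (for instance, "vivo" can also be an adjective), so the output is best treated as a list of candidates for a human to review.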
Ricardo14
2019-07-18 20:35
I have updated the request on GitHub - https://github.com/Tatoeba/tato...ment-512976729
soliloquist
2019-07-18 20:50 - 2019-07-18 21:42
Thanks!

Edit: I made a suggestion.

https://github.com/Tatoeba/tato...ment-512998686
MacGyver
2019-07-14 17:37 - 2019-07-15 07:27
If you're looking for words to write example sentences for Tatoeba, then you should look at the arrows '<-- increase'. They indicate the words that should appear more in the Tatoeba Corpus. The second column of numbers indicates how many times that word should appear in Tatoeba in order to have the same frequency (proportionally) it has in the OpenSubtitles.com corpus.

I downloaded a file from http://opus.nlpl.eu/OpenSubtitles-v2018.php with 441.5M sentences in English and wrote a script to create a frequency list of words (the list is unformatted, i.e., don't -> don + t, etc.). I did the same thing with all the English sentences owned by native speakers here in the Tatoeba corpus. After comparing the two frequency lists, I compiled a file that gives you an idea of how the frequency of the words in Tatoeba/English compares to that of the OpenSubtitles/English corpus:

A sample of the file:
word \t occurrences in Tatoeba \t how many there should be (proportionally)
tom 359494 723.9471614662318
i 303688 284673.5980557653
to 294034 174335.06447556426
that 234251 105369.07199955775
the 199956 231733.21163537377 <-- increase
t 197499 101493.03356767741
you 176213 305392.1646249775 <-- increase
mary 142565 625.8323390462842
a 130684 149839.41303818647 <-- increase
do 120631 46119.76310532704
he 106705 57203.45769471371
is 106175 74813.6430558621
and 89045 104790.98523257893 <-- increase
was 76168 43791.06207458618
s 75846 152917.0533331018 <-- increase
in 73907 75330.66885961362 <-- increase
it 70138 140934.43701484663 <-- increase
of 68495 90226.5530729227 <-- increase
she 64731 27836.557998979915
be 60860 43264.01316312008
they 59631 32147.915102217754
me 56818 68010.83329256669 <-- increase
have 53539 48189.83034799129
said 53431 8299.380929768524
we 49383 72168.9096779763 <-- increase
know 49071 41497.24193626721
don 48290 43850.854478915724
for 47362 52364.353226404884 <-- increase
what 46221 73483.70680135512 <-- increase
didn 44793 11179.517051257248
think 42518 19066.90288432038
this 41598 60412.53263081548 <-- increase
are 37926 42973.06544886179 <-- increase
can 37084 39846.78412853941 <-- increase
with 36257 38338.81981074213 <-- increase
his 35673 17064.111051060263
on 35404 51944.97959418119 <-- increase
not 34972 43740.29882476427 <-- increase
her 33423 21676.844915118265
m 32329 48312.26568311476 <-- increase
my 31459 51048.86660395457 <-- increase
want 29463 20022.39637034032
told 28727 5296.513973554256
like 27819 30102.40119012533 <-- increase
at 27399 24364.425735303917
did 27199 17909.338412841887
has 25722 10076.924460296785
going 25540 15317.881523817805
as 24937 17656.581783517522
go 23405 28750.83973746701 <-- increase


the file: https://github.com/sidfc/Langua...ng_ordered.txt

There are also files for: deu, fra, spa, por, ita, pol, rus
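
The comparison described above can be sketched roughly like this. It is a minimal illustration, not the actual script: the tokenizer (splitting on anything that is not a letter, so that "don't" becomes "don" + "t"), the function names, and the tiny in-memory corpora are all assumptions. The expected count is the word's count in the reference corpus scaled by the ratio of the two corpus sizes.

```python
import re
from collections import Counter

def word_counts(sentences):
    """Count lowercase word tokens; splitting on non-letters turns "don't" into "don" + "t"."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[^\W\d_]+", sentence.lower()))
    return counts

def compare(tatoeba_counts, reference_counts):
    """Return (word, actual, expected, flag) rows, most frequent Tatoeba words first."""
    tatoeba_total = sum(tatoeba_counts.values())
    reference_total = sum(reference_counts.values())
    rows = []
    for word, actual in tatoeba_counts.most_common():
        # Proportional share: reference count scaled to the size of the Tatoeba corpus.
        expected = reference_counts.get(word, 0) * tatoeba_total / reference_total
        flag = "<-- increase" if actual < expected else ""
        rows.append((word, actual, expected, flag))
    return rows

# Toy stand-ins for the Tatoeba and OpenSubtitles sentence files (made-up data).
tatoeba = word_counts(["Tom said that.", "Tom didn't go."])
reference = word_counts(["You said the word.", "The dog didn't go.", "You go."])

for word, actual, expected, flag in compare(tatoeba, reference):
    print(f"{word}\t{actual}\t{expected}\t{flag}".rstrip())
```

With real data, the two `word_counts` calls would read the sentence files line by line instead of taking short lists.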
Thanuir
2019-07-15 07:39
These are all very common words with huge numbers of sentences. Is there a particular reason for actively adding many more to get the same frequencies as the OpenSubtitles database?
CK
2019-07-15 07:58
I wondered the same thing.

Here are the first 100 words with "<-- increase".

the, you, a, and, in, it, of, me, we, for, what, this, are, can, with, on, not, my, like, go, him, your, there, if, about, here, all, one, get, out, up, from, good, just, but, no, them, an, so, let, now, more, say, got, where, see, come, back, some, too, something, take, people, right, make, our, way, or, well, into, please, look, give, over, off, find, new, must, little, other, put, first, after, down, love, old, years, things, night, am, even, believe, man, two, life, away, being, nothing, came, wrong, these, father, understand, feel, looking, wait, stop, because, thing, call

Likely it would be more useful to find words that are high on word frequency lists that are missing from the Tatoeba Corpus. Perhaps you could generate such lists, putting the words in frequency order.


Objectivesea
2019-07-15 09:12
CK wrote: “Likely it would be more useful to find words that are high on word frequency lists that are missing from the Tatoeba Corpus. Perhaps you could generate such lists, putting the words in frequency order.”

I strongly agree with this suggestion. There are various frequency dictionaries for individual languages published by Routledge or by the Leipziger Universitätsverlag. Typically, these dictionaries list a large number of words (with definitions to prevent confusion with respect to homographs that have distinct meanings) — 5,000 or 10,000 or so. It would be nice to find words on the Routledge or Leipzig lists that are not yet found in the Tatoeba database for that particular language. These missing words could then be arranged in frequency order based on one of these published frequency dictionaries.

As CK notes, the order of the first 100 words or so is not very significant; the frequency depends to a great degree on the particular database selected — whether words are taken from newspaper text, from fiction works, from scientific papers, or transcribed from oral conversations. Indeed, when compiling concordances to works like the Bible or Shakespeare, etc., the most frequent 100 or so words in that corpus are placed on a “stop list” to be ignored by the computer preparing the concordance.

When learning a language, however, it can be very helpful to prioritize the most common words. Thus, even knowing 1,000 or 2,000 words can dramatically boost one's ability to understand that language and to speak fluently. Because Esperanto has extremely regular word-formation rules, a vocabulary of as few as 600 or 700 Esperanto words can be the equivalent of knowing 2,000 words in German, French, Russian or Spanish, etc.

An interesting article (at https://glanier.wordpress.com/2...arning-greek/) points out that introductory Greek courses often focus on the 310 most frequent words encountered in the New Testament, which enables a student “to read 80% of the NT without using a dictionary.”

If our user @MacGyver were able to generate, say, lists of the most common words (frequency order from 100 to 1,000) for English, Italian, Russian, Turkish and Esperanto (the five languages at Tatoeba which currently have the most sentences each) and compare the lists with the Tatoeba database, we would learn which particular high-frequency words are underrepresented in the Tatoeba database. Then contributors motivated to create sentences could try to focus on sentences utilizing those words. I think this might greatly improve the utility of Tatoeba to language learners using the strategy of first learning the most frequently spoken words.
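
The kind of coverage figure quoted in this post (a few hundred words accounting for 80% of a text) can be computed from any frequency list. The sketch below uses made-up counts; `coverage` and `words_needed` are hypothetical helper names.

```python
def coverage(counts, top_n):
    """Fraction of all word occurrences accounted for by the top_n most frequent words."""
    total = sum(counts.values())
    top = sorted(counts.values(), reverse=True)[:top_n]
    return sum(top) / total

def words_needed(counts, target):
    """Smallest number of top-frequency words whose combined share reaches the target."""
    total = sum(counts.values())
    running = 0
    for n, count in enumerate(sorted(counts.values(), reverse=True), start=1):
        running += count
        if running / total >= target:
            return n
    return len(counts)

# Made-up frequency list, for illustration only.
counts = {"the": 50, "and": 30, "of": 15, "reptile": 5}
print(coverage(counts, 2))        # 0.8
print(words_needed(counts, 0.8))  # 2
```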
Thanuir
2019-07-15 11:25
Quite unrelated, but I had to check what a concordance or compiling one means. Could you add a sentence or two to this effect to Tatoeba? These would be precisely the kind of material that an advanced learner finds useful.

I also added "compile" and "concordance" to my vocabulary.
MacGyver
2019-07-15 17:10 - 2019-07-15 17:31
I got a few lists of words online and compiled the following files:

A list with the top ~3k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

A list with the top ~10k words ordered by frequency of occurrence in the OpSub corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

They are organized as follows:
column 1 = the word
column 2 = occurrences of the word in OpenSubtitles.com (the file is ordered by this column)
column 3 = occurrences of the word in Tatoeba (only sentences by native users were considered)
column 4 = indicates how many times the word 'should' appear in Tatoeba
column 5 = shown only for words that have fewer than 50% of the occurrences they 'should have' in Tatoeba

As an example, the words 'indistinct', 'limitation', 'restriction', 'annulment', 'inaudible', 'flare', 'abduction', 'depot', 'decoy', 'deposition', 'cheater', 'retainer', 'hypothetically', 'caress', 'rebound', 'sleepover', 'riddance', 'relive', 'proxy', 'onward', 'visitation', 'envoy', 'reptile', 'viewer', 'proclaim', 'retrieval', 'canvass', 'caterer', 'abduct', 'withhold', have ZERO occurrences in this site (considering only sentences by natives).

I only use OpSub to measure the frequency of words, not as a source of words (there are too many wrong words in there). So, there isn't much I can do in order to generate a good/useful list of most frequent words (in any language).
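
As a rough sketch of how the 50% marker and the zero-occurrence words could be derived from two frequency lists: the function name, the marker strings "MISSING" and "LOW", and the toy numbers below are assumptions for illustration, and the real files may be produced differently.

```python
def underrepresented(reference_counts, tatoeba_counts, threshold=0.5):
    """List reference words in frequency order with their Tatoeba count and a marker.

    "MISSING" means zero occurrences in Tatoeba; "LOW" means fewer than
    threshold * expected occurrences. Both marker names are made up for this sketch.
    """
    reference_total = sum(reference_counts.values())
    tatoeba_total = sum(tatoeba_counts.values())
    rows = []
    for word, ref_n in sorted(reference_counts.items(), key=lambda kv: -kv[1]):
        actual = tatoeba_counts.get(word, 0)
        expected = ref_n * tatoeba_total / reference_total
        if actual == 0:
            marker = "MISSING"
        elif actual < threshold * expected:
            marker = "LOW"
        else:
            marker = ""
        rows.append((word, ref_n, actual, expected, marker))
    return rows

# Toy counts standing in for the reference corpus and the Tatoeba corpus.
reference = {"the": 50, "reptile": 30, "go": 20}
tatoeba = {"the": 1, "go": 9}

for row in underrepresented(reference, tatoeba):
    print("\t".join(str(field) for field in row).rstrip())
```

Filtering the rows for the "MISSING" marker gives, in frequency order, the words that do not occur in the Tatoeba corpus at all.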
MacGyver
2019-07-15 20:40 - 2019-07-15 20:55
* UPDATE *

Now using data from the British National Corpus.

A list with the top 30k words ordered by frequency of occurrence in the BNC corpus: https://github.com/sidfc/Langua...atoeba_v01.txt

You need to look at the second column of numbers (third column from left to right) to find words that have a low number of occurrences in Tatoeba.
Ricardo14
2019-07-18 20:38
MacGyver - Would it be possible to generate a list of Portuguese words that have not yet been "posted" on Tatoeba?
sharptoothed
2019-07-16 06:55
** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/
deniko
2019-07-16 08:05
Is it just me, or does the site seem to be down?
sharptoothed
2019-07-16 08:26
I see no problem.
deniko
2019-07-16 08:28 - 2019-07-16 08:29
Weird, this is what I'm seeing trying to open it:

https://i.imgur.com/J9Q312c.png

Might be our firewall, of course, but everything else seems to be working fine though.

Obviously, I see the same when I go to the main page:

https://tatoeba.j-langtools.com/allstats/
sharptoothed
2019-07-16 08:35 - 2019-07-16 08:35
Maybe your ISP is experiencing a connectivity problem. Try running the tracert / traceroute utility from your computer. You should see something like this:
https://2whois.ru/?t=traceroute...-langtools.com
deniko
2019-07-16 08:40
tracert doesn't seem to work from my computer - probably, again, because of some proxy settings.

It turned out I can open it from my phone just fine, so it does seem like a problem with my proxy server.

Guybrush88
2019-07-18 16:50
Thanks
sharptoothed
2019-07-18 18:22
You're welcome :-)
maaster
2019-07-05 20:40
I've stopped contributing to Tatoeba because of fellow member mraz, as others have done as well.
If he continues to unlink my Hungarian-Hungarian sentence pairs with the same meaning, I'll systematically delete all my translations and all my sentences.

I wrote about the problem months ago, and nothing happened. Since then, many Hungarian members have given up on Tatoeba.
Pfirsichbaeumchen
2019-07-05 23:27
Sent a private message.
mraz
2019-07-06 06:28 - 2019-07-06 06:36
Pandaa
2019-07-07 13:23 - 2019-07-07 13:27
Pandaa
2019-07-06 08:14
May I ask what this squabble between you is about?
maaster
2019-07-14 19:42 - 2019-07-15 19:29
This isn't just between the two of us. It's just that the others, apparently fed up with the whole thing, quietly left on the principle that "the wise one yields" - I chose the role of the donkey.
Thanuir
2019-07-06 15:29
It would be a great loss for the database if you deleted your sentences. Hopefully you will reach some kind of agreement; or, if you do end up leaving, closing your account and turning off email notifications is enough.

...

I would suggest a truce, meaning that you neither link nor unlink each other's sentences. I assume the problems stem from differing interpretations of what linking means. Hopefully you have already tried to discuss the matter. Perhaps someone else could act as a mediator?
AlanF_US
2019-07-06 16:16
Pfirsichbaeumchen said that she sent a private message, so I assume she's dealing with the situation. I hope it can be resolved to everyone's satisfaction.
Objectivesea
2019-07-15 22:33
I echo the comment of AlanF_US. Sometimes two or more really bright people can accidentally "rub each other the wrong way," like bamboo stems rubbing against each other in the forest, and give rise to an unintended fire. Let's all do our best to reduce friction and also try to help make Tatoeba continue to grow as an innovative help for language learners all over the world. The contributions of many can overcome the limitations of a few. Please let's not allow a temporary irritation with one or two contributors to reduce the great utility of the overall project. Working together, I know that we can make Tatoeba better and better.
jegaevi
2019-07-15 18:26
Please don't delete your sentences! It would be such a pity to lose them! This passive aggression leads nowhere. Couldn't this matter be resolved somehow? I would be very sorry if you left Tatoeba.
maaster
2019-07-15 19:11 - 2019-07-15 19:39
I delete the ones that get unlinked. In each pair, one member is a hard-to-understand sentence written in an uncommon turn of phrase, so that people who may be interested in Hungarian can get to know such expressions too (translations of "Tom lives in Boston"-type sentences don't make that possible), and alongside it there is an easily digestible, explanatory sentence. Once they are unlinked, the whole thing loses its point.
The truth is, I'm sorry to lose them too, but unfortunately formality, not thinking, has prevailed.
This is roughly what I would have expected from others too: e.g., if someone writes some kind of idiom, they explain it, so that using T. makes sense.
Because as it is, you have to search with Google more than you use T.
(A solution could be chemotherapy, but I'm afraid it's too late.)

It's hard to quit, as it must be for a smoker to give up cigarettes. I've already tried twice, but I keep relapsing. I don't see the point in carrying on; it just eats up my time.
Pandaa
2019-07-15 19:29 - 2019-07-15 19:30
I don't understand why they had to be unlinked.
After all, I even gave an example showing that such pairs exist even in C*'s collection of sentences.
#7472853
#6226100

* x, y, z... etc.
CK
2019-07-15 11:25 - 2019-07-15 11:31
English Vocabulary Study (With Links to Tatoeba.org)

http://tatoeba.byethost3.com/vocab/

This is something I put together in October of 2016.

Older members may not remember this and new members may not have seen it yet.

Ricardo14
2019-07-14 15:31
Would you guys like to have a group on Telegram?
Telegram is really great and easy to use. Besides, it prevents other users from seeing your phone number.
odexed
2019-07-15 01:35
Good idea, you could create it and put a link here.