Wall (6,249 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
I need to let you all know that my heart fills with an overwhelming amount of joy every time I see a new translation of the stuff I've submitted. It creates this child-like excitement within me because it allows me to imagine communicating with a patient in their native language!
So, thank you for helping me and everyone else in healthcare who will benefit from your selfless contributions!
And, my most humble gratitude is for all of you who have found my errors and made suggestions for corrections.
I appreciate you all so much!
Your medical sentences are really useful for people all over the world.
Thank you, too, for submitting that. :)
Thank you for writing sentences with interesting specialist vocabulary.
** Audio by CK **
I reached the 555,555 mark for audio files that I've added to English sentences at 04:25 UTC on 2021-03-20.
You can find the sentences with the most-recently added audio files at the top of this list.
30,705 (5.5%) of these sentences do not yet have translations into any language.
Perhaps you would like to be the first one to translate some of these.
Sort: Last Created
Also, here is a page that I created a while back that some newer members may not yet know about, and some older members may have forgotten.
Translate English Sentences with Audio Using Pre-determined Searches
@CK Congratulations! Thank you for your titanic work!
+1 You make English closer to all the people!
Tatominer (a weekly ranking of the most underused expressions on Tatoeba) has been moved to https://tatominer.netlify.app
Thanks for letting us know. I believe that you determine these words by looking at how many times they're specified in a search, right? Could you give us a general idea of that range of frequencies? Are searches for many of these words performed more than once? Also, am I right in thinking that you explicitly filter out bad searches (misspellings, mismatched languages)?
I don't have all these frequencies at hand but I can tell you for example that regarding English, the expressions currently listed have been searched at least ten times over a period of one year. For French, it is at least twice.
Of course, in order to be usable, the content of the search log must be cleaned beforehand using spell checkers and dictionaries. I also try to remove queries that seem to be generated by bots.
I added sentences for the English words that had no occurrences (though I know that they won't come off the list until the next time it's updated).
Thank you for improving the coverage of English vocabulary.
I hope that one day the Tatominer features will be integrated into the Tatoeba site, which would allow the lists of underused expressions to be updated in real time.
bumping, but also asking where you got the search data from, since i can't find it on the downloads page or really anywhere
does an admin send the files to you specifically or
The query data from April 2020 is available here: https://downloads.tatoeba.org/s...ueries.csv.bz2 (Directory listings are enabled on downloads.tatoeba.org, so if you want to find some data that's not explicitly advertised, you can just go explore for a bit.)
But I think lbdx also has some privileged access to get more up-to-date search logs.
Queries made 1,000 or more times (1,356).
The first number is the count, followed by the language code, followed by the query.
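For anyone who wants to load such a list, here is a minimal sketch of a parser for the "count, language code, query" layout described above. The tab separator, the `QueryCount` name, and the sample line are my assumptions for illustration; the real file may use a different delimiter.

```python
from typing import NamedTuple


class QueryCount(NamedTuple):
    """One line of the list: how often a query was made, in which language."""
    count: int
    lang: str
    query: str


def parse_line(line: str, sep: str = "\t") -> QueryCount:
    # Assumption: fields are separated by `sep` (tab here, not confirmed).
    # Split at most twice so that separators inside the query itself survive.
    count, lang, query = line.rstrip("\n").split(sep, 2)
    return QueryCount(int(count), lang, query)


# Made-up sample line, not taken from the real data.
row = parse_line("1500\teng\thello world\n")
print(row.count, row.lang, row.query)
```

If the real separator is a comma or a space, passing `sep=","` or `sep=" "` should be enough, since the query is always the last field.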
Does continuing to the next page of search results count as another query?
Here is a zip file of all the queries made 2 or more times.
http://tatoeba.ueuo.com/queries...2021-03-11.zip (13.1 MB)
(There were slightly over 100 done only once.)
Tatominer is based on the public dump file mentioned by Yorwba. I don't have any privileged access to the Tatoeba database.
gillux generously generated it at my request about a year ago. By the way, as this file contains annual data, it would be nice to update it in the next few weeks.
The reason I thought you had access to more up-to-date search data is this notice in the sidebar:
> Expressions which appear for the first time in the " list are shown in bold type.
(stray quote, by the way)
And there are new bolded entries every week. I thought it might just be because new sentences were added, but the current list for Mandarin Chinese to German https://tatominer.netlify.app/cmn-deu.html contains 以免 in bold, and most of the search results are sentences by users who are no longer active: https://tatoeba.org/cmn/sentenc...ns_link=direct
When a word listed by Tatominer finally has at least 3 example sentences, it is removed from the ranking and replaced by the next most searched word in the same language that has fewer than 3 sentences. This incoming phrase appears in bold during its first week in the ranking.
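A rough sketch of that rule, as I read it from the description above. This is not Tatominer's actual code; the function names, the dictionaries, and the sample numbers are hypothetical.

```python
def build_ranking(search_counts, sentence_counts, size, min_sentences=3):
    """Most-searched words that still have fewer than `min_sentences` sentences."""
    candidates = [
        word for word in search_counts
        if sentence_counts.get(word, 0) < min_sentences
    ]
    # Most searched first; ties broken alphabetically to keep the order stable.
    candidates.sort(key=lambda word: (-search_counts[word], word))
    return candidates[:size]


def newcomers(current, previous):
    """Words that were not in last week's ranking (the ones shown in bold)."""
    seen = set(previous)
    return [word for word in current if word not in seen]


# Made-up numbers for illustration only.
searches = {"a": 10, "b": 8, "c": 5}
sentences = {"a": 3, "b": 1}  # "a" now has 3 sentences, so it drops out
this_week = build_ranking(searches, sentences, size=2)
print(this_week, newcomers(this_week, ["a", "b"]))
```

With these numbers, "a" leaves the ranking, "c" moves up to fill the slot, and "c" is the only newcomer.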
Well, the current ranking https://tatominer.netlify.app/cmn-deu.html includes 保障 in bold, and there are 5 Chinese sentences containing it: https://tatoeba.org/cmn/sentenc...9%9A%9C%22&to=
#9839395 is new, but #397099, #398923, #3383957 and #472968 were added in 2010, 2010, 2014 and 2010, respectively, and the last one had a German translation added by me in 2020.
So I think this word should've appeared in an earlier ranking and should have its translation count listed as 1, not 0.
For a word to be in the list of words to be translated, it must appear in at least three Tatoeba sentences.
According to Jieba (the word segmenter I use for Chinese), the sentences you mention are segmented as follows:
['平等', '是', '由', '宪法', '保障', '的']
['社会保障', '吗', '他们', '在', '说', '什么', '笑话']
['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度']
['思想', '自由', '為', '憲法', '所', '保障']
Since the third phrase containing 保障 was only added last week, this word is only entering the rankings now, even though it is more searched for than some words that were already in the list.
In summary, the differences in the counts are due to the use of different segmenters. For Chinese, I have the feeling that my tool is more powerful than Tatoeba's but the differences are so significant that I gave up taking into account words composed of only one ideogram. As a contributor interested in the Chinese corpus, do you think it is the right decision?
Although it is a bit confusing that Tatominer's criteria differ from Tatoeba's search engine, I think it makes sense to be a bit more strict about what counts as a match for Tatominer to improve translation coverage.
But in the case of 社会保障, I think that it should count as an instance of 保障 even if it's as part of a larger noun phrase. Your examples match jieba.cut, e.g.
list(jieba.cut('德国在1880年代采取了一种社会保障制度。')) == ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度', '。']
but there's also jieba.cut_for_search, e.g.
list(jieba.cut_for_search('德国在1880年代采取了一种社会保障制度。')) == ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会', '保障', '社会保障', '制度', '。']
where 社会保障 appears both as one word and split into 社会 and 保障. I think that would be more appropriate to use in this case.
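To make the difference concrete, here is a small sketch that counts sentences containing 保障 as a whole token. It uses only the token lists quoted in this thread (hard-coded, so it runs without jieba installed); the function name is mine.

```python
# Token lists as quoted above for jieba.cut (four older sentences).
cut_corpus = [
    ['平等', '是', '由', '宪法', '保障', '的'],
    ['社会保障', '吗', '他们', '在', '说', '什么', '笑话'],
    ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度'],
    ['思想', '自由', '為', '憲法', '所', '保障'],
]

# The same third sentence as quoted above for jieba.cut_for_search.
search_tokens = ['德国', '在', '1880', '年代', '采取', '了', '一种',
                 '社会', '保障', '社会保障', '制度', '。']


def count_sentences(corpus, word):
    """Number of tokenized sentences in which `word` occurs as its own token."""
    return sum(1 for tokens in corpus if word in tokens)


print(count_sentences(cut_corpus, '保障'))  # only the first and last sentence match
print('保障' in cut_corpus[2])              # False: hidden inside 社会保障
print('保障' in search_tokens)              # True: cut_for_search also emits the parts
```

So with jieba.cut only two of the four older sentences count, which is why the word stayed under the threshold of three until the new sentence arrived.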
Regarding single-character words, there are quite a few characters that occur almost always on their own, so completely ignoring them wouldn't be great. If the issue is that people are searching for single characters that mostly occur in compound words, then I think treating those as matches wouldn't be too bad, since usually a compound word's meaning is derived from its components.
Thanks for your feedback.
It seems to me now that Tatoeba uses a method based on n-grams for Chinese and Japanese. It is even dirtier than the Jieba cut_for_search method...
Tatominer also uses n-grams now to better match Tatoeba results in both Chinese and Japanese. Changes will be visible tomorrow morning after the weekly update.
I also took the opportunity to rebuild all my rankings with a new method that I find more efficient. This will explain the many differences with the rankings of this week.
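For readers unfamiliar with the idea, a minimal sketch of character n-gram matching follows. I don't know the exact scheme Tatoeba or Tatominer uses, so treat the function names and the matching rule (a query matches if it equals one of the sentence's n-grams of the same length) as assumptions.

```python
def char_ngrams(text, n):
    """All overlapping character n-grams of `text`, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_match(sentence, query):
    """True if `query` is one of the sentence's n-grams of the same length."""
    if not query:
        return False
    return query in char_ngrams(sentence, len(query))


print(char_ngrams('社会保障制度', 2))
print(ngram_match('德国在1880年代采取了一种社会保障制度。', '保障'))
```

Unlike a word segmenter, this finds 保障 inside 社会保障, but it will also match character sequences that straddle two unrelated words, which is the "dirty" part.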
Is it possible to provide an Excel (or LibreOffice/OpenOffice ODS) file containing these search queries along with the actual number of sentences found?
I ask because a search query with a high frequency and a small number of found sentences probably indicates that sentences containing that query are **in demand** and that new sentences should be added. This would give native speakers the opportunity to add sentences containing those search queries.
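Something like the following filter is what I have in mind. It is only a sketch: the field names `searches` and `sentences`, the thresholds, and the sample rows are all made up for illustration.

```python
def in_demand(rows, min_searches=10, max_sentences=2):
    """Queries searched often but matched by few sentences, most searched first."""
    picked = [
        row for row in rows
        if row["searches"] >= min_searches and row["sentences"] <= max_sentences
    ]
    return sorted(picked, key=lambda row: row["searches"], reverse=True)


# Made-up rows; a real file would be read with the csv module instead.
rows = [
    {"query": "alpha", "searches": 50, "sentences": 0},
    {"query": "beta", "searches": 5, "sentences": 0},     # searched too rarely
    {"query": "gamma", "searches": 200, "sentences": 40},  # already well covered
    {"query": "delta", "searches": 80, "sentences": 2},
]
for row in in_demand(rows):
    print(row["query"], row["searches"], row["sentences"])
```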
I just published the lists of the most common Tatoeba queries in 150 languages: https://github.com/LBeaudoux/ta...common-queries
The data is available as CSV files that you can read with Excel or any other spreadsheet app.
Here is the file for Persian: https://raw.githubusercontent.c...ba_ranking.csv
Just quickly sorting the lines of the English file and glancing through the data makes it seem like this isn't something that would be useful for us to import if that is what you are suggesting.
The year is 1001.
The year is 1002.
The year is 1003.
The year is 1004.
The year is 1005.
The year is 1006.
The year is 1007.
The year is 1008.
The year is 1009.
He was 88 years old.
He was 89 years old.
He was 91 years old.
He was 92 years old.
He was 93 years old.
He was 94 years old.
He was born in 1155.
He was born in 1187.
He was born in 1188.
He was born in 1189.
That was in 1925.
That was in 1933.
That was in 1934.
That was in 1936.
That was in 1949.
That was in 1953.
SEE PAGE 2.
SEE PAGE 3.
SEE PAGE 4.
SEE PAGE 5.
SEE PAGE 6.
SEE PAGE 7.
SEE PAGE 8.
SEE PAGE 9.
Vardania, I.In May 2011 C.
There are many useful sentences in that list, along with a few that would not be useful to import.
That still means that someone who is proficient in both Abkhaz and English would need to review each sentence one-by-one to determine whether it should be imported or not. (Especially important considering the dataset was apparently generated by machine-translating monolingual data, which makes both accuracy and naturalness of the translations suspect.)
I'd encourage you to take up that task if you want, but given you rate your Abkhaz knowledge with one star, it might be better to focus on sentences you can understand without machine translation. (Otherwise it's too easy to be misled by a reasonable-sounding but incorrect translation of a complex sentence.)
I think if you don't understand a language, don't provide data on that language.
You already put data into the corpus, where we want to create information, and that isn't helpful.
This guy has collected a lot of corpus data.
Over 150,000 translated sentences from Russian to Abkhaz. However, the licence must be checked.
25,000 sentences from Abkhaz to English. However, the data must be checked before adding.
1. Under what licence are those released?
2. If copyright does not prevent their use in this case, someone should go through the sentences for quality assurance and probably convert them into a form in which they can be included in Tatoeba. Are you willing to do at least one of these, or do you know someone who is?
Ingush audio. Muzyka kaloya - Song of Kaloy