Wall (6,180 threads)


NurseMeeks, March 21, 2021 at 8:50:50 PM UTC

Hello everyone!
I need to let you all know that my heart fills with an overwhelming amount of joy every time I see a new translation of the stuff I've submitted. It creates this child-like excitement within me because it allows me to imagine communicating with a patient in their native language!

So, thank you for helping me and everyone else in healthcare who will benefit from your selfless contributions!

And, my most humble gratitude is for all of you who have found my errors and made suggestions for corrections.

I appreciate you all so much!

:)

gillux, March 21, 2021 at 9:40:16 PM UTC

#5268528 :-)

bunbuku, March 21, 2021 at 11:59:44 PM UTC

Your medical sentences are really useful for people all over the world.
Thank you, too, for submitting that. :)

DJ_Saidez, March 22, 2021 at 1:21:39 AM UTC

+1

Thanuir, March 22, 2021 at 12:41:17 PM UTC

Thank you for writing sentences with interesting specialist vocabulary.

CK, March 20, 2021 at 4:49:04 AM UTC

** Audio by CK **

I reached the 555,555 mark for audio files that I've added to English sentences at 04:25 UTC on 2021-03-20.

Screenshot
https://imgur.com/a/L0A2bBe

You can find the sentences with the most-recently added audio files at the top of this list.
https://tatoeba.org/eng/sentenc.../show/4000/und


30,705 (5.5%) of these sentences do not yet have translations into any language.
Perhaps you would like to be the first one to translate some of these.

Sort: Random
https://tatoeba.org/eng/sentenc...de&sort=random

Sort: Last Created
https://tatoeba.org/sentences/s...e&sort=created


Also, here is a page that I created a while back that some newer members may not yet know about, and some older members may have forgotten.

Translate English Sentences with Audio Using Pre-determined Searches
http://goo.gl/zP1V2i

ZegPhig, March 20, 2021 at 11:49:18 AM UTC

@CK Congratulations! Thank you for your titanic work!

soweli_Elepanto, March 20, 2021 at 6:07:21 PM UTC

+1 You make English closer to all the people!

adamtrousers, March 20, 2021 at 10:07:19 PM UTC

Good work!

lbdx, January 2, 2021 at 9:27:24 AM UTC (edited January 2, 2021 at 9:37:21 AM UTC)

Tatominer (a weekly ranking of the most underused expressions on Tatoeba) has been moved to https://tatominer.netlify.app

AlanF_US, January 3, 2021 at 3:31:28 PM UTC

Thanks for letting us know. I believe that you determine these words by looking at how many times they're specified in a search, right? Could you give us a general idea of that range of frequencies? Are searches for many of these words performed more than once? Also, am I right in thinking that you explicitly filter out bad searches (misspellings, mismatched languages)?

lbdx, January 3, 2021 at 5:33:37 PM UTC

I don't have all these frequencies at hand but I can tell you for example that regarding English, the expressions currently listed have been searched at least ten times over a period of one year. For French, it is at least twice.

Of course, in order to be usable, the content of the search log must be cleaned beforehand using spell checkers and dictionaries. I also try to remove queries that seem to be generated by bots.

AlanF_US, January 3, 2021 at 11:54:50 PM UTC (edited January 4, 2021 at 2:06:12 AM UTC)

I see.

I added sentences for the English words that had no occurrences (though I know that they won't come off the list until the next time it's updated).

lbdx, January 4, 2021 at 7:18:57 AM UTC

Thank you for improving the coverage of English vocabulary.

I hope that one day the Tatominer features will be integrated into the Tatoeba site, which would allow the lists of underused expressions to be updated in real time.

Sobsz, March 10, 2021 at 4:58:28 PM UTC

bumping, but also asking where you got the search data from, since i can't find it on the downloads page or really anywhere
does an admin send the files to you specifically or

Yorwba, March 10, 2021 at 10:59:45 PM UTC

The query data from April 2020 is available here: https://downloads.tatoeba.org/s...ueries.csv.bz2 (Directory listings are enabled on downloads.tatoeba.org, so if you want to find some data that's not explicitly advertised, you can just go explore for a bit.)

But I think lbdx also has some privileged access to get more up-to-date search logs.

CK, March 10, 2021 at 11:44:18 PM UTC (edited March 10, 2021 at 11:47:36 PM UTC)

Queries made 1,000 or more times (1,356):

http://tatoeba.ueuo.com/queries...2021-03-11.txt

The first number is the count, followed by the language code, followed by the query.

@Yorwba.
Does continuing to the next page of search results count as another query?


Here is a zip file of all the queries made 2 or more times.

http://tatoeba.ueuo.com/queries...2021-03-11.zip (13.1 MB)
(There were slightly over 100 done only once.)
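
For anyone who wants to process a file in the layout described above (count, then language code, then query), a minimal Python sketch might look like the following. The whitespace separator, the names `QueryCount`, `parse_line` and `frequent_queries`, and the sample lines are all assumptions made for illustration; they are not taken from the actual file.

```python
from typing import NamedTuple


class QueryCount(NamedTuple):
    count: int
    lang: str
    query: str


def parse_line(line: str) -> QueryCount:
    # Assumed layout: "<count> <lang> <query>", separated by whitespace.
    # The query itself may contain spaces, so split at most twice.
    count, lang, query = line.strip().split(maxsplit=2)
    return QueryCount(int(count), lang, query)


def frequent_queries(lines, min_count=1000):
    """Keep only the queries made at least `min_count` times."""
    parsed = (parse_line(line) for line in lines if line.strip())
    return [q for q in parsed if q.count >= min_count]


# Hypothetical sample lines, not real data from the file.
sample = [
    "1500 eng hello world",
    "999 fra bonjour",
    "2000 jpn こんにちは",
]
result = frequent_queries(sample)
```

If the real file uses tabs or another delimiter, only `parse_line` would need to change.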

lbdx, March 11, 2021 at 8:54:39 AM UTC

Tatominer is based on the public dump file mentioned by Yorwba. I don't have any privileged access to the Tatoeba database.

gillux generously generated it at my request about a year ago. By the way, as this file contains annual data, it would be nice to update it in the next few weeks.

Yorwba, March 11, 2021 at 8:06:54 PM UTC

The reason I thought you had access to more up-to-date search data is this notice in the sidebar:

> Expressions which appear for the first time in the " list are shown in bold type.

(stray quote, by the way)

And there are new bolded entries every week. I thought it might just be because new sentences were added, but the current list for Mandarin Chinese to German https://tatominer.netlify.app/cmn-deu.html contains 以免 in bold, and most of the search results are sentences by users who are no longer active: https://tatoeba.org/cmn/sentenc...ns_link=direct

lbdx, March 13, 2021 at 9:25:12 AM UTC

When a word listed by Tatominer finally has at least 3 example sentences, it is removed from the ranking and replaced by the next most searched word in the same language that has less than 3 sentences. This incoming phrase appears in bold during its first week in the ranking.
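
The replacement rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `update_ranking`, its parameters, the ranking size and the sample numbers are hypothetical, and Tatominer's real implementation may work differently.

```python
def update_ranking(search_counts, sentence_counts, previous, size=3, min_sentences=3):
    """Return (ranking, new_entries) for one weekly update.

    search_counts:   word -> number of searches
    sentence_counts: word -> number of example sentences
    previous:        set of words listed in last week's ranking
    """
    # Only words with fewer than `min_sentences` examples stay eligible.
    eligible = [w for w in search_counts if sentence_counts.get(w, 0) < min_sentences]
    # Most searched first; ties broken alphabetically to keep the order stable.
    eligible.sort(key=lambda w: (-search_counts[w], w))
    ranking = eligible[:size]
    # Words that were not listed last week would be shown in bold.
    new_entries = [w for w in ranking if w not in previous]
    return ranking, new_entries


# Hypothetical numbers for illustration only.
searches = {"alpha": 50, "beta": 40, "gamma": 30, "delta": 20}
sentences = {"alpha": 3, "beta": 1, "gamma": 0, "delta": 2}
ranking, new_entries = update_ranking(
    searches, sentences, previous={"alpha", "beta", "gamma"}
)
```

Here "alpha" has reached three sentences, so it drops out and "delta", the next most searched word with fewer than three sentences, enters the ranking as the only new entry.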

Yorwba, March 14, 2021 at 8:24:42 PM UTC

Well, the current ranking https://tatominer.netlify.app/cmn-deu.html includes 保障 in bold, and there are 5 Chinese sentences containing it: https://tatoeba.org/cmn/sentenc...9%9A%9C%22&to=

#9839395 is new, but #397099 , #398923 , #3383957 and #472968 were added in 2010, 2010, 2014 and 2010, respectively, and the last one had a German translation added by me in 2020.

So I think this word should've appeared in an earlier ranking and should have its translation count listed as 1, not 0.

lbdx, March 15, 2021 at 9:05:57 AM UTC

For a word to be in the list of words to be translated, it must appear in at least three Tatoeba sentences.

According to Jieba (the word segmenter I use for Chinese), the sentences you mention are segmented as follows:

['平等', '是', '由', '宪法', '保障', '的']
['社会保障', '吗', '他们', '在', '说', '什么', '笑话']
['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度']
['思想', '自由', '為', '憲法', '所', '保障']
['中国共产党', '的', '领导', '是', '中国', '人民', '摆脱', '贫困', '的', '根本', '保障']

Since the third phrase containing 保障 was only added last week, this word is only entering the rankings now, even though it is more searched for than some words that were already in the list.

In summary, the differences in the counts are due to the use of different segmenters. For Chinese, I have the feeling that my tool is more powerful than Tatoeba's but the differences are so significant that I gave up taking into account words composed of only one ideogram. As a contributor interested in the Chinese corpus, do you think it is the right decision?
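
To make the difference in counts concrete, here is a small Python sketch that uses the five segmentations quoted above as fixed data (so it does not need Jieba installed). The helper names `count_token_matches` and `count_substring_matches` are made up for this illustration, and the substring count is only a stand-in for a looser search, not a description of how Tatoeba's search engine works.

```python
# Segmentations as quoted in the post above.
segmented = [
    ['平等', '是', '由', '宪法', '保障', '的'],
    ['社会保障', '吗', '他们', '在', '说', '什么', '笑话'],
    ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度'],
    ['思想', '自由', '為', '憲法', '所', '保障'],
    ['中国共产党', '的', '领导', '是', '中国', '人民', '摆脱', '贫困', '的', '根本', '保障'],
]


def count_token_matches(word, sentences):
    """Number of sentences in which `word` is a whole token."""
    return sum(word in tokens for tokens in sentences)


def count_substring_matches(word, sentences):
    """Number of sentences in which `word` appears anywhere in the text."""
    return sum(word in ''.join(tokens) for tokens in sentences)


token_hits = count_token_matches('保障', segmented)
substring_hits = count_substring_matches('保障', segmented)
```

With whole-token matching only three of the five sentences count, because 社会保障 is kept as a single token in the other two; a plain substring match counts all five.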

Yorwba, March 15, 2021 at 10:50:23 PM UTC

Although it is a bit confusing that Tatominer's criteria differ from Tatoeba's search engine, I think it makes sense to be a bit more strict about what counts as a match for Tatominer to improve translation coverage.

But in the case of 社会保障, I think that it should count as an instance of 保障 even if it's as part of a larger noun phrase. Your examples match jieba.cut, e.g.
list(jieba.cut('德国在1880年代采取了一种社会保障制度。')) == ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度', '。']
but there's also jieba.cut_for_search, e.g.
list(jieba.cut_for_search('德国在1880年代采取了一种社会保障制度。')) == ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会', '保障', '社会保障', '制度', '。']
where 社会保障 appears both as one word and split into 社会 and 保障. I think that would be more appropriate to use in this case.
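
As a quick check of that point, the two outputs quoted above can be compared directly. The token lists below are copied from the post rather than produced by calling Jieba, and the `matches` helper is a hypothetical name used only here.

```python
# Outputs quoted above for the same sentence.
cut_tokens = ['德国', '在', '1880', '年代', '采取', '了', '一种', '社会保障', '制度', '。']
search_tokens = ['德国', '在', '1880', '年代', '采取', '了', '一种',
                 '社会', '保障', '社会保障', '制度', '。']


def matches(word, tokens):
    """True if `word` is one of the tokens."""
    return word in tokens


# Tokens that only the search-oriented segmentation produces.
extra_tokens = set(search_tokens) - set(cut_tokens)

found_with_cut = matches('保障', cut_tokens)
found_with_cut_for_search = matches('保障', search_tokens)
```

Only the search-oriented output lets 保障 match this sentence, while 社会保障 still matches in both.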

Regarding single-character words, there are quite a few characters that occur almost always on their own, so completely ignoring them wouldn't be great. If the issue is that people are searching for single characters that mostly occur in compound words, then I think treating those as matches wouldn't be too bad, since usually a compound word's meaning is derived from its components.

lbdx, March 19, 2021 at 1:28:18 PM UTC (edited March 19, 2021 at 1:35:46 PM UTC)

Thanks for your feedback.

It seems to me now that Tatoeba uses a method based on n-grams for Chinese and Japanese. It is even dirtier than the Jieba cut_for_search method...

Tatominer also uses n-grams now to better match Tatoeba results in both Chinese and Japanese. Changes will be visible tomorrow morning after the weekly update.

I also took the opportunity to rebuild all my rankings with a new method that I find more efficient. This will explain the many differences from this week's rankings.
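
For readers unfamiliar with the idea, character n-gram matching can be sketched in a few lines of Python. How Tatoeba's search engine or Tatominer actually builds and indexes n-grams is not stated in this thread, so the function names and the choice of n equal to the query length are assumptions made purely for illustration.

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_match(query, sentence):
    """True if the query occurs as a character n-gram of the sentence."""
    return query in char_ngrams(sentence, len(query))


sentence = '德国在1880年代采取了一种社会保障制度。'
hit_short = ngram_match('保障', sentence)
hit_long = ngram_match('社会保障', sentence)
miss = ngram_match('保证', sentence)
```

Because no word boundaries are involved, 保障 matches here even though a word segmenter may keep 社会保障 as a single token.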

cojiluc, March 16, 2021 at 9:47:35 AM UTC (edited March 16, 2021 at 9:51:00 AM UTC)

Is it possible to provide an Excel (or LibreOffice/OpenOffice ODS) file containing these search queries along with the actual number of sentences found?

I ask because a query that is searched often but returns few sentences probably indicates that sentences containing it are **in demand** and that new ones should be added. This would give native speakers the opportunity to add sentences containing those queries.
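
One way to rank queries along those lines, given a table of search counts and sentence counts, is sketched below. The scoring formula (searches divided by sentences found plus one), the name `most_in_demand` and the sample rows are all hypothetical; they only illustrate the idea of "searched often, found rarely".

```python
def most_in_demand(rows, top=2):
    """rows: iterable of (query, search_count, sentences_found) tuples."""
    # Higher score = searched often but poorly covered by existing sentences.
    scored = [(searches / (found + 1), query) for query, searches, found in rows]
    # Highest score first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [query for _, query in scored[:top]]


# Hypothetical rows for illustration only.
rows = [
    ("apple", 100, 500),
    ("zeugma", 40, 0),
    ("kerfuffle", 30, 2),
    ("the", 900, 100000),
]
wanted = most_in_demand(rows)
```

Very common words such as "the" are searched a lot but already have many sentences, so they fall to the bottom of the list.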

lbdx, March 19, 2021 at 1:14:50 PM UTC (edited March 19, 2021 at 1:33:59 PM UTC)

I just published the lists of the most common Tatoeba queries in 150 languages: https://github.com/LBeaudoux/ta...common-queries

The data is available as CSV files that you can read with Excel or any other spreadsheet app.

Here is the file for Persian: https://raw.githubusercontent.c...ba_ranking.csv

QAzaqQA, March 16, 2021 at 6:16:13 PM UTC (edited March 16, 2021 at 6:17:20 PM UTC)

https://object.pouta.csc.fi/Tat...abk-eng.abk.gz
https://object.pouta.csc.fi/Tat...abk-eng.eng.gz

A corpus of more than 12,000 Abkhaz-to-English sentences.

CK, March 16, 2021 at 11:11:26 PM UTC (edited March 16, 2021 at 11:12:06 PM UTC)

From here?

https://github.com/Helsinki-NLP...ranslations.md

Just quickly sorting the lines of the English file and glancing through the data makes it seem like this isn't something that would be useful for us to import if that is what you are suggesting.

The year is 1001.
The year is 1002.
The year is 1003.
The year is 1004.
The year is 1005.
The year is 1006.
The year is 1007.
The year is 1008.
The year is 1009.
[snip]

He was 88 years old.
He was 89 years old.
He was 91 years old.
He was 92 years old.
He was 93 years old.
He was 94 years old.
[snip]

He was born in 1155.
He was born in 1187.
He was born in 1188.
He was born in 1189.
[snip]

That was in 1925.
That was in 1933.
That was in 1934.
That was in 1936.
That was in 1949.
That was in 1953.
[snip]

Yes, yes.
146 B.C.E.
166 B.C.E.
172 B.C.E.
174 B.C.E.
196 B.C.E.
197 B.C.E.
198 B.C.E.
199 B.C.E.
219 B.C.E.
[snip]

B., 1976
Batman).
Folklor.
Gagba M.
Ialkaou.
Rome (c.
A little.
Abbet (c.
Abkhazia.
Agrba, E.
Annas (c.
Aurea, V.
B. (U.S.)
Children.
Degoev V.
Dewdrops.
[snip]

SEE PAGE 2.
SEE PAGE 3.
SEE PAGE 4.
SEE PAGE 5.
SEE PAGE 6.
SEE PAGE 7.
SEE PAGE 8.
SEE PAGE 9.
[snip]

Seven atoms.
The jackpot.
Turin, 1890.
Uncleanness.
Vardania, I.In May 2011 C.
Irresen (Ager.
Belgique, ager.
Benetics (Ache.
Brotherly love.
Catovica (Apol.
Chandor Petefi.
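
The kind of repetition visible in this sample can also be spotted automatically. The following Python sketch masks digits and counts how often each resulting pattern occurs; the names `template_of` and `repeated_templates`, the threshold of three, and the choice of sample lines (a few taken from the excerpt above) are assumptions for illustration, not the method actually used to inspect the file.

```python
import re
from collections import Counter


def template_of(sentence):
    """Replace every run of digits with a placeholder."""
    return re.sub(r'\d+', '<num>', sentence)


def repeated_templates(sentences, min_repeats=3):
    """Patterns that occur at least `min_repeats` times once digits are masked."""
    counts = Counter(template_of(s) for s in sentences)
    return {t: c for t, c in counts.items() if c >= min_repeats}


# A few lines taken from the excerpt above.
sample = [
    "The year is 1001.", "The year is 1002.", "The year is 1003.",
    "He was 88 years old.", "He was 89 years old.", "He was 91 years old.",
    "SEE PAGE 2.", "SEE PAGE 3.",
    "Brotherly love.",
]
templates = repeated_templates(sample)
```

Sentences that collapse into the same pattern many times over are good candidates for skipping or for manual review before any import.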

QAzaqQA, March 17, 2021 at 5:46:29 AM UTC

There are many useful sentences in that list, along with a few that wouldn't be useful to import.

Yorwba, March 17, 2021 at 10:14:20 PM UTC

That still means that someone who is proficient in both Abkhaz and English would need to review each sentence one-by-one to determine whether it should be imported or not. (Especially important considering the dataset was apparently generated by machine-translating monolingual data, which makes both accuracy and naturalness of the translations suspect.)

I'd encourage you to take up that task if you want, but given you rate your Abkhaz knowledge with one star, it might be better to focus on sentences you can understand without machine translation. (Otherwise it's too easy to be misled by a reasonable-sounding but incorrect translation of a complex sentence.)

Cabo, March 18, 2021 at 1:04:59 PM UTC

I think if you don't understand a language, don't provide data on that language.

You already put data into the corpus, where we want to create information, and that isn't helpful.

QAzaqQA, March 18, 2021 at 4:43:33 AM UTC (edited March 18, 2021 at 12:37:37 PM UTC)

This guy has collected a lot of corpus data.

https://github.com/danielinux7/...arallel-Corpus

QAzaqQA, March 18, 2021 at 5:51:14 AM UTC

Over 150,000 sentences translated from Russian to Abkhaz. However, the licence must be checked.

https://github.com/danielinux7/...-27-07.bifixed

QAzaqQA, March 18, 2021 at 4:49:19 AM UTC

Kabardian Dictionary

http://www.amaltus.com/%d0%b7%d...7%d0%ba%d0%b8/

QAzaqQA, March 17, 2021 at 5:51:55 AM UTC

25000 sentences of Abkhaz to English. Data must however be checked before adding.

https://object.pouta.csc.fi/Tat...ge/abk-eng.tar

Thanuir, March 17, 2021 at 7:32:28 AM UTC

1. What license are those under?

2. If copyright does not prevent their use in this case, someone would need to go through the sentences for quality assurance and probably convert them into a form in which they can be included in Tatoeba. Are you willing to do at least one of these, or do you know someone who is?