Wall (6,756 threads)

Tips

Before asking a question, make sure to read the FAQ.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

AlanF_US AlanF_US January 15, 2023, edited January 15, 2023 January 15, 2023 at 6:17:24 PM UTC, edited January 15, 2023 at 6:30:04 PM UTC link Permalink

To continue the discussion from these threads:

https://tatoeba.org/en/wall/sho...#message_39333
https://tatoeba.org/en/wall/sho...#message_39361
https://tatoeba.org/en/wall/sho...#message_39380
https://tatoeba.org/en/wall/sho...#message_39395

Humans need variety and context in order to understand and learn language. A single word may have several meanings. It may be used with some words but not others. It may have a level of formality that makes it only suitable for certain situations. If these nuances are to be induced from a collection of sentences, the collection must be sufficiently diverse. In addition, the human mind needs constant stimulation. Duplicate or near-duplicate sentences create boredom, and a bored mind refuses to learn.

The set of sentences "They learned some Greek while they were on vacation", "Rima and Skura learned valuable life lessons the hard way", and "Our mother, we learned later, had been saving money in a separate account" demonstrates different kinds of things that can be learned (in addition to teaching additional vocabulary). By contrast, the set of sentences "<X> learned Berber", "<Y> learned Berber", and "<Z> learned Berber" tells you nothing about Berber other than that it can be learned, teaches you nothing about learning other than that it is a process that can be applied to Berber, and gives you no vivid picture that might help your mind retain the words or even simply brighten your day. If X, Y, and Z have different gender, number, and person, and the sentences are translated into a language where this affects the form of the equivalent of the word "learned", this set of sentences can show you how to conjugate the verb "learn" in the past tense, but it won't tell you anything else. It won't even tell you as much about verb conjugation as a verb chart will. A set of sentences in random order can't give you visual clues, such as placement on the page, that bring out the inherent patterns. And with no starting point, there's no suitable place to explain important points such as where the patterns hold and where they change or break down.

Sixteen years ago, there was no shortage of dictionaries, or verb charts, or descriptions of other aspects of grammar, either printed or in electronic form. But it was hard to find searchable collections of sample sentences that would demonstrate the usage of words and phrases. Trang founded what became Tatoeba in order to address her own frustration with that shortage. She opened her new project up to accept contributions from virtually anyone, with no barrier to entry other than the ability to work on an electronic device over the Internet using a free and anonymous account, very few restrictions other than a set of gradually evolving guidelines that were minimally enforced, and an interface that enabled correction of errors after they had been introduced instead of a moderated pipeline that would require sentences to be approved before going live. The website also put itself in the world of social media by providing the ability to leave comments on sentences and on a general "wall", and provided statistics on the number of contributions by language and by individual contributor.

This structure was both a blessing and a curse. It contained the potential for a wide variety of people around the world to collaborate in harmony to contribute an almost infinitely diverse set of utterances. It also contained the potential for individuals to antagonize each other while mass-producing colorless sentences of minimal value that were less valuable at Tatoeba than they would have been elsewhere.

Tatoeba is somewhere between these extremes of paradise and hell. There are many worthwhile corners of variety in the corpus for people who know how to look for it and are willing to expend the effort to extricate the jewels from the trash. The problem is that large-scale contribution of near-duplicate sentences makes this harder and harder to do. Since those sentences have similar content and are added to Tatoeba at the same time, they'll all show up in the same place in your results, and will crowd out everything else. I'm not saying that adding ten "<> learned some Berber" sentences is a problem, or that someone needs to panic about adding a sentence that might be similar to one that someone else has already added. But adding a hundred "<> learned some Berber" and a hundred "<> will learn some Berber" sentences, or going on a rampage of adding a "Ziri" copy of every "Tom" sentence you can find, is another matter.

Furthermore, the elimination of variety tends to propagate itself via translation, or via copycat or retaliatory behavior. In an unintentionally ironic post below (https://tatoeba.org/en/wall/sho...essage_39426), CK filled a screen with a near-duplicate series of near-duplicate sentence pairs (35 of them!) in an apparent attempt to bolster his assertion that eliminating variety in the use of proper nouns is a way to avoid lack of variety in other aspects of language. I think the preposterousness of this idea is self-evident: it's like claiming that preventing someone from learning history is going to make them a better math student. But it's especially worth pointing out that many of those duplicate sentences were added in reaction to the ubiquity of the name "Tom" throughout the corpus. So CK, or more broadly speaking, the general elimination of one aspect of variety, had a large hand in creating the problem in the first place.

It would be nice to introduce some mechanized way of mitigating the problems introduced by mechanized creation of sentences (including using one's mind like a machine, whether or not a computer is involved). But as I have already described, I don't think this can be done easily, quickly, or in a way that would approach any meaningful consensus. Instead, I want to explain to the people who are tempted to add near-duplicate sentences to Tatoeba on an industrial scale that they are probably better off doing the same thing elsewhere, and that there are easy ways to do it that they might not have thought of.

Let's say you want to collect sentences and audio containing Kabyle placenames for a GPS, or that you want to cover all the conjugated forms of Berber verbs that you can think of. It's not as though you have thousands of contributors in unison acting on a top-security project that requires millisecond response time. Is it not simply possible to write the sentences in a spreadsheet on your computer and put the audio files you record into a folder? Or in Google Docs or Google Drive? Or set up your own simple website? (A simple search for "easy free ways to set up a database on the web" brings up 924 million results.) They'll be easier to retrieve from such a place anyway. Why involve Tatoeba at all?

Choosing the best site to help you prepare a special project takes a little bit of imagination and a little bit of effort. So does contributing sentences with enough variety to be worth something to others at Tatoeba. Don't let that stop you.

shekitten shekitten January 15, 2023 January 15, 2023 at 9:48:50 PM UTC link Permalink

Has an end user ever complained about near duplicates?

Has anyone ever said, "I'm not going to use Tatoeba's corpus because of the number of near duplicates"?

It seems like a huge drain on time and resources to care about them, when there are more serious problems. Near duplicates are not a problem that affects anyone; hate speech is. The worst thing that can result from near duplicates is frustration or boredom; the worst thing that can result from hate speech is genocide.

Cabo Cabo January 16, 2023, edited January 16, 2023 January 16, 2023 at 6:19:42 PM UTC, edited January 16, 2023 at 6:24:06 PM UTC link Permalink

Más oldalak, melyek felhasználják az itteni mondatokat, mind rendszerezték valamely formában azokat.
Itt nincs rendszerezve, random mondatok közt vagy egy 20-as mondatlista lefordítása közben hamar megunja az ember magát ha csupa Lvl 1 mondatot talál, vagy ugyanolyanokat, csak más szereplƑvel.
Vagy legyen rendszerezhetƑbb itt pár dolog, esetleg precízebb keresési kritériumok, vagy legyen megállapodás arról, hogy kell-e nekünk ennyi adat és nem pedig információ.

Maybe not affects anyone, but more poeple than I first thought.
No, what you find hate speech, also not affects anyone, but probably some, even many, still, it is not that wall message what is about hate speech.

Cangarejo Cangarejo January 16, 2023 January 16, 2023 at 8:48:55 PM UTC link Permalink

> Here they are not organized. Among random sentences, or while translating a list of 20 sentences, one quickly gets bored when finding nothing but Lvl 1 sentences, or identical sentences that differ only in the character named.

Have you tried using Tatominer?

Cabo Cabo January 16, 2023 January 16, 2023 at 9:09:20 PM UTC link Permalink

Yes, I tried it a few times, but the problem was that the system did not recognize the words correctly; Hungarian uses a ton of word endings.

Cangarejo Cangarejo January 16, 2023, edited January 18, 2023 January 16, 2023 at 9:44:47 PM UTC, edited January 18, 2023 at 10:19:10 PM UTC link Permalink

There are lemmatizers for Hungarian, but maybe they’re too slow.

Polgar1 Polgar1 January 18, 2023 January 18, 2023 at 12:50:14 PM UTC link Permalink

Yes, quite sure there have been "end users" put off by the usefulness of the content considering the size. If you have nothing to add, that's fine but please give up on attempting to hijack the topic, at least.

shekitten shekitten January 29, 2023, edited January 29, 2023 January 29, 2023 at 4:24:18 PM UTC, edited January 29, 2023 at 4:29:32 PM UTC link Permalink

> Yes, quite sure there have been "end users" put off by the usefulness of the content considering the size.

Can you name a single one?

> give up on attempting to hijack the topic

The relative unimportance of this issue compared to more serious ones, ones that actually keep people away from this site, is completely relevant. We spend more time talking about avoiding near duplicates than anything else. Why aren't we fixing more serious problems?

Cabo Cabo January 29, 2023 January 29, 2023 at 4:32:29 PM UTC link Permalink

"The relative unimportance of this issue compared to more serious ones, ones that actually keep people away from this site, is completely relevant."

Every issue has its own wall post.
If it has no post, just start one, but do not try to derail a different issue conversation.
Thank you.

January 29, 2023 January 29, 2023 at 4:33:51 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

January 29, 2023, edited January 29, 2023 January 29, 2023 at 4:37:46 PM UTC, edited January 29, 2023 at 4:39:47 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023 February 6, 2023 at 2:26:55 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023, edited February 6, 2023 February 6, 2023 at 6:14:47 PM UTC, edited February 6, 2023 at 6:28:40 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023, edited February 6, 2023 February 6, 2023 at 7:58:28 PM UTC, edited February 6, 2023 at 8:17:40 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023 February 6, 2023 at 8:40:30 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023, edited February 6, 2023 February 6, 2023 at 8:51:18 PM UTC, edited February 6, 2023 at 8:52:15 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

February 6, 2023 February 6, 2023 at 9:00:10 PM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

Polgar1 Polgar1 January 29, 2023 January 29, 2023 at 4:36:38 PM UTC link Permalink

> Can you name a single one?

MaxDailene.

Now can we move on to actually addressing the problem?

shekitten shekitten January 29, 2023 January 29, 2023 at 8:23:54 PM UTC link Permalink

We've been addressing near duplicates for over a decade. I don't think they're going anywhere.

Clearly, a lot of people - not just me, who actually uses the names Tom and Mary most of the time - see much bigger and more serious problems than near-duplicates. Like the fact that the names Tom and Mary do not reflect their culture. And seeing this, they ignore the demand to eliminate near duplicates, which they never voted on. I support them doing this. It's unjust that people are pressured to use the names Tom and Mary, which is an extremely political and polarizing demand.

And when @AlanF_US ignores people who ask legitimate questions, what else should anyone do besides ignore his demands?

Polgar1 Polgar1 February 15, 2023 February 15, 2023 at 12:37:45 PM UTC link Permalink

It's a bit bizarre that you yourself say that a community solution of this problem is willingly sabotaged by some people, including you, shortly after announcing that the problem won't be solved anyway. Well, maybe then do your share of solving the problem and we'll all be happily able to forget it for good?

Also, I don't see it anywhere written that the elimination of near duplicates must mean that all sentences must be with the names "Tom" and "Mary"; I rather read it the opposite way - don't add yet another "Tom" or "Mary" variant just to finish this weird collection.

And finally, unlike the topic of near duplicates - which is a meaningful topic regarding the evaluation of a linguistic corpus - bringing "extremely political and polarizing" narratives is just not a meaningful, constructive topic for Tatoeba. This is quite an important difference between the "demands" - one is approachable for everyone simply by using universal reasoning, the other requires a political lens, hence starting off by arbitrary, not-so-well-agreed-upon division of a community that really just could work on common principles.

gillux gillux January 21, 2023 January 21, 2023 at 5:14:42 AM UTC link Permalink

It feels sad, but thank you for taking the time to write about these issues. Call me a "technical solutionist", but I believe there are technical ways to at least mitigate the problem, in addition to convincing members to change their behavior. But as you said, these are costly, so they need to be thought through carefully before being implemented.

Just to make sure we are talking about the same thing: Basically, you cannot get useful results when searching because all the results look similar, so you need to scroll over and over in order to find useful sentences.

I saw interesting ideas here:
https://github.com/Tatoeba/tatoeba2/issues/2816

I'd like to suggest the following approaches, in order of estimated cost of development (from cheapest to most expensive):
- add a "number of words" search filter to easily exclude short sentences
- expose per-language (and per-user?) statistics about sentences "diversity", similar to what can be seen in the Github issue
- introduce a new, smarter ranking algorithm that favors sentences based on their uniqueness, length, or other criteria, using GDEX as an inspiration (https://www.sketchengine.eu/guide/gdex/). This, however, brings new political problems of how to decide on the criteria.
- cluster search results so that similar sentences are grouped, only display one sentence of each group, but allow clicking on a group to see all the hidden sentences
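
A minimal sketch of the clustering idea in the last bullet, assuming that "similar" means "identical except for one whitespace-separated token" (for example, only the name changes). The function names, the `<>` slot marker, and the `min_size` cutoff are all hypothetical; this is not how Tatoeba's search works, and a real implementation would need language-aware tokenization.

```python
from collections import defaultdict


def template_keys(sentence):
    """Yield one key per token position, with that token replaced by a slot."""
    tokens = sentence.lower().rstrip(".!?").split()
    for i in range(len(tokens)):
        yield " ".join(tokens[:i] + ["<>"] + tokens[i + 1:])


def cluster_near_duplicates(sentences, min_size=3):
    """Group sentences that are identical except for a single token.

    Returns (clusters, singles): clusters is a list of (template, members),
    singles holds the sentences that did not fall into any cluster.
    """
    buckets = defaultdict(list)
    for s in sentences:
        for key in template_keys(s):
            buckets[key].append(s)

    clusters, seen = [], set()
    # Visit the largest buckets first so each sentence joins its biggest group.
    for key, members in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        fresh = [s for s in members if s not in seen]
        if len(fresh) >= min_size:
            clusters.append((key, fresh))
            seen.update(fresh)

    singles = [s for s in sentences if s not in seen]
    return clusters, singles


sentences = [
    "Tom learned Berber.",
    "Ziri learned Berber.",
    "Rima learned Berber.",
    "They learned some Greek while they were on vacation.",
]
clusters, singles = cluster_near_duplicates(sentences)
```

A results page could then show one representative per cluster plus all the singles, and reveal the rest of a cluster on click.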

lbdx lbdx January 21, 2023 January 21, 2023 at 10:12:20 AM UTC link Permalink

@gillux Thank you for the update on these technical solutions. I hope we will find time to implement some of them in the future.

I would like to add to your list of features the possibility to "mute" a contributor in the settings: https://github.com/Tatoeba/tatoeba2/issues/2008

AlanF_US AlanF_US January 21, 2023 January 21, 2023 at 5:16:56 PM UTC link Permalink

@gillux Now that we've had a chance to hold a comprehensive discussion of behavior, I'm happy to have us discuss technical solutions. Any of the ones that you and @lbdx mentioned would be helpful.

Until we are able to implement these technical approaches, people can use random search when looking up words or when finding candidate sentences to translate. One additional technical solution would be to make random search the default, rather than "relevance" (which favors shorter sentences and tends to show near-duplicates in clusters). I can imagine some resistance to this idea, since some users probably prefer seeing the shorter sentences even if they lack context, but it's something we can discuss.

qwertzu qwertzu January 21, 2023 January 21, 2023 at 7:00:31 AM UTC link Permalink

I believe this is a search issue rather than a contribution issue.

Here's an example.
The word "content" has the following word senses (from Wiktionary):

A. satisfied
B. that which is contained
C. subject matter
D. the amount of material contained
E. mathematics: space contained by a polytope

As a thought experiment, suppose that the corpus contains 100 example sentences
of each word sense. If the search results displayed all the A sentences before all the B sentences, and all the B sentences before all the C sentences, etc., then there's going to be a perceived lack of diversity, regardless of how diverse the corpus really is.
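
The thought experiment can be sketched in a few lines, assuming every result already carries a sense label (which Tatoeba does not provide; the labels, the tuple layout, and the function name are hypothetical). Interleaving the groups round-robin puts one sentence of each sense at the top instead of 100 A sentences in a row.

```python
from itertools import zip_longest


def interleave_by_group(results, group_of):
    """Reorder results so that groups take turns (round-robin)."""
    groups = {}
    for r in results:
        groups.setdefault(group_of(r), []).append(r)

    missing = object()  # sentinel that pads the shorter groups
    mixed = []
    for row in zip_longest(*groups.values(), fillvalue=missing):
        mixed.extend(r for r in row if r is not missing)
    return mixed


# (sense label, sentence id) pairs, grouped by sense as in the example above
results = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("C", 1), ("C", 2)]
mixed = interleave_by_group(results, group_of=lambda r: r[0])
```

The same reordering would work with any grouping key, not only word senses.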

lbdx lbdx January 21, 2023 January 21, 2023 at 10:26:59 AM UTC link Permalink

To be able to filter the results according to the senses of the searched word would indeed be a very nice feature. However, I sincerely doubt that we will be able to implement it for more than 400 languages anytime soon.

AlanF_US AlanF_US January 21, 2023, edited January 21, 2023 January 21, 2023 at 4:48:26 PM UTC, edited January 21, 2023 at 5:19:34 PM UTC link Permalink

> If the search results displayed all the A sentences before all the B sentences, and all the B sentences before all the C sentences, etc., then there's going to be a perceived lack of diversity, regardless of how diverse the corpus really is.

Diversity in terms of word sense is a valid thing to consider, but it's the kind of criterion that someone would apply only if they were doing an analysis that required some thought. I was talking about the lack of diversity that jumps out at someone instantly because their search results are all identical except for variation in a single respect (such as <placename> in sentences of the form "I drove to <placename>").

Also, sentences with sense A, B, C, etc. will only end up clustered together in the search results if they share a characteristic with the criterion being used for the sort (for instance, the user is searching by sentence creation date, and sentences with sense A all happened to be added before sentences with sense B). And sentences with sense A will only crowd sentences with sense B off the page if there are lots of sentences with sense A. For these reasons, I think it's unlikely that there are many pockets of "hidden sense diversity" that can be found in the tail end of search results but are not visible closer to the beginning.

It's true that there may be more diversity in the corpus than is visible on a single page of search results. However, we can't expect all users to either scroll for an indefinitely long period of time or use sorting criteria to increase the variety at the top of the search results.

lbdx lbdx January 21, 2023 January 21, 2023 at 9:55:04 AM UTC link Permalink

> I want to explain to the people who are tempted to add near-duplicate sentences to Tatoeba on an industrial scale that they are probably better off doing the same thing elsewhere

@AlanF_US Thank you for taking the time to talk to the few Tatoebans involved. It looks like you've managed to convince them to reduce the volume of their contributions (at least temporarily). This is really good news 🎉

Some of the larger contributors may be wondering if the sentences they have added recently are diverse enough. You can check the table at the bottom of https://colab.research.google.c...P8?usp=sharing that gives a lexical diversity score to the current month's contributions. The measure used is called MTLD. It is recognized as reliable because it does not vary with the length of the texts analyzed. The higher the MTLD score, the more diverse the sentences added by a contributor. If your score is below 25, your sentences probably contain a high proportion of near-duplicates.
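
For readers curious about what the score measures, here is a minimal sketch of MTLD as it is commonly described: walk through the tokens, count a "factor" each time the running type-token ratio falls to the threshold, add a partial factor for the leftover tokens, divide the token count by the factor total, and average a forward and a backward pass. The 0.72 threshold is the conventional default; the whitespace tokenization, the function names, and the `<=` comparison are assumptions, and the linked notebook may differ on each.

```python
def _mtld_one_direction(tokens, threshold=0.72):
    """Token count divided by the number of factors, reading left to right."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            # Type-token ratio dropped to the threshold: one full factor.
            factors += 1
            types.clear()
            count = 0
    if count > 0:
        # Leftover tokens contribute a partial factor.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward passes."""
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


repetitive = "tom learned berber".split() * 20
score = mtld(repetitive)
```

Under these assumptions, twenty repetitions of a three-word sentence score 5.0, well under the cutoff of 25 mentioned above, while a text that never repeats a word has no factors at all and is reported as infinity.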

AlanF_US AlanF_US January 21, 2023 January 21, 2023 at 4:55:40 PM UTC link Permalink

> It looks like you've managed to convince them to reduce the volume of their contributions (at least temporarily). This is really good news

I'm happy to hear that.

The idea of displaying the MTLD for a contributor is interesting. Would it be feasible to give someone the ability to do one of the following?

(1) see the figures for the top X contributors, where X is some arbitrary number other than the one you're already using

(2) type in a username to see the figure for that contributor alone

lbdx lbdx January 21, 2023 January 21, 2023 at 6:00:23 PM UTC link Permalink

For now, the online tatowatch notebook is read-only but those who would like to edit and run it can copy it to their Google drive or download it to their own machine.

The idea of this project is to help the moderation team to keep an eye on new contributions. If any of you have feature suggestions or want to participate in this project, please feel free to contact me in private message.

yaron yaron February 11, 2023 February 11, 2023 at 7:10:58 PM UTC link Permalink

Friends, Hebrew speakers: we are 4 sentences away from a complete translation of the user interface into Hebrew. Because I lack a deep understanding of Japanese linguistics and of the system's requirements, I don't know how to translate the 4 remaining messages.
I'd be glad if those in the know could lend a hand there, so that we can close the translation issue for the time being.

Thanks!

fekundulo fekundulo February 14, 2023 February 14, 2023 at 3:44:52 PM UTC link Permalink

Ready to help. What are the sentences?

yaron yaron February 15, 2023, edited February 19, 2023 February 15, 2023 at 6:42:53 AM UTC, edited February 19, 2023 at 7:43:02 AM UTC link Permalink

Ś”ŚšŚŚ©Ś•ŚŸ:
Japanese Indices
Link to the source in the code:
https://github.com/Tatoeba/tato...loads.ctp#L428
The second:
Contains the equivalent of the "B lines" in the Tanaka Corpus file distributed by Jim Breen. See <1>this page</1> for the format. Each entry is associated with a pair of Japanese/English sentences. {sentenceId} refers to the id of the Japanese sentence. {meaningId} refers to the id of the English sentence.
Link to the source in the code:
https://github.com/Tatoeba/tato...loads.ctp#L438

There are two more strings with somewhat confusing explanations, but I'll manage with them somehow. Beyond that, there are names of a variety of very esoteric languages for which I can't find any reference on how to render them in Hebrew (there are about 57 such languages; everything appears in the project's Transifex).
Thanks :)

February 14, 2023 February 14, 2023 at 11:09:11 AM UTC link Permalink
warning

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

CK CK February 14, 2023, edited February 14, 2023 February 14, 2023 at 10:57:26 AM UTC, edited February 14, 2023 at 11:02:27 AM UTC link Permalink

🍎 Japanese N-grams

æ—„æœŹèȘž N-gram Searches of the Tatoeba Corpus - Top 4,000
http://tatoeba.ueuo.com/jpn_ngrams-1.html


The following is another set of longer N-grams that are "sentence endings."

æ—„æœŹèȘž ・ N-grams ending with a sentence boundary
http://tatoeba.ueuo.com/jpn_7-g...e_endings.html

These are pages I put together in 2016 based on research done by Susumu Yata.
The "advanced search" links will return whatever is currently in the tatoeba.org database.

CK CK February 13, 2023, edited February 13, 2023 February 13, 2023 at 7:24:58 AM UTC, edited February 13, 2023 at 7:26:19 AM UTC link Permalink

🍎 We now have over 200 Swedish sentences with audio.

The list, showing only English translations
https://tatoeba.org/en/sentence...how/171175/eng

The list, showing all translations
https://tatoeba.org/en/sentence...how/171175/und

ull joined our project on February 8, 2023.

CK CK February 12, 2023 February 12, 2023 at 4:20:48 PM UTC link Permalink

> 11,111,107 sentences

16:20 UTC
February 12, 2023

Soon, we'll hit 11,111,111.

An English-speaking Japanese dog might say the following. 😊

ワン、 ワン、
ワン、 ワン、 ワン、
ワン、 ワン、 ワン 。

Pfirsichbaeumchen Pfirsichbaeumchen February 12, 2023 February 12, 2023 at 6:55:48 PM UTC link Permalink

That's funny. 😊😊😊

small_snow small_snow February 12, 2023 February 12, 2023 at 9:04:53 PM UTC link Permalink

đŸ€­

iopq iopq January 29, 2023 January 29, 2023 at 1:04:31 PM UTC link Permalink

Is there a list of sentences by difficulty? Let's say I wanted to order the 500 easiest sentences in Korean; how would I do this?

AlanF_US AlanF_US January 29, 2023, edited January 29, 2023 January 29, 2023 at 3:30:16 PM UTC, edited January 29, 2023 at 3:32:47 PM UTC link Permalink

The Tatoeba collection in any language is not fundamentally sorted. You can search through it with a number of criteria (length, creation date), but none of them pertain to difficulty. Assuming a Tatoeba member could come up with a workable measure of difficulty, they could choose to add tags to particular sentences indicating their perceived difficulty, or they could add a list ("500 easiest sentences in Korean") and assign sentences to it. But they couldn't impose a sorting order, such as easiest to hardest, on a set of sentences.

However, Tatoeba makes its sentences freely available, and I know of at least one site, clozemaster.com, that takes these sentences and groups them into categories by frequency of the least common word within a sentence ("100 Most Common", "500 Most Common", and so on). The rareness of the words in a sentence is not the only measure of its difficulty, but it's probably the easiest measure to calculate in that regard.
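
The "frequency of the least common word" measure can be sketched as follows. This is only an illustration of the idea, not Clozemaster's actual method: the function names are hypothetical, word frequencies are counted from the sentence set itself rather than from an external frequency list, and words are split on whitespace, which is a rough approximation for a language like Korean.

```python
from collections import Counter


def frequency_ranks(sentences):
    """Map each word to its frequency rank (1 = most common)."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return {w: rank for rank, (w, _) in enumerate(counts.most_common(), start=1)}


def rarest_word_rank(sentence, ranks):
    """Rank of the least common word in the sentence (higher = rarer)."""
    return max((ranks[w] for w in sentence.lower().split()), default=0)


def sort_by_rarest_word(sentences):
    """Order sentences so those built only from common words come first."""
    ranks = frequency_ranks(sentences)
    return sorted(sentences, key=lambda s: rarest_word_rank(s, ranks))


sample = ["i see tom", "i see", "i see a polytope", "i"]
ordered = sort_by_rarest_word(sample)
```

Cutting the ordered list at fixed ranks (100, 500, and so on) would give groupings in the spirit of "100 Most Common" and "500 Most Common".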

iopq iopq January 30, 2023 January 30, 2023 at 8:01:21 AM UTC link Permalink

There are two ways to grade difficulty: by school-grade reading level or by language-exam level. For example, in Korean, if you have passed TOPIK level 3, should you be expected to know this sentence or not?

On another note, there are too many lists. I can't possibly tell whether someone has already created a list I'm interested in; there are thousands of them.

Cabo Cabo January 30, 2023, edited January 30, 2023 January 30, 2023 at 2:51:13 PM UTC, edited January 30, 2023 at 2:51:43 PM UTC link Permalink

Clozemaster sorted the sentences in one particular way, using word frequency lists.
You can do the same by writing some code, or by asking someone to do the work for you.
In this corpus you can't really sort things in the way that best suits your criteria.

CK CK January 30, 2023, edited January 30, 2023 January 30, 2023 at 5:10:50 PM UTC, edited January 30, 2023 at 5:22:27 PM UTC link Permalink

◌ For English sentences sorted by vocabulary levels, you can try these lists.

CK's OGTE-Level Lists
http://goo.gl/BnPz6h

Created in 2017.

AlanF_US AlanF_US January 30, 2023 January 30, 2023 at 7:21:27 PM UTC link Permalink

Reading between the lines:

CK probably used the Online Graded Text Editor (OGTE), a tool that assigns English text a level based on the vocabulary it contains, in order to classify English sentences and assign them to lists that he produced on Tatoeba. Naturally, this doesn't suit your needs directly, since you want Korean text, not English. However, you could presumably look for sentences within these lists that contain Korean translations, assuming that the classification of vocabulary difficulty for the Korean sentences will match those for the English sentences. Alternatively, if you can find such a classification tool for Korean, know how to program, and have a lot of time and motivation, you can do the same kind of thing for Korean that CK did for English. That's a big "if".

As for looking through the lists on Tatoeba, it's true that there are a lot, but you can search through the titles of the lists. I did a search for "TOPIK" and for "Korean", but didn't find anything useful.

If you want to use Tatoeba-sourced sentences, it looks like Clozemaster is probably your best alternative.

CK CK January 30, 2023 January 30, 2023 at 11:08:38 PM UTC link Permalink

> CK probably used the Online Graded Text Editor (OGTE), a tool that assigns English text a level based on the vocabulary

That's right, as the 2nd sentence on that page says, ...

These lists were created based only on vocabulary, using er-central.com/ogte, so some of the sentences will use grammar and idioms that are above the level of the lists.
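
To illustrate what "based only on vocabulary" can mean, here is a rough Python sketch. It does not reproduce OGTE or its word lists. It assumes a sentence is graded by the highest level among its words, and the `WORD_LEVELS` table, the `UNKNOWN_LEVEL` value, and the `sentence_level` function are all invented for this example.

```python
# Hypothetical word-to-level table; a real one would come from a graded
# vocabulary list. These entries are made up for illustration.
WORD_LEVELS = {"i": 1, "like": 1, "cats": 1, "the": 1, "committee": 3, "adjourned": 5}
UNKNOWN_LEVEL = 99  # assumption: words missing from the table count as hardest

def sentence_level(sentence, word_levels=WORD_LEVELS):
    """Return the highest vocabulary level among the words of a sentence."""
    # Strip common punctuation and lowercase each whitespace-separated word.
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    levels = [word_levels.get(w, UNKNOWN_LEVEL) for w in words if w]
    return max(levels, default=0)

easy = sentence_level("I like cats.")
hard = sentence_level("The committee adjourned.")
print(easy, hard)
```

A grader like this looks at words one at a time, which is why grammar and idioms above the level of a list can slip through.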

Yorwba Yorwba January 30, 2023 January 30, 2023 at 9:06:54 PM UTC link Permalink

Since I already had a list of Korean sentences sorted by word frequency, I uploaded the first 500: https://tatoeba.org/en/sentence...&direction=asc

Each sentence contains at least one word that doesn't appear in any of the 499 other sentences, so they're definitely not the 500 easiest ones (which likely contain a lot of repetition) but maybe you'll find the list useful nonetheless.
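
The property described above (every sentence has at least one word found in no other sentence) can be checked with a short Python sketch. This is not the script used to build the list. It assumes whitespace tokenization, and the function name `has_unique_word` is made up for this example.

```python
from collections import Counter

def has_unique_word(sentences):
    """For each sentence, report whether it has a word found in no other sentence."""
    # Whitespace tokenization is an assumption; Korean text would likely need
    # a morphological analyzer to separate words from their particles.
    word_sets = [set(s.lower().split()) for s in sentences]
    # Count how many sentences each word occurs in.
    doc_freq = Counter(w for words in word_sets for w in words)
    return [any(doc_freq[w] == 1 for w in words) for words in word_sets]

flags = has_unique_word(["tom ate rice", "mary ate bread", "tom ate"])
print(flags)
```

A list where every flag is true covers a lot of vocabulary per sentence, which is also why it cannot be the set of easiest sentences.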

Thanuir Thanuir January 31, 2023 January 31, 2023 at 9:05:05 AM UTC link Permalink

On Anki there is a deck containing sentences fetched from Tatoeba (without attribution) and sorted: https://ankiweb.net/shared/info/241481292

The same person has made several similar decks and doesn't mention Tatoeba, but the huge number of Tom and Mary sentences gives away the source. https://frequencylists.blogspot.com/

mollydot mollydot February 3, 2023 February 3, 2023 at 3:32:49 PM UTC link Permalink

I also believe she is using machine translation. I queried a few short Finnish ones with native Finns, and the English translations were bad.

Thanuir Thanuir February 3, 2023 February 3, 2023 at 5:11:10 PM UTC link Permalink

Maybe. However, I have found a few typos in Danish sentences through a deck like that.

I suspect it has more to do with the fact that Tatoeba isn't always high quality, especially for older sentences and smaller languages. But maybe there are machine translations in there too; I don't know.

ssvb ssvb February 11, 2023 February 11, 2023 at 1:07:43 AM UTC link Permalink

> I also believe she is using machine translation. I queried a few short Finnish ones with native Finns, and the English translations were bad.

Are these bad sentences also present in the tatoeba database? If that's the case, then is anyone doing anything to remove or fix them?

Thanuir Thanuir February 11, 2023 February 11, 2023 at 8:11:35 AM UTC link Permalink

https://tatoeba.org/es/tags/sho...th_tag/561/fin - Finnish sentences that should be edited. Some have been on the list for a long time. Maybe I should request editing rights for them myself, since nobody else seems to be fixing them.

Of course, I don't know whether the sentences mentioned are marked with the tag.

CK CK January 23, 2023 January 23, 2023 at 7:22:19 AM UTC link Permalink

🍎 A list with over 2,000 Japanese sentences that don't have kanji

https://tatoeba.org/en/sentences_lists/show/170911

All sentences on this list are owned by native Japanese speakers.
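
For anyone curious how such a filter might work, here is a minimal Python sketch. It is not the method used to build this list. It assumes "kanji" means characters in the main CJK ideograph blocks plus the iteration mark 々, and the names `KANJI` and `is_kanji_free` are made up for this example.

```python
import re

# CJK Unified Ideographs, Extension A, and compatibility ideographs, plus the
# iteration mark. Rarer extension blocks are left out of this sketch.
KANJI = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff\u3005]")

def is_kanji_free(sentence):
    """Return True if the sentence contains no kanji characters."""
    return KANJI.search(sentence) is None

sentences = ["バカじゃないの？", "ありがとう。", "日本語を勉強します。"]
kanji_free = [s for s in sentences if is_kanji_free(s)]
print(kanji_free)
```

Hiragana, katakana, and Japanese punctuation fall outside these ranges, so sentences written only in kana pass the check.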

sacredceltic sacredceltic January 24, 2023 January 24, 2023 at 10:26:56 AM UTC link Permalink

Maybe kanji are out of fashion nowadays with younger natives…

Objectivesea Objectivesea January 28, 2023 January 28, 2023 at 8:33:22 PM UTC link Permalink

With respect, I beg to differ, @sacredceltic. It is true that after World War II the Japanese Ministry of Education curtailed the use of kanji to some degree, but a Japanese high school graduate will have learnt the 2,136 jōyō kanji and will generally use these appropriately.

https://en.wikipedia.org/wiki/List_of_jōyō_kanji

I do not speak Japanese, but I picked one sentence at random from @CK’s list (https://tatoeba.org/en/sentence...s/show/170911) and entered it into Google Translate:

#11011895 — posted by @small_snow
ăƒă‚«ă˜ă‚ƒăȘă„ăźïŒŸ

Google translated it as « Êtes-vous stupide ? » Reversing the direction of translation — that is, going from French to generate Japanese, reproduced the original sentence, written « ăƒă‚«ă˜ă‚ƒăȘă„ăźïŒŸ» or, in romaji, « Bakajanaino? »

While the majority of sentences in Japanese will likely include one or more kanji, I think that @CK may just have intended to produce an interesting list of sentences that correctly use no kanji *because no kanji are needed* to express the words constituting those sentences, currently numbering 2,351.

Rather than somehow labelling these 2,351 sentences as being informal or less literary Japanese, I think the intent may just have been to help beginning students of Japanese, who will have mastered the hiragana syllabary but who have not yet learnt many of the jōyō kanji, a process which takes many years in a typical Japanese person's education.

Kind regards,
Erik (Objectivesea)

CK CK January 29, 2023, edited January 29, 2023 January 29, 2023 at 3:15:01 AM UTC, edited January 29, 2023 at 3:20:49 AM UTC link Permalink

> ... I think that @CK may just have intended to produce an interesting list of sentences that correctly use no kanji *because no kanji are needed* to express the words constituting those sentences ...

Yes, that's right.

You can see details here with lists that also introduce one new kanji per list.

https://tatoeba.org/en/user/profile/CJ

maaster maaster January 28, 2023 January 28, 2023 at 7:28:34 AM UTC link Permalink

Is it only for me that Tatoeba has so often been out of order in the last two weeks?

DJ_Saidez DJ_Saidez January 28, 2023 January 28, 2023 at 7:34:01 AM UTC link Permalink

No, I've had the same problem.

DJ_Saidez DJ_Saidez January 29, 2023 January 29, 2023 at 5:21:16 AM UTC link Permalink

From what I've heard it's to prevent a server overload during periods of high activity. Until recently that almost never happened to me though.

sundown sundown January 28, 2023 January 28, 2023 at 10:05:58 PM UTC link Permalink

I've hardly been able to load a page here for the past three hours.

LanguageExpert LanguageExpert January 28, 2023 January 28, 2023 at 10:07:46 PM UTC link Permalink

No, it's happening to me too. I'm glad to know it's not just me. I was wondering if it was just me.

Yorwba Yorwba January 29, 2023 January 29, 2023 at 11:53:47 AM UTC link Permalink

An overeager crawler was making a bunch of expensive requests that kept overloading the server. I've now changed our configuration to block that crawler; hopefully that will improve the situation.

If you keep getting the "Tatoeba is currently unavailable." message, let us know.

AlanF_US AlanF_US January 29, 2023 January 29, 2023 at 5:57:44 PM UTC link Permalink

Thanks, Yorwba!

DJ_Saidez DJ_Saidez January 29, 2023 January 29, 2023 at 6:32:31 PM UTC link Permalink

Thanks!

small_snow small_snow January 29, 2023 January 29, 2023 at 10:11:34 PM UTC link Permalink

Thank you.

LanguageExpert LanguageExpert January 29, 2023 January 29, 2023 at 10:59:02 PM UTC link Permalink

Thank you so much!

Pfirsichbaeumchen Pfirsichbaeumchen January 30, 2023 January 30, 2023 at 2:44:24 AM UTC link Permalink

Thank you very much! 😊

January 29, 2023, edited 7 days ago January 29, 2023 at 6:09:09 AM UTC, edited March 22, 2023 at 12:13:30 AM UTC link Permalink

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

January 29, 2023 January 29, 2023 at 7:29:29 AM UTC link Permalink

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.