
gillux, December 25, 2019 at 8:49:45 PM UTC

I wish we had better ways to categorize the corpus and characterize sentences. By this, I mean having more metadata on sentences, like formal/informal, region, grammar, theme… This is currently achieved mostly through tags and lists, but they are still inadequate in many ways. We have had various discussions about how to improve things, like here[1], but no clear consensus or decision has come out of them yet.

I’d like to share my thoughts. So why is good characterization and categorization so important?

* Currently, the corpus is often presented as a giant flat list of sentences. Such a giant list cannot (and should not) be used for browsing the corpus because it’s too many sentences at once. Chances are I’m gonna get tired after a few pages. Chances are I’m actually only interested in *some* of them.

* It allows people to *explore* the corpus, to go into different "areas", so they can *discover* more sentences.

* It allows members to make their own sentences better known to those in need of them.

* It can serve as an incentive for more diversity, because it will show when a corpus consists of a majority of sentences of the same kind.

* If sentences of the same kind are displayed together, the corpus will look more organized so Tatoeba will be more attractive.

* It will put more focus on quality instead of quantity. Currently, I feel like Tatoeba is a lot about numbers. Number of sentences in a language, number of sentences belonging to a member, number of sentences having audio, number of contributions… All those numbers are potential incentives for producing low-quality sentences and competing for "who’s got the biggest". I wish we valued the "what" more than the "how many".

I dream of a Tatoeba where I click some Browse link and am presented with a palette of well-organized and inspiring metadata like categories, themes, grammar, etc. I feel like "wow, lots of interesting things in there" and it makes me want to see more.


Pandaa, December 25, 2019 at 9:01:07 PM UTC

A beautiful dream.
Today I call it a dream, but I would be glad if it came true someday.

CK, December 26, 2019 at 12:39:20 AM UTC, edited December 26, 2019 at 3:06:45 AM UTC

Here are 3 other somewhat related issues:
Add categories for tags

The link showing a sample at the top of this page doesn't work, so I reuploaded it to

[Edit] I found a similar page based on newer data created in March of this year.
Possibility to have different link type
Allow Searching Through Multiple Lists

Aiji, December 26, 2019 at 2:22:09 AM UTC

Here's some food for thought:

Method 1: Use tags.
Nothing really new here: this requires a better system than the current one. Contributed by users, requires a translation system, requires strict rules, requires a moderation team.

Method 2: Try using some Machine Learning library to extract categorization.
Straightforward, contributed by admins, can simply be executed on a weekly basis, prone to error because sentences are short on average and contain few words that carry meaning. Can be used to auto-generate categories / tags and assign sentences to them.
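To make Method 2 a bit more concrete, here is a minimal sketch of one possible approach: a tiny naive Bayes text classifier written with the standard library only, rather than any particular Machine Learning library. Everything in it is an assumption for illustration: the function names `train` and `classify`, the hand-labeled examples, and the category names. It also shows the weakness mentioned above, since a short sentence gives the classifier very few meaningful words to work with.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per category from (sentence, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for sentence, category in examples:
        doc_counts[category] += 1
        word_counts[category].update(sentence.lower().split())
    return word_counts, doc_counts


def classify(sentence, word_counts, doc_counts):
    """Return the most likely category for a sentence (naive Bayes)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        counts = word_counts[category]
        total_words = sum(counts.values())
        # Log prior: share of training sentences in this category.
        score = math.log(doc_counts[category] / total_docs)
        for word in sentence.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical hand-labeled training sentences (not real corpus data).
EXAMPLES = [
    ("the cat chased the mouse", "animals"),
    ("my dog barks at the cat", "animals"),
    ("it will rain tomorrow", "weather"),
    ("the sun is shining today", "weather"),
]
MODEL = train(EXAMPLES)
```

With this toy model, `classify("a dog and a cat", *MODEL)` picks "animals", but a sentence with no known words falls back to an arbitrary category, which is why some moderation of the generated tags would still be needed.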

That's for categorizing. The second step is about displaying. This requires a lot of UX thinking (by the way, still no info on what the "Relevant" sorting option means :)). There's no need to think about that before we have categorization, but the first two ideas that come to mind are:
1. Create a whole "Tatoeba by category" sub-environment: a category cloud leading into sentences, category-to-category links (using some kind of distance between categories to link them), an environment for categorizing under good conditions (like we will eventually have for linking), and an environment for moderating the categorization under good conditions.
2. Merging this kind of environment into the current Tatoeba would probably require some major refurbishing.
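One way to read "some kind of distance between categories" is the Jaccard distance between the sets of sentences that two categories contain. The sketch below assumes that interpretation; the choice of metric, the function names, and the sample category data are all hypothetical.

```python
def jaccard_distance(a, b):
    """Distance in [0, 1]: 0 for identical sets, 1 for disjoint sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical mapping from category name to sentence ids.
CATEGORIES = {
    "animals": {1, 2, 3, 4},
    "pets": {2, 3, 5},
    "weather": {6, 7},
}


def nearest_categories(name, categories, limit=2):
    """List the other categories, closest first, to link from `name`."""
    distances = [
        (jaccard_distance(categories[name], ids), other)
        for other, ids in categories.items()
        if other != name
    ]
    return [other for _, other in sorted(distances)[:limit]]
```

Here `nearest_categories("animals", CATEGORIES)` puts "pets" ahead of "weather", because "animals" and "pets" share sentences while "animals" and "weather" share none.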

We're talking about some Tatoeba 3.0 stuff here, but since you're talking about a Christmas dream... :)

CK, December 26, 2019 at 3:35:49 AM UTC, edited December 26, 2019 at 3:44:22 AM UTC

After thinking about this a bit, I think a lot of what you want to do can be done with tags, combined with lists, using the advanced search.

* Many things can already be done to find or explore sentences.

-- For example, ...

--- You can limit search results to one or more tags.

--- You can limit search results to sentences from one list.

--- You can combine both of the above.

--- You can further limit results by whether or not they have audio.

--- You can further limit results to sentences owned by one particular member.

--- You can find many sentences mentioning animals with a search like the following.


Other similar searches are shown on

* Perhaps focusing on improving the following would accomplish many of the other stated desires.

-- create ways to more easily tag sentences

-- encourage members to add tags of certain types (tenses, functions, situations, ...)

-- add other functions to the advanced search

-- create pre-determined links to advanced searches to help members explore.
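To illustrate how the filters listed above (tags, one list, audio, owner) combine, here is a small in-memory sketch. It is not how the advanced search is implemented. The `Sentence` class, the `search` function, and the sample data are hypothetical, and the sketch assumes that a sentence must carry every requested tag.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    owner: str
    tags: set = field(default_factory=set)
    lists: set = field(default_factory=set)
    has_audio: bool = False


def search(sentences, tags=(), in_list=None, has_audio=None, owner=None):
    """Return sentences matching every filter that is set (None means any)."""
    required_tags = set(tags)
    results = []
    for s in sentences:
        if not required_tags <= s.tags:  # assumption: all tags must match
            continue
        if in_list is not None and in_list not in s.lists:
            continue
        if has_audio is not None and s.has_audio != has_audio:
            continue
        if owner is not None and s.owner != owner:
            continue
        results.append(s)
    return results


# Hypothetical sample data.
CORPUS = [
    Sentence("The cat is asleep.", "alice", {"animals", "present tense"}, {"beginner"}, True),
    Sentence("It rained all day.", "bob", {"weather", "past tense"}, {"beginner"}, False),
    Sentence("Dogs bark at night.", "alice", {"animals"}, set(), False),
]
```

A "pre-determined link" for exploring could then simply be a saved set of arguments, for example `search(CORPUS, tags=["animals"], has_audio=True)`.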

Pandaa, December 26, 2019 at 8:26:37 AM UTC, edited December 26, 2019 at 12:07:16 PM UTC

It would be good if ordinary rank-and-file contributors could also apply tags.
I have 6000 sentences without tags, and for those that do have tags, I had to request most of them.
Plus, if we want everything to be categorized eventually, we should tie the adding of tags to the writing of sentences. For example, a sentence could only be translated once it has some tag on it, or a contributor could only write another sentence after adding a tag to the previous one.

I repeat: it would be good if ordinary rank-and-file contributors could also apply tags.

It makes no difference that I now have enough sentences to apply for 'advanced contributor' status so that I can then apply tags. If I had 500 sentences, the status usually would not be granted yet, and I would likewise be stuck with 500 sentences, all of them untagged.

AlanF_US, December 26, 2019 at 1:02:32 PM UTC, edited December 26, 2019 at 1:03:44 PM UTC

> by the way, still no info on what the "Relevant" sorting option means

Several weeks ago, I added this information to the "Advanced Search" page on the wiki:

Relevance: this option, which is the default, favors sentences that contain an exact match for the search query, followed by sentences with the fewest words (see description)
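Taken literally, that wiki wording could be sketched as a two-part sort key: exact matches first, then fewer words first. This is only an illustration of the sentence above, not the actual search engine's ranking. The helper names are hypothetical, and "exact match" is assumed here to mean that the query words appear in the sentence as a contiguous sequence.

```python
import re


def words(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def contains_exact(sentence, query):
    """True if the query words appear contiguously in the sentence."""
    s, q = words(sentence), words(query)
    if not q:
        return False
    return any(s[i:i + len(q)] == q for i in range(len(s) - len(q) + 1))


def rank(sentences, query):
    """Sort with exact matches first, then by ascending word count."""
    return sorted(
        sentences,
        key=lambda s: (not contains_exact(s, query), len(words(s))),
    )
```

For the query "red house", this ordering would place "The red house is old." before a longer sentence containing "red house", and both before "Red is a color.", which has no exact match.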

The word "description" in that text links here:

Thanuir, December 26, 2019 at 7:12:15 PM UTC

A user interface for tags would help with organizing.

1. Synonyms would solve the translation question, alternative spellings, and misspelled tags.

2. The ability to classify tags (declaring them subcategories of one another and creating new categories that would not necessarily have any sentences as members) would make it possible to display topic-specific sentence collections.
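The two points above can be sketched with two small lookup tables: one mapping synonyms (translations, alternative spellings, misspellings) to a canonical tag, and one mapping each tag to its parent category. All names and data below are hypothetical. Note that "nature" has no sentences of its own, yet it still yields a collection through its subcategories.

```python
# Hypothetical data: synonym -> canonical tag, and child tag -> parent tag.
SYNONYMS = {"animal": "animals", "animaux": "animals", "cats": "cat"}
PARENTS = {"cat": "animals", "dog": "animals", "animals": "nature"}
SENTENCE_TAGS = {1: {"cats"}, 2: {"dog"}, 3: {"animaux"}, 4: {"weather"}}


def canonical(tag):
    """Resolve a synonym or spelling variant to its canonical tag."""
    tag = tag.lower()
    return SYNONYMS.get(tag, tag)


def is_under(tag, category):
    """True if the tag equals the category or is one of its descendants."""
    tag = canonical(tag)
    seen = set()
    while tag is not None and tag not in seen:  # guard against cycles
        if tag == category:
            return True
        seen.add(tag)
        tag = PARENTS.get(tag)
    return False


def sentences_in(category):
    """Ids of sentences tagged with the category or any of its subcategories."""
    category = canonical(category)
    return sorted(
        sentence_id
        for sentence_id, tags in SENTENCE_TAGS.items()
        if any(is_under(tag, category) for tag in tags)
    )
```

For example, `sentences_in("nature")` collects sentences 1, 2 and 3 even though none of them is tagged "nature" directly.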