Wall (5349 threads)

gillux
2019-04-06 17:59 - 2019-04-06 18:24
I’ve been playing around with our default search ranking algorithm. I insist on the "default" part because that’s what the vast majority of visitors use. I also focus on searches that do not use double quotes or any special trick. Just plain words. Again because that’s what the vast majority of visitors use.

Our current way of ranking results is pretty basic: it searches for sentences that include all the words (possibly stemmed) and sorts them by total number of words in the sentence.

A problem with this approach is that the order of the words is ignored. The top result of searching for "you go there" is "There you go!" because it’s a shorter sentence than "You may go there."

Ignoring word order is especially catastrophic on languages without word boundaries, like Chinese, because the searched characters are randomly reordered into something totally unrelated. For example, the results for "可不可" in Chinese are cluttered by irrelevant "不可something". Same for kana words in Japanese.

In order to address this problem, I tentatively tweaked the default ranking algorithm on https://dev.tatoeba.org/ into something that prioritizes, in the following order:

1. sentences that contain an exact match (as if searching for ="you go there")
2. sentences having the "longest common subsequence" (LCS, [1])
3. sentences having the fewest words

[1] https://docs.manticoresearch.co...anking-factors
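
The three criteria above can be sketched as a sort key. This is only a plain-Python illustration, not the Manticore configuration used on dev.tatoeba.org: the helper names (`tokenize`, `contains_phrase`, `lcs_length`, `rank`) are made up for this example, stemming is left out, and Manticore's own LCS factor may be computed differently from the textbook longest common subsequence used here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (no stemming)."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(words, query):
    """True if the query tokens appear contiguously, in order, in words."""
    n = len(query)
    return any(words[i:i + n] == query for i in range(len(words) - n + 1))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rank(sentences, query):
    """Sort by: exact phrase match first, then longer LCS, then fewer words."""
    q = tokenize(query)

    def key(sentence):
        w = tokenize(sentence)
        return (not contains_phrase(w, q), -lcs_length(w, q), len(w))

    return sorted(sentences, key=key)
```

With the query "you go there", this puts "You go there every day." first (exact match), then "You may go there." (LCS of 3 words), then "There you go!" (LCS of 2 words).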

However, I don’t know if this new ranking suits everyone out there. What do you think?

You can compare the search results on https://tatoeba.org/ (old ranking) and https://dev.tatoeba.org/ (new ranking). You can run a search on tatoeba.org, and then add "dev." in the URL bar and press alt+return to open a new tab.
AlanF_US
2019-04-06 22:10 - 2019-04-06 22:11
I do prefer a ranking that favors exact matches over stemmed matches. Longest common subsequence also sounds good. But sentences having the least number of words are often not the ones I want to see most. I prefer slightly longer ones that give me more context. For that reason, I always choose random ordering. It doesn't always put the sentences that I want at the very top, but at least I have a good chance of finding them without having to go through pages and pages of very short sentences. Also, providing a mix of sentences that is random with regard to sentence length lets people see more diversity. I think that's a good thing.
CK
2019-04-07 02:34 - 2019-04-07 02:37
Maybe you wouldn't want this for the default search, but I wonder if it would be possible to add a "minimum word" option to the advanced search. This might prove useful. For example, members could still opt to sort by length, but start by showing sentences that are over a certain length up to 1,000 results.

More involved, perhaps, but an additional idea might be to have a "maximum length" option, too. This would allow members to have search results displayed randomly between 5 and 12 words in length, for example.
AlanF_US
2019-04-07 13:42
Yes, I can imagine that a "minimum length" and a "maximum length" option would be useful. However, the nice thing about favoring sentences that meet a certain criterion rather than limiting them to that criterion is that if there are not enough sentences that meet it, you will automatically see the other ones without having to remove the criterion and do another search. I imagine that if I set a minimum and/or maximum length, I would often eliminate some of the fallback sentences I'd like to see, and then I'd have to do a follow-up search.

Indeed, I wouldn't want the default search to be optimized for my particular needs, which would be something like "favor sentences from five to ten words in length". However, I worry that optimizing it for choosing the shortest sentences would pessimize it for people like me, whereas leaving out a criterion of length would allow people to see a variety of sentences, short and long.
Thanuir
2019-04-08 14:48
If I could choose and there were no computational costs:

1. An exact match of the sentence.
2. Sentences with exact match of the query as part of it.
3. Sentences with all the exact words, but possibly in different order or with other words between them.
4. Sentences with all the words, but with stemming and the order might be different etc.
5. Sentences with all but one of the words (with stemming and could be in any order).
6. Sentences with all but two of the words (with stemming and could be in any order).
7. And so on.
8. Sentences with even a single searched word (with stemming).

Random order within the categories. (Some of the categories could be sorted into even finer subcategories, but probably not worth it.)

For example search: haluan kalastaa tänään
1. Haluan kalastaa tänään.
2. Minä haluan kalastaa tänään, niin kuin eilenkin.
3. Tänään minä haluan kalastaa. Haluan kuitenkin kalastaa tänään.
4. Haluankin kalastaa tänään. Haluatteko te tänään tai huomenna kalastamaan?
5. Haluatteko elokuviin tänään?
6-8. Karhut kalastavat lohia.

The idea would be to first have the precise phrase and then to have increasingly distantly related phrases, which hopefully would still give some understanding of the involved words.
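
A rough sketch of how these categories could be assigned, under several assumptions: the function names are invented for illustration, and the `stem` argument stands in for a real stemmer (the four-letter prefix used in the usage note is a crude toy, not how Tatoeba stems Finnish). Categories 5 and up are numbered as 4 plus the number of missing words.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def category(sentence, query, stem=lambda word: word):
    """Return the category number of a sentence, or None if nothing matches.

    1: the sentence is exactly the query
    2: the query appears as a contiguous part of the sentence
    3: all exact words are present, in any order
    4: all words are present after stemming
    4 + k: k of the query words are missing, even after stemming
    """
    s, q = tokenize(sentence), tokenize(query)
    if s == q:
        return 1
    n = len(q)
    if any(s[i:i + n] == q for i in range(len(s) - n + 1)):
        return 2
    if set(q) <= set(s):
        return 3
    stems = {stem(word) for word in s}
    missing = sum(1 for word in q if stem(word) not in stems)
    if missing == len(q):
        return None  # not even a single searched word is present
    return 4 + missing
```

With `stem=lambda w: w[:4]` as a toy stemmer and the query "haluan kalastaa tänään", "Haluankin kalastaa tänään." lands in category 4, "Haluatteko elokuviin tänään?" in category 5, and "Karhut kalastavat lohia." in category 6. Results would then be sorted by category, with whatever order is chosen inside each category.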
gillux
12 days ago - 12 days ago
@Thanuir @AlanF_US @CK

Thank you for your feedback. I agree about the relative uselessness of having very short sentences shown first. The idea of randomizing the results within a category, like Thanuir said, is appealing (given the order is deterministic), but I’m afraid it could be a little bit confusing. I temporarily set up https://dev.tatoeba.org/ like that; please let me know what you think.

Or, if we are to rank using the number of words, what would be the ideal number? Not too long and not too short. It depends on the language of course. Here are some stats about the average number of words per sentence in every language on Tatoeba: https://gist.github.com/jiru/81...5917dc18325fc2

I wonder if we could use these numbers to boost the ranking of sentences having a number of words close to the average, with a formula like rank = –abs(average – words)
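
As a minimal sketch of that formula (the function names and the average of 5.0 in the usage note are assumptions for illustration, not values from the stats above):

```python
def length_rank(word_count, average):
    """rank = -abs(average - words): 0 is best, more negative is worse."""
    return -abs(average - word_count)


def sort_by_closeness_to_average(sentences, average):
    """Order sentences so that those closest to the average length come first."""
    return sorted(
        sentences,
        key=lambda s: length_rank(len(s.split()), average),
        reverse=True,
    )
```

With an assumed average of 5.0 words, a 6-word sentence (rank -1) comes before a 3-word one (rank -2), which comes before a 1-word one (rank -4) and a 10-word one (rank -5).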
AlanF_US
11 days ago
> Or, if we are to rank using the number of words, what would be the ideal number?

When I look at the lists of sentences that I've compiled for my own learning, the length of the sentences does tend to be pretty close to the average for the language (5+ for both Hebrew and Russian).
Thanuir
11 days ago
One issue with randomness is that reproducing problems or strange behaviour would be more difficult. Maybe displaying the sentences newest first would be an alternative that adds some amount of randomness while retaining reproducibility?

...

Using the average number of words, as suggested by @gillux, would create an alternating pattern that @CK suggests, but without the initial emphasis on slightly longer sentences. It would not be terribly difficult to create a function that would have the type of behaviour that @CK wants when the absolute value in gillux's formula was replaced by it. The function would have to be a piecewise defined function with three linear pieces, or a more complicated one. I do not know how computationally expensive it is to deal with a piecewise defined function.
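
One possible shape for such a function, as a sketch only: the bounds and slopes below are invented parameters, and nothing here is claimed to be exactly what @CK has in mind. It penalizes short sentences steeply, leaves a middle band unpenalized, and penalizes long sentences more gently.

```python
def piecewise_penalty(words, low=5, high=10, short_slope=2.0, long_slope=0.5):
    """Three linear pieces: steep below `low`, flat in between, gentle above `high`.

    All four parameters are hypothetical and would need tuning per language.
    """
    if words < low:
        return short_slope * (low - words)
    if words > high:
        return long_slope * (words - high)
    return 0.0


def length_rank(words, **params):
    """Drop-in replacement for -abs(average - words)."""
    return -piecewise_penalty(words, **params)
```

Computationally this is a couple of comparisons and one multiplication per sentence, so in plain code it costs about the same as the absolute value. Whether the search engine's ranking expressions can express it as cheaply is a separate question.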

...

One thing I believe would be good would be to show sentences that contain most of the right words, but not all. For example, I tried searching for the English idiom "to talk through one's hat" using two queries: "talk through one's hat" and "He is talking through his hat." with no results. Presumably the sentences are not there. But if the sentence with "Tom" rather than "he" were there, or the sentences with she/her, or with you or I, I would not find them.

The use case for searching for the idiom might be that I am trying to understand it or that I would like to translate it. Both would be helped by the search finding sentences which match only some of the words in the search query, but presenting them after the sentences with all the words.
gillux
11 days ago
Yes, we definitely need randomness to be reproducible (and unpredictable, to avoid "rank boost threats") if this is the direction we’re taking. If I give you a search result URL, I expect that you see the same results as me, and that it stays more or less the same for a little time. I believe it is technically possible to produce a random, deterministic and unpredictable order.
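
One way this could work, sketched under assumptions: each sentence gets a sort key computed as a keyed hash (HMAC) of the query and the sentence ID, using a secret known only to the server. The function names and the idea of a server-side `secret` are illustrative, not something that exists in the Tatoeba code.

```python
import hashlib
import hmac


def shuffle_key(secret, query, sentence_id):
    """Keyed hash of (query, sentence id); hard to predict without the secret."""
    message = f"{query}\x00{sentence_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).digest()


def stable_random_order(sentence_ids, query, secret):
    """Same query and secret always give the same, random-looking order."""
    return sorted(sentence_ids, key=lambda sid: shuffle_key(secret, query, sid))
```

Two people opening the same search URL would see the same order, and a contributor cannot work out in advance where a new sentence will land. Rotating the secret from time to time would reshuffle everything, which would match "stays more or less the same for a little time".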

I am concerned that boosting sentences having a number of words close to average is going to be detrimental to diversity, because it’s an incentive for contributors to produce standard-sized sentences. Isn’t there a risk of uniformization of the corpus? Or do we actually want more example sentences that are "efficient" and "standard"? I’d like to know @TRANG's opinion on that matter.

The idea of showing newest sentences first (after sorting by exact matches and LCS) is interesting. It surely adds some randomness, but it’s also an incentive to produce new sentences, and it gives more exposure to new or active users.

Since there is no consensus on an alternative to sorting by number of words, for the moment I’m going to change the default search ranking the way I described on the first post of this thread. We may further improve it later on.
Aiji
9 days ago
I did not completely follow the conversation, but I'd like to answer the following

> I am concerned that boosting sentences having a number of words close to average is going to be detrimental to diversity, because it’s an incentive for contributors to produce standard-sized sentences. Isn’t there a risk of uniformization of the corpus? Or do we actually want more example sentences that are "efficient" and "standard"? I’d like to know @TRANG's opinion on that matter.

From my personal experience, that would definitely uniformize the corpus in its shape (not necessarily in its content). As you said, it's not difficult to imagine that favoring sentences with a number of words close to the average of a language would produce a huge amount of "standard-size" contributions.
For Latin languages for example, we would probably have something close to
Subject + Verb + Article + Adjective + Complement
and, still from a personal point of view, that would provide SO many similar and boring sentences to translate that I'm pretty convinced it would hinder my contributions. The most interesting sentences are CLEARLY NOT the ones around the average. Well, of course, you may encounter some nice expressions or interesting words, but the vast majority would be "I borrowed a pen to Tom." instead of more elaborate, interesting sentences.
I'm always translating from the English corpus, and when I do a lot of translating at once, I always end up skipping several sentences because I feel like "AGAIN this sentence?!" I can only imagine my feeling if the search were biased to serve me more similar sentences... (I know English is special, but I guess the problem would be similar for the top 10 languages at least).
TRANG
7 days ago
> I am concerned that boosting sentences having a number of words close to
> average is going to be detrimental to diversity, because it’s an incentive
> for contributors to produce standard-sized sentences.

The main factor for a diverse corpus is to have a diverse group of contributors, in my opinion. Next to that, the search ranking probably has very little impact on the kind of sentences that people create.

If contributors were paid every time their sentences are displayed in a search result, then I guess that would be a high enough incentive to produce sentences based on the ranking. But even then, unless they earn a living out of it, I think they will still naturally produce standard-sized sentences no matter the ranking because it's just easier to produce such sentences.

So no worries about influencing diversity here.

What we have to consider is: what is the default usage of the search that we're trying to cover?

My personal usage:
- I'm trying to figure out how to say something in a foreign language and I'm missing vocabulary or grammar knowledge.
- I saw a new word/phrase in a foreign language and I want to understand its meaning or see examples of how it's used.

For these use cases, shorter sentences are in general easier to analyze. So it makes sense to order by number of words.
But if the sentence is too short, it may be lacking context and may not be as useful as a longer sentence. So prioritizing average-sized sentences could make sense.
But average-sized sentences might not always be the most useful for everyone either. Randomizing the results also makes sense: it simply means we don't want to make assumptions about what size is "best".

Random order sounds appealing actually, but I wouldn't change to that until we gather more specific information about the issues of ordering by number of words.

I looked at the pageviews for the search in Google Analytics for the month of April.
- Pageviews with order=words: 12,990
- Pageviews with order=random: 8,339
- Pageviews with order=created: 1,198
- Pageviews with order=modified: 259
(Total pageviews for /sentences/search: 223,174)

It seems that when given the choice, people choose in majority to order by words.
gillux
6 days ago
Thank you for the numbers, that’s valuable information.

This shows that 90% of the visitors making a search are using the "simple search" (top bar or front page), and 10% the advanced search (advanced search page or "more search criteria" block).
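
The 90%/10% split can be checked from TRANG's numbers, assuming (as this reading does) that pageviews carrying an explicit order= parameter come from the advanced search. The variable names are just for this example.

```python
# Pageviews for April, as reported by TRANG from Google Analytics.
pageviews_by_order = {"words": 12990, "random": 8339, "created": 1198, "modified": 259}
total_search_pageviews = 223174

# Assumption: an explicit order= parameter means the advanced search was used.
with_order = sum(pageviews_by_order.values())
advanced_share = 100 * with_order / total_search_pageviews


def share_within_advanced(order):
    """Percentage of advanced-search pageviews that used the given order."""
    return 100 * pageviews_by_order[order] / with_order


print(f"advanced search: {with_order} pageviews, {advanced_share:.1f}% of all searches")
for name in pageviews_by_order:
    print(f"order={name}: {share_within_advanced(name):.1f}% of advanced searches")
```

That gives 22,786 pageviews with an explicit order, about 10.2% of all search pageviews. Within those, order=words accounts for about 57% and order=random for about 37%.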

> It seems that when given the choice, people choose in majority to order by words.

However it’s not a fair choice because you can’t tell the visitors who made a choice (clicking on the dropdown, examining the choices and choosing) from the ones who didn’t (glancing over the dropdown or not even seeing it, and using the default value). order=words being the default, I believe it is overrepresented.

I find the number for order=random surprisingly high.
AlanF_US
6 days ago
I agree.
Guybrush88
6 days ago
Personally, I use order=random in the advanced search because I generally see more diversity that way (whether I use it for translating sentences or for tagging existing ones), since sorting by number of words tends to present similar patterns (which I use whenever I want to translate or tag the same pattern).
deniko
6 days ago
Same here.

I use the "fewer words first" mode when I use tatoeba to actually find a way to translate something, and the "random" mode when searching for sentences to translate.

If I need exact matches, I still use "fewer words first", but also the syntax like ="gesundheit"
TRANG
yesterday
I've extracted the stats since January 2018 to have a broader view on advanced search usage:
https://docs.google.com/spreads...it?usp=sharing

I would have thought that the order=random was high because it's the default option on the "Translate sentences" page, but it was high even before.

November 2018 is when we changed the default option on the "Translate sentences" from "created" to "modified" to "random" (cf. https://github.com/Tatoeba/tatoeba2/issues/1351). It's actually interesting to see how that influenced the order=created and order=modified.

I'm not sure why the "random" option spiked so much in March...

But in any case, there have been a few months where the advanced search was used more often with order=random than with order=words, even though order=words is the default.
gillux
2 hours ago
Very interesting! Based on this data, I suggest that we ditch order=created and modify the "relevance" algorithm to randomize results within exact matches and LCS matches.
kemushi69
4 days ago
Reproducible randomness is certainly possible from a mathematical/cryptographic point of view. You do need to think carefully about whether it's possible to reverse-engineer personal information from generated pseudo-random (permalink) seeds, though.
CK
10 days ago
Note that the "advanced search" on dev.tatoeba.org doesn't work properly now.

This search should show English sentences with the word "winter" sorted by fewest words.

> https://dev.tatoeba.org/eng/sen...io=&sort=words
gillux
10 days ago
Thanks. I reverted it back to normal.
Ricardo14
4 days ago
I think it has already been pointed out, but maybe it's a good idea to point it out again.

The vocabulary feature is a great tool. It helps us study languages by adding words in languages we don't speak fluently and want to see used in context.

However, looking for https://tatoeba.org/eng/Vocabul...sentences/por, I've found words that don't exist in Portuguese like "decoy", misspelled words like "libro de cabeceira" (it should be "liVro") and also things like "Posso colocar aqui verbos usamos em nossos dia." (I can add here verbs we use every day), "Caso precisem de alguma informação com relação ao Português Brasil posso ajuda-los." (In case you need any information about Brazilian Portuguese I can help you)

http://prntscr.com/nrnhj9

It'd be good if we could correct these requests or even delete some.

https://tatoeba.org/eng/Vocabul..._sentences/eng (English)

Make it up - Помириться (Russian)
communicative language teaching (not a word)
That's what this is all about. - Это то из-за чего это всё. (Russian again)
щзхлзщхлщзх (Can't believe this is in Russian or any other language)

http://prntscr.com/nrnj8o
Thanuir
4 days ago
I agree with the problem of the list being cluttered with typos and other mistakes.

I do think that for example "communicative language teaching" is completely fine on the list, though. One might very well understand all the component words and have example sentences with them without understanding the whole term or having it in any sentence.

With English this is especially bad, since the language uses few compound words; "language teaching" rather than "languageteaching", in marked contrast with many Germanic languages and Finnish, for example.

At least in scientific English there are many terms that refer to a concrete thing, but due to features of the language, they are written as a collection of separate words. "magnetic resonance imaging", for example, or "inverse problem".

There are also cases like the Danish "rejse" (to travel) versus "rejse sig" (reflexive form, to get up e.g. from a sitting position). It should be completely valid to have "rejse sig" as a vocabulary item.
raggione
4 days ago
Full support - I'd like to get my hands on the requests for German, which badly need sorting.
TRANG
3 days ago
Related issue on GitHub: https://github.com/Tatoeba/tatoeba2/issues/1473

To solve this, one possible approach is the one suggested in the GitHub issue: let corpus maintainers edit/delete any vocabulary items. This approach comes with a couple of problems however.

1) What if corpus maintainers edit/delete items that could have been valid?

This problem already shows up here in this thread, as you, Ricardo, said you would have deleted "communicative language teaching", while Thanuir would have kept it.

2) Users don't necessarily add vocabulary to get example sentences.

I can imagine that some users just want to make a list of words/phrases they want to learn and they find it useful to have the translations next to each word/phrase. That's probably why you have those items that contain both English and Russian words. Since the vocabulary feature doesn't allow connecting vocabulary between two languages, users will of course put the two languages into one item if they need to. They're not exactly doing something wrong, they're just adapting the feature to their needs. I feel that deleting or editing their items would not be the proper thing to do.

My suggestion would be to allow contributors to ignore items on the "Sentences wanted" page.
- Ignored items would no longer be displayed on the "Sentences wanted" page.
- Ignored items would be individual, which means that if I ignore an item, it is ignored only for me. If other people don't want to see an item, they would have to ignore it themselves.

In parallel to that, we could split the vocabulary items into two categories:
- A "regular" list where users just create a vocabulary list for themselves.
- A "needs sentences" list where users add vocabulary for which they explicitly want example sentences. Only vocabulary items in this list would appear on the "Sentences wanted" page.
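
To make the two suggestions concrete, here is a small sketch of how the "Sentences wanted" page could be filtered. The `VocabItem` class, its `needs_sentences` flag and the per-user set of ignored IDs are all hypothetical; none of this reflects the existing tatoeba2 schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabItem:
    item_id: int
    lang: str
    text: str
    needs_sentences: bool = False  # False: the item stays on the owner's personal list


def sentences_wanted(items, ignored_ids):
    """Items shown to one contributor on the "Sentences wanted" page.

    `ignored_ids` is that contributor's own ignore list, so ignoring an item
    hides it for them only and leaves it visible to everyone else.
    """
    return [
        item for item in items
        if item.needs_sentences and item.item_id not in ignored_ids
    ]
```

Nothing is edited or deleted here: a personal item simply never reaches the public page, and an ignored item only disappears for the person who ignored it.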

Those are my personal ideas. There is no clear decision yet on how we want to handle the "bad" vocabulary items and if you have better ideas, please do share.
gillux
3 days ago
To me, the root of the problem is that members are unable to put the correct information in the "add vocabulary items" form. And no one can blame them for that. The process of turning an "I want sentences like that" idea into correct values for "language" + "searchable vocabulary item" is indeed difficult, and probably unclear at first. Members don’t exactly know that their vocabulary items are going to be interpreted as search queries, with the number of results displayed, all that listed on a public page. I can see some members are adding whole sentence pairs as vocabulary items, or a semicolon-separated list of items. I can imagine some members assume that what’s under 'my vocabulary items' is only visible to them. I guess some members do not care that much about the flag because they know the language already. After all, when writing down your own vocabulary list, does anyone bother writing the language name?

That is quite a UX challenge in my opinion. I have no great idea for solving it but there is certainly room for improvement. I’m pretty sure most people do not read the bottom-right "Tips" block that says "Add vocabulary that you are learning. If your vocabulary does not exist yet in Tatoeba, other contributors can add sentences for it."

I don’t think a per-member black list would help. I can make out two categories of items people would like to hide: (1) items that are not meant to achieve 10 sentences (because they are garbage, completely invalid or personal) and (2) items that won’t achieve 10 sentences because of incorrect values. As for category 1, everyone wants to hide them from the public list. As for category 2, they are valid requests that anyone can be interested in, it’s just that Tatoeba isn’t smart enough to show the correct number of sentences.

Splitting the vocabulary items into personal vs. wanted sentences is an interesting idea, but in the end I don’t think it will solve the problem of having "forever 0 sentences" items cluttering the list.

Some other ideas:
How about allowing members to edit the language of their own items?
How about allowing corpus maintainers to edit the language of other members’ items?
How about somehow dividing the vocabulary items into two categories: the ones that can be readily used as search queries, and the ones that cannot? For the ones that cannot, instead of having a link to a search, we allow members to directly add sentences that are assumed to match the item and are counted as such.
TRANG
yesterday
> That is quite a UX challenge in my opinion.

Oh yes, and we definitely won't be solving everything at once...

> How about allowing members to edit the language of their own items?

Yes, we should. There's an issue about this: https://github.com/Tatoeba/tatoeba2/issues/1238

> How about allowing corpus maintainers to edit the language of other
> members’ items?

There's the problem that corpus maintainers may have wrong assumptions about the language of the item.

> How about somehow dividing the vocabulary items into two categories: the ones
> that can be readily used as search queries, and the ones that cannot. For the ones
> that cannot, instead of having a link to a search, we allow members to directly add sentences
> that are assumed to match the item and are counted as such.

I think this issue is related: https://github.com/Tatoeba/tatoeba2/issues/1281


Another idea: we could exclude old items from the "Sentences wanted" page. If an item was added more than 30 days ago (could be more, could be less, could be customizable by the user who is adding sentences), we don't display it anymore. It would show up again only if another user added the item, or if the same user re-added it.
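
A minimal sketch of that rule, assuming each item carries the date it was last added or re-added by anyone (the `last_requested` field and the function name are made up for this example):

```python
from datetime import datetime, timedelta


def visible_items(items, now, max_age_days=30):
    """Keep only items last requested within the past `max_age_days` days.

    `items` is a list of (text, last_requested) pairs; re-adding an item is
    assumed to refresh its `last_requested` date, which makes it show up again.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [text for text, last_requested in items if last_requested >= cutoff]
```

Making `max_age_days` a per-contributor setting covers the "customizable" part: someone who wants to see older requests just raises the limit.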
Thanuir
yesterday
Could the time limit be keyed to the activity of the user, instead? If a user has not contributed to the site for a year, no longer display their wanted vocabulary. Otherwise, one would have to continuously upkeep their vocabulary for it to survive, which sounds like busywork or a discouragement.

I would also suggest that one month is far too short. If a word is simple, sentences might be added in a month, but they might also be there already. Most legitimate words that linger for long would presumably be obscure or in languages where people do not actively contribute sentences to wanted words. In this case, even a year sounds like a short period of time.
TRANG
16 hours ago
> Could the time limit be keyed to the activity of the user, instead? If a user has not
> contributed to the site for a year, no longer display their wanted vocabulary.
> Otherwise, one would have to continuously upkeep their vocabulary for it to survive,
> which sounds like busywork or a discouragement.

Well, we would also need to take into account the contributor's point of view (and by "contributor", I mean the person who is creating sentences based on vocabulary). If they have no idea what to do with a vocabulary item (because they're not inspired, or because the item is incorrect) and that item stays on the "Sentences wanted" page until one year after the vocabulary owner becomes inactive, that would be pretty long.

I get your point though. Having to actively "bump" your vocabulary requests every once in a while can be annoying.

We could implement the time limit as an option rather than a default. By default we could still display vocabulary items of all dates, then the contributor can choose to only show the more recent ones if there are too many undesirable items.

But I'm starting to wonder if the approach of displaying vocabulary items is the best approach. Perhaps we should instead display users who are requesting vocabulary so that the focus is more about connecting members with each other. It would be more engaging and, I feel, more efficient for getting rid of "bad" vocabulary items.

Since I would browse the items of a specific user, I would know who to contact to let them know that something is wrong in their vocabulary list, and they can correct it. And if they never correct it, it probably wouldn't bother anyone because the items would be "isolated" in that user's vocabulary list, as opposed to being mixed up with everyone else's items like it is now.
Thanuir
9 hours ago
I use the vocabulary feature both ways. In case a use case is interesting:

1. When reading a text and encountering a word I do not understand, I sometimes add it to my vocabulary. (This depends on the context where I am reading, more than anything else. How much time I have, for example.) Likewise, if there is a word I will or would like to use, I add it to my vocabulary. When adding something to my vocabulary, I also check the sentences that already are there that the built-in search finds.

I very occasionally go through my vocabulary, removing words that I am satisfied with, and checking the entries Tatoeba finds with the search for those terms I do not yet quite understand.

2. As a contributor, I sometimes go through the wanted Finnish words and add sentences to those. The problem is that I seem to be the only one doing this with any regularity, so I tend to add similar sentences several times. Sometimes I come up with a clever idea for a sentence and find that it has already been added, often by me. There are several wanted Finnish words, so I can just skip the ones that I find hard, or occasionally I do a little bit of research to find how the word is actually used or to find out some facts about it, or even what it precisely means if the word is obscure enough.

I do not think there are any wrong Finnish vocabulary items at the moment. If there were, I would just have to ignore them, or add them once to a sentence that points out the misspelling or misunderstanding. Adding ten such sentences to get the word out of the vocabulary list would not be inspiring at all. But a single sentence would be fine.

...

In my experience, adding words to the vocabulary is like putting a postcard in a bottle and throwing it into the sea - maybe someone will find it and add the word at some point, but there are absolutely no guarantees, and it never happens fast.

Languages: Mostly Danish and both Norwegians, occasional scientific or obscure English, couple of obscure Finnish words, and maybe something else.

Likewise, the list of requested Finnish vocabulary seems to be fairly static, so I consider it as a source of inspiration and a puzzle game, as well as a way of getting to know some obscure words, rather than as actively helping someone in the moment. Actively helping someone would be nice, though, but it also works as is for me.
jegaevi
20 hours ago
Dear Hungarian members!
Over the last 15 days I have read through a total of 10,180 sentences. I commented on many of them. I had planned to go through all the sentences so that there would be as few errors as possible among the Hungarian sentences. I did not mean to hurt, harass, or judge anyone by doing this. I am truly sorry if anyone experienced it negatively. I don't know whether it is worth continuing, because honestly, I have lost much of my enthusiasm for it. And it seems that others, too, just take it as harassment and judge me to be disrespectful because of it. I would rather stick to translating and uploading audio. I am not going to apologize to anyone, because I believe I have done nothing that would warrant it. I still very much welcome comments under my sentences, and in fact I am very glad when someone takes the trouble to correct me or just make a suggestion.

I wish everyone continued happy Tatoeba-ing! :)
maaster
18 hours ago
I am glad of your efforts, and for my part it is no problem if you write comments on my sentences. However, we are not all alike: some people's style is not very appealing, a grumpy-bear style; indeed, they consider themselves holy, untouchable, and infallible, and will never thank you for your remarks or even deign to reply to them.
The main thing is that you find joy in what you do here and learn from it.
Etvreurey
yesterday
► Over 900 Hungarian sentences (of mine) without audio, not yet translated into any other language ◄

https://tatoeba.org/eng/sentenc...&direction=asc

Perhaps you would like to translate some of these into your native language.

https://tatoeba.org/eng/sentenc...=&sort=created
maaster
yesterday - yesterday
You're doing great! You studied marketing, didn't you? (Even if not sculpture.)

Your "grass" sentence nearly made me die laughing.

(#7930398)
Etvreurey
yesterday
Well, the thing is, I didn't study that either...

Ever since the Lego, this torpedo hasn't been going so well. :)

#7932545
CK
29 days ago - 28 days ago
** New Hungarian Voice **

jegaevi contributed over 300 Hungarian audio files today.

https://tatoeba.org/eng/sentenc.../show/8977/und

[EDIT - 11 hours later]

I just uploaded another 250 audio files by her, so now there are 556 audio files.
maaster
29 days ago
Great. / Szuper!
mraz
28 days ago
CK
yesterday - yesterday
Since she started contributing audio less than a month ago, Jegaevi has contributed 2,203 Hungarian audio files.

https://tatoeba.org/eng/sentenc.../show/8977/und

Hungarian is now in 8th place for sentences with audio:

https://tatoeba.org/eng/audio/index/hun

You can see all the contributors of Hungarian audio files and how many they have each contributed with this link.

https://tatoeba.org/eng/sentenc...udio%20-%20hun
CK
3 days ago
** Over 13,000 English sentences with audio, not yet translated into any language **

https://tatoeba.org/eng/sentenc...filter=exclude

Perhaps you would like to translate some of these into your native language.
deniko
3 days ago - 3 days ago
I think there are even more of those.

Your search string doesn't find sentences like this one:

#7521148

A screenshot in case anyone translates it and spoils my example:

https://i.imgur.com/XAnea9d.png

They're also not translated into any language; the English "translations" are just slight grammatical or spelling variations with no change in meaning or even connotation.
Thanuir
3 days ago
This query finds a few more: https://tatoeba.org/fra/sentenc...=&sort=created , but has the same problem that deniko mentioned.

Though there is little reason to worry about not finding all the sentences here, as there are plenty already.
EoghanM
6 days ago
[this rant needs to be prefaced with my thanks and understanding of the difficulty in designing and maintaining a software project of this complexity on a part time basis]

I've just gone through a random 20 or so sentences and found 2-3 errors that have been highlighted in the comments in the past (both in terms of easily verifiable spelling errors, and also links between non-direct translations). Sometimes the comments are 6 or more years old.
I feel like this site is far too conservative in not allowing people to fix up other people's mistakes or omissions.
I understand that there is a danger of a well-meaning but uninformed newcomer vandalizing a load of content, but it's very disheartening to not be able to fix obvious things.
Also I don't think the concept of links between sentences is all that mysterious, and some more power should be given to normal contributors in this regard.
I.e. as a new member, it's possible for me to link two sentences together by adding a translation which matches an existing sentence, so why should I not also be trusted to unlink sentences from each other?

[end rant]
Thanuir
5 days ago
There is a similar, to me surprisingly strong, take on authorship at some other communal websites. Maybe most people are not willing to have others edit their contributions with impunity? There clearly need to be some restrictions to stop vandalism, in any case.

Getting the right to add and remove links can happen fast. Just translate or add sentences to your strongest language for a while and then ask for the privileges: https://en.wiki.tatoeba.org/art...d-contributors

With smaller languages it might be that there is no highly active user with privileges, which will slow everything down. With bigger languages, the number of active users might be too small. You can at least comment and maybe mark sentences as OK / unsure / not OK. If you later acquire privileges, then you can go through your commented sentences or the sentences you have marked in a particular way, or sentences in a particular language marked in a particular way, and do the clean-up.

If you add a sentence identical to another, those two will be merged and all translations of both sentences will remain as translations.
EoghanM
5 days ago
> Just translate or add sentences to your strongest language for a while and then ask for the privileges

My strongest language is English and I'm noticing errors in a secondary language, so I don't think that is achievable; what I'm looking for is 'gardening' functionality.

> maybe mark sentences as OK / unsure / not OK
Yes, this is a great function, but is there any feedback when someone resolves your 'OK/Not OK' (if ever)?
I don't think there is similar functionality to draw attention to a link/relationship.

Also, adding tags should be opened up for any member, no?
AlanF_US
5 days ago
You can always leave a comment on a sentence to point out problems with it, even if the sentence is not in your strongest language.

If you mark a sentence "OK", "Unsure", or "Not OK", and then the sentence is modified, the marking will then be displayed with "(outdated)" or a similar string. You can always change or remove your marking later, though you won't get a notification reminding you to do it.

The tag system is open to abuse. Requiring people to stick around the site for a while and explicitly ask for the next level of privilege before they can add tags tends to reduce the amount of abuse.
Thanuir
5 days ago
> My strongest language is English and I'm noticing errors in a secondary language, so I don't think that is achievable; what I'm looking for is 'gardening' functionality.

You can get the gardening functionality by being an active contributor for a while. Hence the suggestion on what is the easiest way to do that.

This search, for example, should have pretty much every Irish sentence without a direct translation to English: https://tatoeba.org/eng/sentenc...o=&sort=random

Maybe you could translate those?

Another fruitful thing would be to add sentences in English that others are unlikely to add. Maybe sentences particular to the Irish culture or geography, or maybe sentences related to your expertise (your particular hobbies, work, education, etc.). Another good way of contributing is to check vocabulary people want to see more of, https://tatoeba.org/eng/Vocabul..._sentences/eng , and add sentences that use those words.

People are typically discouraged from contributing in non-native languages, but contributions in endangered (or constructed or dead) languages are less frowned upon or even happily accepted. I do not know enough about the situation of Irish to comment further on this.

...

The link https://tatoeba.org/eng/collect...ghanM/outdated should contain your outdated ratings. I think a rating becomes outdated if the sentence is edited.

...

The tags were, at least originally, meant to contain objective information about the sentence; see https://blog.tatoeba.org/2010/1...uidelines.html . However, they might not be curated effectively at the moment.

Lists are a way of collecting arbitrary sentences in an arbitrary connection. I have not used them much; maybe someone else can tell you more.
EoghanM
5 days ago
> should have pretty much every Irish sentence without a direct translation to English

Thanks! Have worked through a few of those.
kemushi69
4 days ago
Go n-éirí an bóthar leat, a hEoghan.
(Happy Trails, Mr. Eoghan)
AlanF_US
4 days ago
> Lists are a way of collecting arbitrary sentences in an arbitrary connection. I have not used them much; maybe someone else can tell you more.

For example, lists can be used to:
- import sentences into a flashcard program like Anki
- collect sentences that you would like translated into another language
- collect sentences in a language that you would like Tatoeba to support
- collect sentences that you want to process on your own computer

One thing that makes lists useful is that you can download them (though currently this is restricted to lists containing 100 or fewer sentences).
Aiji
5 days ago
> Maybe most people are not willing to have others edit their contributions with impunity?

As they should be. "Obvious mistakes" is a very dangerous concept. Some people might consider something they've never heard in their life an obvious mistake even before asking the author. For inactive users, that's a little bit different, but as an active user, I would be quite upset if someone were to correct my sentences without notifying me.

> it's very disheartening to not be able to fix obvious things.
That's a beginner's feeling: trying to do everything by yourself before thinking about the community behind the site.
If I have a problem with an English sentence, I can contact one of many people.
If I have a problem that needs an admin privilege (a privilege that I don't have), I contact an admin.
etc. And there's no need to worry about the 30 seconds you would save by doing it by yourself ^^
Thanuir
4 days ago
I think that overeager users would cause some damage, but the overall effect on quality would be good.

However, the problem would be malicious users. What if someone created, say, a bot that would replace words in a language with an obscure script with "Pepe the Frog" written in that script? It might take quite a long time before anyone noticed anything. Or started making obscure mistakes or removing or adding single letters to words. Etc. There would have to be a highly convenient way of reverting such edits en masse over vetting them.

I think human oversight in giving people editing privileges is necessary, given the highly fragmented community and large corpus.

...

However, the real problem here is that there are many housekeeping (or gardening) tasks that are not done reliably. This is doubtless true of small languages that do not have sufficiently active native contributors, but also might be the case with English due to the large volume, I guess.

Leaving comments on sentences is not satisfactory, because they might very well be missed. Tags such as @change and @NNC are better, since they are task queues. But I meet Norwegian bokmål sentences with @NNC all the time, because there is at least one prolific non-native contributor, but no sufficiently active native ones, at least with privileges. There are lots of unadopted English sentences. Some tags are not objective designators of facts. The same tag can exist in multiple versions.

Is it possible to view all sentences, starting with the ones with the most "not OK" tags? This would also work as a nice queue of sentences requiring actions, and would be something even a newcomer could contribute to by marking sentences.
EoghanM
4 days ago
>> > it's very disheartening to not be able to fix obvious things.

> That's a beginner feeling, trying to do everything by themselves before thinking about the community behind.

@Aiji I think you missed the point I was trying to make in that there were people who had pointed out the problem with the sentence 6+ years ago in the comments and nothing was done.
So the disheartening part includes the aspect that there's no point in commenting either.
AlanF_US
3 days ago
The fact that someone pointed out a problem years ago and got no response does not mean that it's useless to post a comment now. Something might have changed in between. For instance, perhaps there was no corpus maintainer for the language then, but there is now.

If you post a new comment, it will show up at the top of the most recent comments. Also, if you can figure out the name of a corpus maintainer for the language, you could try adding their name to the comment, preceded by an at-sign (@). They will then get a notification.
gillux
13 days ago
Dear Tatoebians,

Recently I have been working on improving the export capabilities of Tatoeba. I created the necessary code base to provide customized exports (as opposed to our generic exports of the Downloads page). In the future, I plan to allow various kinds of exports, like all sentences of a specific language, of a specific user, having a specific license, sentence pairs… But for now, I just started with implementing list exports.

You can check this out on our development website https://dev.tatoeba.org/. Log in with the same credentials as here, and then go to the page https://dev.tatoeba.org/exports/index. From there, you should be able to export and download any list you have access to, no matter how big it is.

Feedback is very welcome.
Guybrush88
13 days ago
I tried with some lists, including one containing almost 400 sentences, and it worked properly.

In the meantime, I have a few suggestions:

1) I'd like to see the ID of sentences. For example: I download some lists for whatever reason, and, if I notice any mistake in a sentence and I want to report the mistake on Tatoeba, I would personally find it quicker to copy and paste the ID rather than the sentence.

2) I would find a delete button useful. If I download something and then notice it isn't what I actually wanted to download, I'd like to be able to delete it. This would avoid confusion for users who need to download many things and, no less important, it would free up some space on the server.

3) What about mentioning the license of the sentences contained in lists? Maybe this would be helpful if someone wants to reuse them.
soliloquist
13 days ago
I tried downloading some lists containing more than 1,000 sentences and it worked fine. The only drawback is that exported lists are monolingual.

That 100-sentence limitation was a problem. Thanks for working on it.
Guybrush88
6 days ago
I just saw that the download button is truncated when the list name is long: http://i63.tinypic.com/15s3cih.png
Fructo
8 days ago
I want to know your personal profiles.
What kind of people are you?
Who are you?
Why do you spend so much time on Tatoeba?
I'm sure you're nice people. But why not tell more about yourself?
Objectivesea
7 days ago
Those Tatoeba contributors who choose to say something about themselves will often do so in their individual profile pages. For example, my profile can be seen at <https://tatoeba.org/epo/user/profile/Objectivesea>

However, when I viewed your profile, Fructo, it says nothing about what kind of person you are, who you are or why you spend time on Tatoeba. I'm sure you could modify your profile to set an example for the rest of us as to the sorts of information you think these profiles should contain. Currently, all we are able to learn about you is that you are supposedly six years old, which sounds implausible on its face.
Fructo
7 days ago - 7 days ago
Your profile doesn't exist. You're posting dead links.

Also, the birthdate on my profile is fake; I was too lazy to enter the real one.
Thanuir
7 days ago
There is a typo in the link; just remove the final ">" or click on Objectivesea's user name next to the profile picture.

Also, if you are too lazy to indicate your real birth date, why would you expect others to answer your questions? I'm sure you'll get to know the users here if you just participate actively. Like in other volunteer efforts, you get goodwill by contributing.
Fructo
7 days ago
How about your goodwill? You don't seem to show much goodwill, placing random symbols everywhere where they're not appropriate.
I hope you'll learn some better spelling.
After that, you can talk to other people.
Fructo
10 days ago - 10 days ago
Hello. I still need an explanation, please.
What are the useful outcomes of this project?
How could anyone use this project to draw something useful from it?
What are useful applications of it?
Thank you.
CK
10 days ago
Various projects use the data we create here.

http://bit.ly/tatoebalinks

brauliobezerra
8 days ago
I put the typing practice game back online at http://type.braul.io .
sabretou
8 days ago
This is great! Could you also update the list of languages to what Tatoeba supports today?
brauliobezerra
8 days ago
Done!

A lot of sentences were not imported, though (the SQLite importer could not interpret some lines as valid CSV), so some languages may have no sentences. I'll take a look at this later.
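
For anyone hitting the same import problem, here is a minimal sketch of one possible workaround. It rests on assumptions that are not confirmed above: that the rejected lines contain double quotes, that the export is tab-separated with three columns (id, language code, text), and that fields are never quoted. The function name `load_sentences` and the table layout are hypothetical. The idea is to parse with Python's `csv` module with quoting disabled, so that a stray `"` is kept as ordinary text instead of opening a quoted field.

```python
import csv
import io
import sqlite3


def load_sentences(tsv_text, conn):
    """Load 'id<TAB>lang<TAB>text' rows into SQLite (assumed column layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentences "
        "(id INTEGER PRIMARY KEY, lang TEXT, text TEXT)"
    )
    # QUOTE_NONE makes the parser treat double quotes as plain characters,
    # so an unbalanced quote cannot swallow the following lines.
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Rows that do not have exactly three fields are skipped, not guessed at.
    rows = [(int(r[0]), r[1], r[2]) for r in reader if len(r) == 3]
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)
    return len(rows)


if __name__ == "__main__":
    sample = (
        '1\teng\t"Hello," she said.\n'
        '2\teng\t"Unbalanced quote\n'
        "3\tfra\tBonjour.\n"
    )
    connection = sqlite3.connect(":memory:")
    print(load_sentences(sample, connection), "rows imported")
```

Skipping malformed rows rather than repairing them keeps the sketch short; a real import would probably want to log them so that missing languages can be traced.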
Thanuir
10 days ago
In addition to what CK mentions, Tatoeba, by itself, works as a multilingual dictionary (though mostly of common words).

It has also been used for at least some research in computational linguistics; this Google Scholar search should be a good starting point: https://scholar.google.no/schol...atoeba+corpus"