
Wall (5,768 threads)

lbdx, 5 days ago (March 28, 2020, 16:10)

To be relevant, a corpus of example sentences should cover the words that are of most interest in each language.

However, even in English, Tatoeba is still far from achieving this. For example, the word 'compliance' is one of the top 50 most searched for words on the Linguee online dictionary but only has 6 sentences on Tatoeba. These sentences are only translated into seven languages and none of them into languages as important as Spanish, Portuguese or French. At the same time, 'dog' is only the 9357th most searched word on Linguee but has more than 5000 sentences on Tatoeba.

So it seems that we contributors spend too much time creating and translating sentences that will never be read by non-contributing Tatoeba users. On the other hand, we probably don't pay enough attention to the words that are the most requested, such as 'relevant', 'scope', 'enhance' or 'furthermore'.

Perhaps a system that better balances the supply and demand of sentences could be set up. Starting from the most searched words on Linguee or Tatoeba, we could suggest the most underused words to contributors who would like to maximize their impact on the corpus enhancement. For example, for a given language, we could highlight the 100 most requested words that do not have at least 20 sentences translated in the ten most important languages of the corpus.
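
The selection rule proposed above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function name `underused_words` and the sample data are hypothetical, the count for 'scope' is made up, and the per-language translation requirement is left out.

```python
def underused_words(ranked_words, sentence_counts, min_sentences=20, limit=100):
    """Return up to `limit` words, most requested first, that have
    fewer than `min_sentences` sentences in the corpus."""
    selected = []
    for word in ranked_words:  # assumed to be ordered from most to least searched
        if sentence_counts.get(word, 0) < min_sentences:
            selected.append(word)
            if len(selected) == limit:
                break
    return selected


# Illustrative data only: the counts for 'compliance' and 'dog' echo the
# figures quoted in the post, the count for 'scope' is a placeholder.
ranked = ["compliance", "scope", "dog"]
counts = {"compliance": 6, "scope": 19, "dog": 5000}
print(underused_words(ranked, counts))
```

Raising or lowering `min_sentences` and `limit` changes how long the resulting to-do list gets.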

Has a similar proposal ever been debated? Do you think it would be worth implementing?

Data source: https://www.linguee.com/english...ish/1-200.html

AlanF_US, 5 days ago (March 28, 2020, 16:33)

I think that it would be quite worthwhile to find a way to cover frequently requested but underused words. However, I believe that it would be best to do this through coordinated communication rather than by trying to change the site's software. It takes a long time to agree upon and implement software changes. Furthermore, if this initiative is successful, it will remove some of its own basis for existence, another reason we wouldn't want it "baked into" our software.

CK, 4 days ago, edited 4 days ago (March 29, 2020, 10:32; edited March 29, 2020, 10:36)

If you want to try searching for these 200 words and phrases, I've set up a page that makes it easy.

http://tatoeba.ueuo.com/linguee/

You can get additional links on the page by entering your own native language's code at the end of the following URL, instead of "fra".

http://tatoeba.ueuo.com/linguee/?t=fra
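
For anyone building such links in a script, a minimal sketch might look like this. The helper name `linguee_page_url` is hypothetical; the only thing taken from the post is that the language code goes into the `t` parameter.

```python
from urllib.parse import urlencode

BASE_URL = "http://tatoeba.ueuo.com/linguee/"


def linguee_page_url(lang_code):
    """Build the page URL with the given language code in the `t` parameter."""
    return BASE_URL + "?" + urlencode({"t": lang_code})


# "deu" is used here only as an example of a three-letter language code.
print(linguee_page_url("deu"))
```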

lbdx, 4 days ago (March 29, 2020, 12:14)

By scraping tatoeba.org, I identified the 20 most highly searched words on Linguee that are underused on Tatoeba (fewer than 20 sentences including the given word).

linguee_rank  word         nb_sentences
17            scope        19
23            compliance   6
44            leverage     12
54            furthermore  11
85            default      14
103           retention    2
107           facilitate   10
141           procurement  16
150           pending      13
205           stakeholder  2
207           vendor       16
259           invoice      17
266           alignment    11
268           incentive    15
277           framework    17
281           align        11
282           equity       7
290           gauge        19
294           venue        7
295           mitigate     12

Maybe CK could build a table from this data so that users can easily add new sentences and translations?

If it gains traction, we could update this list on a weekly basis.

Yorwba, 4 days ago (March 29, 2020, 12:46)

Note that it's possible to get all sentences as a single file from the downloads page https://tatoeba.org/eng/downloads and thanks to rumpelstilzchen's work it's now also easy to only download sentences in a certain language.
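
Counting from a downloaded file could look roughly like the sketch below. It assumes each line of the export is tab-separated as `id<TAB>lang<TAB>text`; that layout, the function name `sentences_per_word`, and the sample lines are assumptions for illustration, and the tokenizer only handles plain ASCII letters.

```python
import re
from collections import Counter


def sentences_per_word(lines, lang, words):
    """Count, for each word, how many sentences in `lang` contain it
    as a whole token (exact match, case-insensitive)."""
    wanted = {w.lower() for w in words}
    counts = Counter({w: 0 for w in wanted})
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) < 3 or parts[1] != lang:
            continue  # skip malformed lines and other languages
        tokens = set(re.findall(r"[a-z']+", parts[2].lower()))
        for word in wanted & tokens:
            counts[word] += 1  # each sentence counts at most once per word
    return dict(counts)


# Made-up sample lines in the assumed format; a real run would iterate
# over the opened export file instead of this list.
sample = [
    "1\teng\tThe scope of the project is limited.\n",
    "2\tfra\tLe scope est cassé.\n",
    "3\teng\tCompliance is mandatory. Compliance matters.\n",
]
print(sentences_per_word(sample, "eng", ["scope", "compliance"]))
```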

"Scraping tatoeba.org" sounds like you wrote a script, so using that file might be a more comfortable way to collect frequency statistics.

lbdx, 4 days ago (March 29, 2020, 12:59)

Good point. On the other hand, by scraping the data, you benefit from the Tatoeba advanced search based on text stemming right away.

CK, 3 days ago, edited 3 days ago (March 30, 2020, 13:35; edited March 30, 2020, 13:35)

You'll notice that "scope" actually has more than 19 sentences.

You forgot to include the orphans.

https://tatoeba.org/eng/sentenc...es&query=scope

Orphans are excluded automatically on the assumption that they are less trustworthy than owned sentences, even though some are likely better than some non-native owned sentences.

Orphans are included in the exported data.

Thanuir, 3 days ago (March 30, 2020, 13:57)

Would some English speaker like to adopt the rest of these orphans too, perhaps after editing them?

AlanF_US, 4 days ago (March 29, 2020, 14:26)

A few things to consider:

(1) While words with many different senses, like "set", need a wide variety of sentences, even ten sentences is pretty good for getting an idea of how to use a word like "mitigate", as long as the sentences are not near-duplicates of each other. Of course, there are a lot of near-duplicates in our sentences, and the number keeps increasing.

(2) The stemmer misses some correspondences. Thus, a search for "retention" misses "retain", and a search for "retain" finds 36 hits.
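
To illustrate point (2), here is a toy suffix-stripping stemmer. It is not the stemmer Tatoeba's search uses, just a hypothetical sketch of why purely suffix-based stemming can link "retains" to "retain" but not "retention" to "retain".

```python
SUFFIXES = ("ing", "ion", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("retain", "retains", "retained", "retention"):
    print(w, "->", toy_stem(w))
```

Here "retention" is reduced to "retent", which differs from "retain", so a search keyed on stems would not connect the two.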

lbdx, 4 days ago (March 29, 2020, 15:14)

(1) It is true that it is difficult to define a threshold suitable for both monosemic and polysemic words. It may be preferable to start with a low threshold at 10 and then raise it to 20 in a second round.

(2) If the stemmer is imperfect, it is probably better to use the exact match. This is the option chosen on this page: https://tatoeba.org/eng/Vocabul..._sentences/eng . Another advantage is that it makes it easier to process the data in the script.

With these parameters, we get the following list for English:

linguee_rank  word         nb_sentences
23            compliance   7
44            leverage     9
103           retention    1
107           facilitate   7
141           procurement  1
205           stakeholder  1
206           comprise     9
228           evaluation   9
266           alignment    4
268           incentive    9
281           align        4
282           equity       8
294           venue        5
295           mitigate     2
299           liability    9
314           preliminary  6
321           hub          9
337           offset       2
343           amend        4
344           retrieve     9

gillux, 4 days ago (March 29, 2020, 14:53)

I think Linguee and Tatoeba are too different as projects to be compared directly.

You are talking about "words that are of most interest". But to who? It depends on the person, on your mother tongue and culture, on what you are looking for etc. Does having more people querying a word really make that word more important? (open question)

You are talking about "languages as important as Spanish, Portuguese or French", but what makes them so important? Again, it depends on the person, on your mother tongue and culture, on what you are looking for.

Linguee is good at answering the demand of the market of translations, but that is a specific and restricted view on languages. For example, Linguee will help me a lot if I have to translate legal papers, but it won’t help me if my job is to create subtitles. Tatoeba has a broader approach that makes it, by design, less efficient on specific fields.

If I am a translator working with English, Spanish or French, I’d rather use Linguee than Tatoeba to help me with a translation job. In contrast, if I am a beginner learning Esperanto, or even Portuguese, I’d rather use Tatoeba. Or if I’m looking for sentence pairs I can legally re-use in my online courses or learning app, I’d rather use Tatoeba.

Note that I am not trying to denigrate your comment. To the contrary, I think you are raising important questions. I just want to show that you (as well as everyone, including me) have a bias about what is "relevant" or "important" or "should be on Tatoeba".

lbdx, 4 days ago, edited 4 days ago (March 29, 2020, 15:30; edited March 29, 2020, 16:43)

You are right, Linguee is different from Tatoeba. If I used Linguee's rankings, it's for lack of a better one. Of course, if Tatoeba was mainly used to translate Estonian subtitles into Esperanto, then more Estonian sentences would have to be created. ^^

By the way, it would also be interesting to know which words are the most searched for on Tatoeba. Is such data compiled somewhere?

In any case, vocabulary related to the business world is not well covered by contributors, and this is detrimental to the completeness of the corpus. I am pretty confident that for a developer "looking for sentence pairs I can legally re-use in my online courses or learning app", exhaustiveness or at least large diversity is not a detail.

gillux, 1 hour ago (April 3, 2020, 5:32)

We have logs of all queries made to the search engine up to one year ago, but it’s just raw data and it’s not publicly available. I can anonymize this data and put it online if you want to play around with it.

CK, 4 days ago (March 30, 2020, 0:37)

From an English learner's point of view, lists such as the NGSL are likely much more useful.

http://www.newgeneralservicelist.org/

Here is an easy way to browse sentences from the Tatoeba Corpus in 2014.

http://www.manythings.org/sentences/words/

At the time I built this set of pages, List 907 had a little over 300,000 sentences and 123,278 of these had audio.

There are now over 750,000 sentences on List 907 and over 490,000 English sentences with audio. My plan is to rebuild these pages after I get to 500,000 English sentences with audio.

Thanuir, 3 days ago (March 31, 2020, 5:29)

A different target audience, probably. The NGSL presumably starts from the most common words (which are easy to learn anywhere), whereas the source mentioned in the opening post seems to cover words whose meaning is unclear to people. Presumably the people interested in those have already gotten past the problems of beginner vocabulary.

CK, 3 days ago, edited 3 days ago (March 31, 2020, 6:20; edited March 31, 2020, 6:26)

I'm not exactly sure what you're saying since machine translation doesn't always correctly translate things, but perhaps you missed several of the other lists on that site, and the fact that my 2014 set of pages listed some of the upper-level word families numbered-sequentially after the main general list word families.

NEW ACADEMIC WORD LIST
http://www.newgeneralservicelis...emic-word-list
TOEIC WORD LIST
http://www.newgeneralservicelist.org/toeic-list
BUSINESS SERVICE LIST
http://www.newgeneralservicelis...s-service-list
NGSL-SPOKEN
http://www.newgeneralservicelist.org/ngsls


Briefly, for those who don't want to read the whole NGSL website...

A person mastering the 2,368 "word families" (not just single words) should be able to understand up to 92% of what they read.

A word family is something like this.
friend, friends, friendly, unfriendly, friendless, friendship

If you also can master the additional 960 word families on the NAWL, you should be able to handle college textbooks, academic journals, etc. About 92% coverage, since you will also need to learn vocabulary specific to your field of study.

The Business Service List is similar to the NAWL, except it's aimed at business English. An additional 1,700 "word families" over the 2,368 on the NGSL should give you about 97% coverage.

The main purpose of such lists is to help people studying English focus on the basic "word families" that are most frequently used, so people can get to the point where they can understand a very high percentage as quickly as possible.

(This is just a brief explanation. Read the website for more details.)
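
As a rough illustration of what "coverage" means here, the sketch below counts which share of the tokens in a text belongs to a known word family. The function names and the sample sentence are hypothetical, only the "friend" family comes from the post, and this is not how the NGSL authors computed their figures.

```python
import re


def build_lookup(families):
    """Map every member of every word family to its headword."""
    lookup = {}
    for head, members in families.items():
        for member in members:
            lookup[member.lower()] = head
    return lookup


def coverage(text, lookup):
    """Fraction of tokens in `text` that belong to a known word family."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lookup)
    return known / len(tokens)


families = {
    "friend": ["friend", "friends", "friendly", "unfriendly", "friendless", "friendship"],
}
lookup = build_lookup(families)
# Two of the five tokens ("friends", "friendly") are in the only known family.
print(coverage("My friends are friendly people", lookup))
```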

Thanuir, 2 days ago (March 31, 2020, 6:49)

Got it. I glanced at your site and it mostly showed words from the general list. The other lists do seem to be there as well, but their vocabulary also looks fairly general, which is presumably their purpose.

gillux, 4 days ago (March 29, 2020, 14:55)

And of course, you are welcome to add more sentences based on your idea of what words should be better covered on Tatoeba. ^^

lbdx, 4 days ago, edited 4 days ago (March 29, 2020, 17:04; edited March 29, 2020, 17:26)

That's what I started doing.
With this suggestion, I'm just trying to contribute by publishing pointers for those who lack inspiration and would like to make sure that they create or translate sentences that people want to read.

Aiji, 4 days ago (March 30, 2020, 0:35)

Just two points:
- You could add vocabulary items for words that you want more represented. Right now, the vocabulary feature is difficult to use efficiently (but it will change!) but that way you can directly follow the number of sentences for each of the words you're interested in and see what sentences exist. While your limit was 20, the limit for "sentences wanted" is 10, meaning that words appearing in more than ten sentences will not appear on the "wanted" page.

- Please do not add sentences from Linguee directly ^^ (except those whose source is compatible with Tatoeba)

lbdx, 3 days ago (March 30, 2020, 8:20)

I'm glad you mentioned this feature, and glad to learn that it will be improved soon. I'm already using it and I like it a lot. While browsing it for different languages, I noticed that it could perhaps be improved in two ways.

- Sometimes the requested expressions are formulated in such a way that it is impossible to answer the request: spelling mistakes, use of the equal sign to specify the meaning of the word... If requests had a limited lifetime, these problematic expressions would drop out naturally.

- I also noticed that for some languages, this feature is not used much. Maybe the lists could be automatically filled with the underused expressions searched on Tatoeba at least x times during the last y days by different users. This option would have the advantage of better connecting ordinary users with the corpus contributors and of being applicable to all languages.
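
The second suggestion could be sketched along these lines. Everything here is hypothetical: the log format `(query, user_id, day)`, the function name, and the default threshold are assumptions, and "at least x times by different users" is read as "by at least x distinct users".

```python
from datetime import date, timedelta


def suggest_from_searches(log, sentence_counts, today, min_users, days, max_sentences=10):
    """Return queries searched by at least `min_users` distinct users in the
    last `days` days that have fewer than `max_sentences` sentences."""
    cutoff = today - timedelta(days=days)
    users_per_query = {}
    for query, user_id, day in log:
        if day >= cutoff:
            users_per_query.setdefault(query, set()).add(user_id)
    return sorted(
        query
        for query, users in users_per_query.items()
        if len(users) >= min_users and sentence_counts.get(query, 0) < max_sentences
    )


# Made-up log entries and counts, for illustration only.
log = [
    ("stakeholder", "u1", date(2020, 4, 1)),
    ("stakeholder", "u2", date(2020, 4, 2)),
    ("dog", "u1", date(2020, 4, 1)),
    ("dog", "u3", date(2020, 4, 2)),
    ("venue", "u1", date(2020, 1, 1)),  # too old to count
    ("venue", "u2", date(2020, 4, 2)),
]
counts = {"stakeholder": 1, "dog": 5000, "venue": 5}
print(suggest_from_searches(log, counts, date(2020, 4, 3), min_users=2, days=7))
```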

lbdx, 14 hours ago, edited 14 hours ago (April 2, 2020, 16:14; edited April 2, 2020, 16:16)

Thanks to your feedback, I have created a new tool that can be found at https://tatominer.imfast.io.

It will help contributors who want to diversify Tatoeba's vocabulary while paying special attention to expressions that are often searched for by users of bilingual dictionaries.

AlanF_US, 12 hours ago (April 2, 2020, 18:29)

This is nice. Would it be possible to provide a drop-down to select the language? Most people will probably only be interested in one language at a time. While it's possible to sort by language and then page to the place where sentences in the desired language start, it's a cumbersome process.

lbdx, 11 hours ago (April 2, 2020, 19:17)

There is a search box on the top-right corner of the table that enables you to filter content. For example, typing 'eng' will filter all English sentences.

AlanF_US, 5 hours ago (April 3, 2020, 1:20)

That works, but it's not the behavior I would have expected. For one thing, searching for "eng" also brings up non-English words containing "eng".

CK, 4 hours ago, edited 3 hours ago (April 3, 2020, 2:00; edited April 3, 2020, 3:26)

That's the default for the datatables jQuery plugin.

If you want to find English sentences, use the sort on the language column and then browse for "eng.".
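
The difference between a table-wide substring search and a filter on the language column can be modeled with a small sketch. This is a simplified model with made-up rows and hypothetical function names, not the plugin's actual implementation.

```python
def global_search(rows, term):
    """Keep rows where `term` occurs as a substring of any cell (case-insensitive)."""
    term = term.lower()
    return [row for row in rows if any(term in str(cell).lower() for cell in row)]


def filter_by_language(rows, lang_code):
    """Keep rows whose first column is exactly `lang_code`."""
    return [row for row in rows if row[0] == lang_code]


# Each row is (language code, word); the rows are illustrative only.
rows = [("eng", "scope"), ("deu", "eng"), ("fra", "engager")]
print(global_search(rows, "eng"))       # matches all three rows
print(filter_by_language(rows, "eng"))  # matches only the English row
```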

I was just experimenting with this plugin a couple of days ago to display words from the New Academic Word List (NAWL).

https://bit.ly/39DLZ2T
New Academic Word List (NAWL) with Definitions in English and Japanese

I also uploaded it to here.
https://all.imfast.io/nawl/

Aiji, 2 hours ago (April 3, 2020, 4:21)

I may have a look at it from time to time. How often do you plan to update the data (if you plan to at all)?

gillux, 1 hour ago (April 3, 2020, 5:08)

For those who wonder what’s with the contributions from FixHashesCommand, first I would like to apologize for polluting the "latest contributions" table. This is just a one-time maintenance operation that affected 2960 sentences.

It’s the result of rumpelstilzchen’s work to fix duplicate merging issues with some sentences. FixHashesCommand is a bot that edited the sentences in order to normalize the text that otherwise would have been stored in different but visually identical ways.
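
For readers wondering how two visually identical sentences can be stored differently, here is a small sketch using Unicode normalization. It assumes NFC normalization and a SHA-1 hash purely for illustration; the thread does not say which normalization form or hash function the site actually uses.

```python
import hashlib
import unicodedata


def text_hash(text, normalize=True):
    """Hash the text, optionally normalizing it to NFC first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


precomposed = "caf\u00e9"   # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings render the same but compare as different ...
print(precomposed == decomposed)
print(text_hash(precomposed, normalize=False) == text_hash(decomposed, normalize=False))
# ... until both are normalized before hashing.
print(text_hash(precomposed) == text_hash(decomposed))
```

Without normalization the two hashes differ, so a duplicate check based on the hash would treat the sentences as distinct.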

Ricardo14, 1 hour ago (April 3, 2020, 5:19)

The dev team doesn't stop for anything! Thanks, everyone!

lilyjohnson, 23 hours ago (April 2, 2020, 7:23)

This message has been hidden because it violates the rules. Only administrators and the author can read it.

AmarMecheri, 2 days ago, edited 2 days ago (March 31, 2020, 21:46; edited March 31, 2020, 22:20)

Dear all,
Since we have specified in our profiles that we wish to publish our sentences under the CC0 1.0 license, for the most part at least, why do we still have to specify each time that a sentence is indeed under the CC0 1.0 license and not under the CC BY 2.0 FR license?
I therefore feel that I am doing double work when I contribute to Tatoeba, since translations are assumed to be under the CC BY 2.0 FR license unless specified otherwise.
Maybe it is nobody's fault, not even the administrators', but it is one of two things: either the Kabyles who keep doing double work like me are masochists, or somebody wants to dissuade people from writing under the CC0 1.0 license; that is to say, the Kabyles are masochists who agree to pay the price for having supported their flag. In any case, no one can speak of fair play. So the masochist AmarMecheri continues, no matter what, to defend his endangered language.
I sincerely apologize for writing this; it is because I get it wrong all the time and get confused every time.

TRANG, 2 days ago (March 31, 2020, 22:17)

The answer is unfortunately still pretty much the same as last year:
https://tatoeba.org/eng/wall/sh...#message_31603

> I understand your frustration. To answer the question of "why",
> the reason is that nobody has worked on improving this yet.

We are working on involving more developers, but we're still very understaffed for the huge amount of work that it takes to maintain and improve Tatoeba.

AmarMecheri, 1 day ago (April 1, 2020, 14:37)

Thank you for answering so kindly when I was perhaps somewhat "impatient and irritated". It's a pity I am not versed in computing, otherwise I would have helped solve the problem, which remains entirely unresolved. I will wait patiently, then. Fair winds to all!

Ooneykcall, 5 days ago (March 28, 2020, 9:28)

So Tom_Facts was suspended. Kinda sad, isn't it? Obviously an alt account made to post joke sentences, which may technically be against the rules, but those sentences were all linguistically sound and just didn't make ordinary sense, for example this one: #8639644. It was a nice set of jokes that in a way hearkened back to "Tom and Mary in the land of sentences" from the olden days. I am very much not happy that those funny sentences are rendered illegitimate and closed for further translations. Could an admin please unblock them, since they are grammatically correct and understandable, even if they don't make sense in the real world?

TRANG, 5 days ago, edited 5 days ago (March 28, 2020, 9:30; edited March 28, 2020, 9:48)

It was not for offensive reasons, nor because some don't make sense, but for legal reasons.

See my comment here:
https://tatoeba.org/eng/sentenc...omment-1167887

TRANG, 5 days ago (March 28, 2020, 9:45)

In case this situation is saddening more people, then please know that there is a technical solution: https://github.com/Tatoeba/tatoeba2/issues/1659

But somebody has to volunteer to implement it.

Until that is implemented, we unfortunately cannot accept contributions when they were very obviously mass-copied from a certain source, and that source is not compatible with CC BY.

Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/ and see if they would be willing to allow reuse of their content.

Ooneykcall, 5 days ago (March 28, 2020, 10:13)

> Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/ and see if they would be willing to allow reuse of their content.

That's what I'm going to do. It's just a joke after all, so I figure they wouldn't be stuck-up about it. I think preserving the sentences in their original form, with Chuck Norris as the protagonist, would be better though. So if this goes well, I suppose you would set them free (of Tom_Facts' "ownership"), and I would adopt them, mass-edit Tom to Chuck Norris, edit my translations accordingly, and ask deniko to edit his. I think no one else translated those sentences, other than the PIN code one. It's a bit tedious, but there are only 192 sentences, so it wouldn't take that long.

TRANG, 5 days ago (March 28, 2020, 10:39)

You would have to convince them to officially add their licensing terms on the website. If you find a contact point via email, please include team@tatoeba.org in CC.

I would not be as confident as you that they would agree to allow reuse of their content under a CC BY compatible license. In the end it depends on what their ad revenue is.

Making their content CC BY compatible means that anyone can technically and legally just copy their content and create a website that directly competes with them. As a result, there is a risk that fewer and fewer people will visit their website. If they make little to no money from ads, they won't care about this risk. But if they make a significant amount of money, I would be extremely surprised if they agreed to take such a risk.

Ooneykcall, 5 days ago, edited 5 days ago (March 28, 2020, 11:06; edited March 28, 2020, 11:10)

In that case, if we keep 'Tom', wouldn't it help solve the issue if the site admins agreed to accept derivative content and not direct copying? That's different, since a hypothetical "not Chuck Norris facts" site obviously couldn't compete: people know it's supposed to be about Chuck Norris and would think something's fishy.

I don't see why it has to be an official request, really... I mean, we're just normal people having fun, not sleazy lawyers. It's my personal initiative/request and I've sent it as such. If I'm granted any permission I'll have the email for proof, no problem.

TRANG, 5 days ago (March 28, 2020, 12:13)

> In that case, if we keep 'Tom', wouldn't it help solve the issue if the
> site admins agreed to accept derivative content and not direct copying.

No, it wouldn't be compatible with CC BY. We wouldn't be able to keep these sentences as part of the corpus that we distribute because of this scenario:

- They allow derivatives of their jokes.
- We also allow derivatives of our sentences (that's the nature of CC BY).
- Someone copies sentences from Tatoeba into their website and changes every sentence with "Tom" to "Chuck Norris".
- Someone has then indirectly copied sentences from the Chuck Norris facts website.

> I don't see why it has to be an official request, really...

Because it would be irresponsible and disrespectful to ignore intellectual property.

https://en.wiki.tatoeba.org/art...ting-sentences

Ooneykcall, 5 days ago (March 28, 2020, 13:26)

> - They allow derivatives of their jokes.
> - We also allow derivatives of our sentences (that's the nature of CC BY).
> - Someone copies sentences from Tatoeba into their website and changes every sentence with "Tom" to "Chuck Norris".
> - Someone has then indirectly copied sentences from the Chuck Norris facts website.

Are you really expecting that to happen?
You didn't use to be so concerned with lawyerese, I reckon. Sad times if you have to fear that someone could cry havoc over this, because nobody in their right mind would, but sick minds specialise in giving sound minds a headache.

> Because it would be irresponsible and disrespectful to ignore intellectual property.

That's no answer as to why there would be any need for officialdom. I'm communicating as myself, a private person not a legal entity. All I need is to make sure the admin(s) of that site approve of adding their sentences here and have no mind to object to it.

Chuck Norris jokes weren't invented by that website, either; it's the other way around. They've added new jokes over the years, obviously, but most of those seem to have been submitted by others. How much of that copyright actually holds, hmm.

TRANG, 5 days ago (March 28, 2020, 15:09)

> You didn't use to be so concerned with lawyerese I reckon

I am not concerned when the sample of copied content is so minimal that we can still make the assumption that the contributor came up with the sentences on their own.

I have however always been concerned whenever we could very clearly identify the original source of the copied content.

> That's no answer as to why there would be any need for officialdom.

I will try to explain more clearly.

As long as *all* our sentences are exported into a dataset that we distribute under CC BY, it would be irresponsible and disrespectful to ignore intellectual property. We, Tatoeba, have some liability with regard to what is inside this dataset.

Yes, it is crowdsourced and it is not possible for us to monitor every single sentence. But that doesn't take away from us the responsibility to try our best to be compliant with the laws. If we can't do it, we have to stop releasing our content under CC BY. But I'm not going to sacrifice open data just for Chuck Norris jokes.

Personally I'm not going to be nitpicky about a handful of non-CC BY sentences. But there's a point where it starts to be a bit too much. Having close to 200 sentences gathered under the same account makes it very, very obvious what the source is. And when it becomes very obvious what the source is, it is too much.

Now again, the main blocking point is that *all* our sentences are exported into a dataset that we distribute under CC BY. If we could somehow exclude some sentences from our exports, that would solve the problem. Well it turns out that sentences set as "unapproved" are not exported. So that has been our temporary solution for keeping non CC BY sentences in Tatoeba.

But this poses another problem: we will have sentences in red even though they are correct. To solve that, we have a technical solution: https://github.com/Tatoeba/tatoeba2/issues/1659.

Once that is implemented, we can much more safely allow people to copy sentences from other sources because we can easily remove obvious non CC BY content from our CC BY dataset.

The content will no longer be labelled as "CC BY" data; it will just be on our website labelled as "no license" or "unknown license" or whatever else makes it clear it's not for reuse (or if one chooses to reuse it, it will be at their own risk).

We will still encourage everyone to cite the sources, to give attribution where it's due. We will also still remove content from Tatoeba if the original authors ask us to do so. But we don't have to worry about license compatibility.

Also, please be aware that there is intellectual work in creating a collection of sentences, even if each sentence in the collection is individually free of intellectual property.

For instance you can create a list of "1000 most common sentences". If you take each sentence individually, you can't argue that you own these sentences; millions of people have used them before you. But if someone were to take your exact list and publish it on their language learning website as "1000 sentences for beginners", then they basically ripped off your work, because it is intellectual work to come up with criteria for selecting what is more common.

In the case of Chuck Norris facts, they built a website, they have people accepting or rejecting the submissions, they set up a whole infrastructure to collect and share these jokes. It *is* work, it is intellectual value. They cannot claim that each joke belongs to them, but they can claim that the collection of jokes belongs to them. Not only that, but they have ads on their website, so it is also money. Personally, when money is involved, I don't want to make any sort of optimistic assumption.

I hope this clarifies things.

Ooneykcall, 5 days ago, edited 5 days ago (March 28, 2020, 16:04; edited March 28, 2020, 16:04)

Thanks for taking the time to present a detailed account. It's not like I had zero understanding, but having an explanation laid out is quite helpful; nice of you to write one up.

This analogy doesn't seem to apply to Tatoeba as it is since sentences are added individually though? I mean, I *could* add 1000 common sentences from such a list as you describe, and this couldn't be copyright violation since nobody owns any of those sentences individually. It seems there's no violation as long as I don't combine them into a list similar to the original one; only then am I arguably using someone else's work.

I have to agree that adding those sentences under an alt account set up specifically for that purpose makes it too obvious. But if a regularly contributing account/user, such as me, added common sentences from a certain source along with other sentences so their existence as a relatively large set wouldn't be visible, it would be realistically no problemo, hmm? Since it only becomes a real problem when you can easily trace those sentences and conceive of them as a single list, rather than stumble upon individual items that do not suggest a bigger list exists. That is, if those sentences aren't of the sort that makes you think right away they might come from a single author, which Internet memes are not, as anyone may want to create more phrases with the same jocular pattern.

AlanF_US, 5 days ago (March 28, 2020, 16:54)

Though I realize you're probably posing your question hypothetically, it sounds to me like "Who cares whether it's wrong as long as we can come up with a scheme for getting away with it?" Aside from the ethical problem, it seems like a bad idea to rely on our speculations, as non-lawyers, about what might or might not get us in trouble with the law. In any case, I don't think Tatoeba is so hard up for sentences that we need to steal them.

Ooneykcall, March 31, 2020 14:14

> Who cares whether it's wrong as long as we can come up with a scheme for getting away with it?

You mean legally wrong. Common jokes shouldn't be copyrightable if we're talking morally, but oh well.

Aiji, March 31, 2020 0:15

Let me add a pinch of explanation because I think it's important. The short answer is: no, it won't work. It could even make things worse, because a lawyer could accuse you of being fully aware of your copyright infringement and trying to hide it (an aggravating circumstance). It's like saying "I launder my dirty 2 million, but only 2,000 at a time. The authorities won't catch me, so it's safe."

The scenario is simple. Suppose I own the copyright on content of some sort; here, sentences. I build a script to search the web for potential copyright infringement (if I have a lot of money, I pay for such a script). Above a certain threshold (chosen more or less arbitrarily), it is decided that "potential" becomes "very likely". The two common ways to proceed from here are the following:
- I'm a relatively nice person, so I contact you, explain the problem and ask you to kindly deal with the situation.
- I don't care at all, and demand the removal of the copyrighted content, or else I'll send you my lawyer.
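The kind of script described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the normalization step and the default threshold are hypothetical and not taken from any real detection tool. It counts matches in absolute terms, so mixing copied sentences in with unrelated ones does not lower the result.

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial edits don't hide a match."""
    return " ".join(sentence.lower().split())


def count_matches(submitted: list[str], protected: list[str]) -> int:
    """Count submitted sentences that also appear in the protected collection."""
    protected_set = {normalize(s) for s in protected}
    return sum(1 for s in submitted if normalize(s) in protected_set)


def classify(submitted: list[str], protected: list[str], threshold: int = 100) -> str:
    """Return "none", "potential" or "very likely" (hypothetical labels).

    The threshold is an arbitrary assumption, as in the scenario above.
    """
    matches = count_matches(submitted, protected)
    if matches == 0:
        return "none"
    return "very likely" if matches >= threshold else "potential"


if __name__ == "__main__":
    collection = ["First sentence of the collection.", "Second sentence of the collection."]
    contributions = [
        "first sentence of the  collection.",
        "An unrelated sentence.",
        "Second sentence of the collection.",
    ]
    print(classify(contributions, collection, threshold=2))
```

Because the count is absolute, adding more unrelated sentences around the copied ones leaves the classification unchanged in this sketch.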

That's how it works on YouTube, for example. Copyright owners won't care who the content creators are; they'll strike their videos, period. You're just a science communicator who makes instructive videos? Well, too bad you used 17 seconds of the Black Eyed Peas. Talking about life in the U.S.? Well, don't use the Mario main theme.
Sometimes the content creator will make a phone call, find a reasonable person and reach a happy outcome (or they belong to a powerful network that can negotiate...); sometimes they won't.

And don't think that popular stuff, folklore, or something that everybody knows is safe. That's how copyright scammers work on YouTube. They claim ownership of any piece of content that is not copyrighted so they can strike videos and get the corresponding share of the money. Sometimes they even try to claim copyright over content that was officially released for free use...

Of course, there's no need to be paranoid. But being aware that some people are merciless helps you stay cautious. Since Tatoeba asks people to add their own sentences, it's normal to expect that people will be cautious about adding copyrighted (or possibly copyrighted) content.

rumpelstilzchen, March 31, 2020 4:11

> This analogy doesn't seem to apply to Tatoeba as it is since sentences are added individually though? I mean, I *could* add 1000 common sentences from such a list as you describe, and this couldn't be copyright violation since nobody owns any of those sentences individually. It seems there's no violation as long as I don't combine them into a list similar to the original one; only then am I arguably using someone else's work.

Have you ever heard of the "database right"? https://en.wikipedia.org/wiki/Database_right

TRANG, March 28, 2020 10:48

As for changing "Tom" back to "Chuck Norris", I would recommend against it. Keeping Tom is okay; these are jokes that have been customized for Tatoeba.

On a side note, I would like to point out that before Tom, there was Christopher Columbus:
https://tatoeba.org/eng/tags/sh..._with_tag/1158

CK, March 28, 2020 11:10

> I would like to point out that before Tom, there was Christopher Columbus:

I don't think this is true.

In English, the first "Columbus" sentence is #35544. The first "Tom" sentence was #1780. There were 34 other "Tom" sentences with numbers under 35544.

There are 652 sentences with Tom under #467000 and 18 "Columbus" sentences.

The first "Christopher Columbus" sentence is #536592. There are 671 "Tom" sentences with numbers under this.

TRANG, March 28, 2020 11:14

I didn't mean that there was no sentence with Tom before. I meant that when it comes to making jokes with a protagonist who has superhuman abilities, this trend appeared in Tatoeba with "Christopher Columbus" before "Tom".

deniko, March 28, 2020 20:42

> Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/

These jokes about Chuck Norris are basically modern folklore, I feel like it would be super weird for someone to claim any copyright on them. I'm sure chucknorrisfacts.net is not the only website that collects this folklore. Are you sure they were copied from there and they're not part of Tom_Facts's own collection? The jokes are literally everywhere.

https://www.reddit.com/r/ChuckNorris/

https://parade.com/968666/parad...-norris-jokes/

etc.

I remember reading jokes about him in this style long before chucknorrisfacts.net was created (year 2017, according to whois lookup).

Tom_Facts_Vol2, March 29, 2020 1:33

+∞

Thanuir, March 30, 2020 5:33

A few CC-licensed ones might be found here: https://commons.wikimedia.org/w...related_images

Tom_Facts, March 31, 2020 0:35

👍

rumpelstilzchen, March 31, 2020 3:54

Just because a picture of a sentence is CC-licensed doesn't make the sentence itself CC-licensed.

Thanuir, March 31, 2020 5:27

This file seems to be CC BY-licensed: https://commons.wikimedia.org/w...indow_sign.jpg

The title of the file contains the sentence. (The picture does, too.) The title of the file is a part of the file. Hence, the title seems to be CC-licensed, too. Or am I missing something?

TRANG, March 31, 2020 22:03

Reusing sentences that can be seen in pictures that were published under CC BY can give us a safety net, but it relies on the assumption that whoever uploaded these pictures was knowledgeable enough about intellectual property.

If @Tom_Facts (or anyone) can justify that there is a legit, creative or intellectually demanding process behind finding and converting Chuck Norris facts into Tom facts, then (to me at least) it would be a better defense than saying "I extracted these jokes from CC BY pictures".

If the process is "I'm browsing random Chuck Norris facts on some website(s) and I take those that I like, replace the name with Tom and add them to Tatoeba", the intellectual added value is... too minimal.

Thanuir, April 1, 2020 6:24

I'm not familiar with French copyright law; the best Finnish source I found is this: http://www.kysy.fi/kysymys/voik...kijanoikeuksia

It does not mention any court cases about the threshold of originality for jokes. For many jokes, the copyright has expired due to age, and many are part of general folklore. "Pikku Kalle" jokes and "A Finn, a Swede and a Norwegian" jokes probably fall into these categories, but "Chuck Norris" jokes may be too recent.


(This is one example of how copyright as a concept harms and hinders creative, useful and partly also scientific work, just as monopolies in general are harmful. Copyrights should be abolished or at least greatly weakened. They are damaging to culture and science.)

rumpelstilzchen, April 1, 2020 9:22

The problem isn't about whether a single joke is protected by copyright or not (it most likely isn't, and neither are the majority of the sentences on Tatoeba, IMHO).

The problem is scraping a lot of jokes/sentences from another database where we don't know its license because databases can be considered as creative works by themselves independent of whether their items are creative works or just facts (i.e. protected by copyright or not).

Now the question is whether compiling a list of jokes is considered a creative work. I'm pretty sure it is in Europe, which has rather strict database laws (see the Wikipedia article I mentioned in another message). But even in the US, which as far as I know doesn't have such strict laws, it seems likely, as the following passage shows:

"An example of a database that is protected as a compilation would be a database of selected quotations from U.S. Presidents. The individual quotations themselves may or may not be subject to copyright protection. However, the selection of the quotations involves enough original, creative expression that it is protected by copyright. Therefore, a database of quotations will be protected by copyright as a compilation even though some of the quotations are not protected." (from https://www.bitlaw.com/copyright/database.html )

So as long as nobody can reliably prove that the owner of chucknorrisfacts.net is OK with scraping many or all of the jokes, I think it's better to stop adding them.

Thanuir, April 1, 2020 11:07

I agree that the collection as a whole, or in substantial part, may be protected by copyright, and that using it without permission is impolite.

TRANG, March 29, 2020 19:27

First, please take the time to fully read my reply to Ooneykcall:
https://tatoeba.org/eng/wall/sh...#message_34643

> I remember reading jokes about him in this style long before
> chucknorrisfacts.net was created (year 2017, according to whois lookup).

Note that the domain chucknorrisfacts.net may have been registered in 2017, but the website itself has existed since 2008 (according to their footer). It was probably under another domain before 2017.

If the "2008" indicated in the footer is a lie and they have only existed since 2017, then they did a great job of replicating website design from 15 years ago...

> Are you sure they were copied from there and they're not part of Tom_Facts's
> own collection?

@Tom_Facts did not explicitly tell me they copied from chucknorrisfacts.net and I do not possess secret government agency surveillance tools, nor do I possess mind reading powers. So no, I'm not sure.

But when I check some of the sentences, I'm getting results like these: https://imgur.com/a/zzLQJQL. I can't help but be very suspicious that a large part of those jokes are copied from chucknorrisfacts.net, as it happens to be the common denominator in these results.

I'm very well aware that these jokes are everywhere. I know many of these jokes, I laughed at many of these jokes, and I wished just as much as many people here that we didn't have to worry, ever, about copyright or licensing or intellectual property.

But that won't change the fact that inserting these jokes on a large scale poses a legal risk for Tatoeba. The more that are added, the more the risk grows.

If some content has been copied and re-copied, it wouldn't necessarily and magically become CC BY compatible. So even if @Tom_Facts didn't copy directly from chucknorrisfacts.net, but from a website that itself copied from chucknorrisfacts.net, then it can still be a problem.

And I will insist again: the main problem is not that these jokes are published on Tatoeba, the main problem is that they are *incorrectly licensed under CC BY*.

Is there any platform or blog that publishes Chuck Norris facts under a Creative Commons license? No. There's none. So we cannot be thinking "Well, everyone else uses these jokes, why can't we?". We have different legal constraints, that's why we can't.

Now you are lucky, Andreas (aka. @rumpelstilzchen) has volunteered to tackle the issue https://github.com/Tatoeba/tatoeba2/issues/1659. But until it is ready and deployed, please (everyone), don't copy more jokes and memes into Tatoeba. Don't translate them if you see any that might be legally shady. Ask people to stop if you see anyone doing it. Just wait till we have the proper features in place, or just create your own jokes instead. I have no time to play the cop so I count on everyone's cooperation. Thanks.

Aiji, March 31, 2020 13:09 (edited March 31, 2020 13:11)

What's New on Tatoeba? - Your weekly recap No. 10


UPDATES

A small week it was. Some discussions here, some updates of code there, everybody needs a rest sometimes. Stay tuned next week for nice new features!


ON THE WALL

※ hamsolo474 started a thread that led to a discussion of various topics, such as the quality of a sentence and good search results https://tatoeba.org/fra/wall/show_message/34604

※ gillux performed a new UX test, this time on a user very familiar with Tatoeba. You can also help us by performing this kind of test and writing a summary :) https://tatoeba.org/fra/wall/show_message/34618

※ Ooneykcall started a thread that led to a discussion of copyright and licensing https://tatoeba.org/fra/wall/show_message/34630

※ lbdx discussed adding more sentences containing words whose meanings are often looked up https://tatoeba.org/fra/wall/show_message/34645


CONTRIBUTIONS AND LANGUAGES

※ 15 689 sentences added this week. You can check daily activity on this page https://tatoeba.org/eng/contrib...ivity_timeline

※ This week, two languages were added, bringing the number of languages on Tatoeba to 355! Thanks to Ricardo14 and gillux for coordinating this.
On zorgzikhnit's request, Bislama has been added https://en.wikipedia.org/wiki/Bislama

On MarijnKp's request, Saterland Frisian has been added https://en.wikipedia.org/wiki/Saterland_Frisian

※ Some of our members helped translate the website (credited by Transifex username). In arbitrary order:
gorkaazk, elenacristina260, herrsilen, SAmiri, Les90, yanis.batura, Aiji, gillux, arh, MarijnKp, RyckRichards, fjay69, Gulo_Luscus, robin0van0der0vliet, shekitten, Mohsin_Ali, Yorwba, easononizuka, maxine22zhang, Silja, Thanuir, Guybrush88, 58karel, michel.smts2, sabretou, small_snow

----------

If you'd like to help with the development of Tatoeba, report issues, or are just curious, have a look at the GitHub repository: https://github.com/Tatoeba/tatoeba2

If you want to help us translate the website to your language, you can join us on Transifex: https://www.transifex.com/tatoe...ite/dashboard/ and check this article on the wiki https://en.wiki.tatoeba.org/art...ce-translation

----------

Fun fact: Mary Poppins’ “supercalifragilisticexpialidocious” actually appears in some dictionaries.


Last week recap: https://tatoeba.org/fra/wall/show_message/34592
See this recap on the blog: https://blog.tatoeba.org/2020/0...kly-recap.html

hamsolo474, March 26, 2020 3:43

Has there been any progress made on implementing a voting system for reliability?

https://blog.tatoeba.org/2010/0...w-will-we.html

I read this article from a decade ago, and it mentioned a timescale of months and the need for at least 20 advanced contributors. As of writing, there are 147 advanced contributors, with 25 in German and 15 in English alone.

Is the problem that no one is willing to work on the problem from a programming sense? The desire to get it done in French first (given French only has 13 advanced contributors rather than 20)? Or is there some other reason?

This is the feature I would most like to see on Tatoeba and I would be willing to try my hand at implementing a programming solution for it.

I would be surprised if I was the first to feel this way in the last 10 years, so I would like to know: has anyone attempted this before me? What happened to their submission, if they made one? Given how widely voting systems are used on other websites, I can't imagine this being too difficult unless there is something fundamental in the design of Tatoeba that would prevent it. Does anyone know?

My questions
1. Why has there seemingly been no progress on a voting system?
2. Has anyone attempted this?
2.1 Why was their attempt rejected/not implemented?
3. Is anyone working on this now? (Perhaps I could assist.)
4. Is there some reason why such a thing would never work on Tatoeba?

Ricardo14, March 26, 2020 5:10

@hamsolo - anyone can join the dev team. That's what Trang posted recently:

"The very first thing you can do is to try and set up Tatoeba on your machine and let us know if you've faced any issue along the way, if there's anything we could simplify and if there's anything we should reword in our documentation. The more simple and easy to understand we can make this onboarding process, the better it is :)

The starting point is our GitHub repository: https://github.com/Tatoeba/tatoeba2

Once you're all set, we can move on to more concrete issues. Just let me know when you're ready!"

Aiji, March 26, 2020 7:05 (edited March 26, 2020 7:06)

Because there are pros and cons. The pros are very constraining, while the cons aren't.

First of all, please think carefully of what would be the advantage of a voting system for you? And then, for all of us? And then, for all that are not us?
And first of all, what would a voting system look like?

Here are some cons (with the current system):
- A voting system needs a threshold. What is a good threshold? A majority of some sort? Even if you are in the minority, it doesn't mean that you're wrong.
- Some people could silently sabotage the corpus to defend a stance.
- We have corpus maintainers to take care of correcting sentences. If a sentence is correct, it is correct. If not, we can simply notify them and they will check if it needs to be corrected. There is no need for a voting system in this situation. If they cannot deal with a sentence, they can search the Internet or ask for help; that's what we signed up for :P
- We have a "review" feature with which you can mark a sentence as "OK", "unsure", or "not OK". The current system isn't very well integrated with the proofreading process for now, but we have some ideas to improve it and make proofreading less resource-consuming and more efficient. This system relies on the community to point out sentences that may be wrong, but it does not exclude sentences. We're working towards inclusion, not exclusion of contributions.
- The majority of contributions are correct, and should be treated as such.

hamsolo474, March 27, 2020 1:45

> First of all, please think carefully of what would be the advantage of a voting system for you?
- I get nicer search results

> And then, for all of us?
- Better search result rankings, where the most useful results appeared higher in the list, are what originally separated Google from its competitors. Tatoeba is an amazing resource and we all benefit if search results are improved and it becomes more popular.

> And then, for all that are not us?
- Same as above.

> And first of all, what would a voting system look like?
- A "Did you find this useful?" button. "1234 other people did."

Pros
- ranking of search results
- - Triage for popular sentences
- - Most common uses are likely to be ranked first
- - Near duplicates are likely to be filtered by the community

Cons
- Slight overlap in functionality

I ran into a problem yesterday when I was looking up sentences: I found a few sentences that were correct and understandable but ultimately archaic. They fit the binary mold of technically correct or incorrect, but they were poor examples of good, modern, intelligible English.

Furthermore, I understand that only advanced contributors can tag. With a voting system you could triage the neverending stream of daily sentences so advanced contributors knew which sentences were most valuable to the community, then they could apply their tagging and fixing efforts in the areas it would make the greatest difference.

The corpus could be sabotaged, but the votes don't necessarily indicate sentence validity, only perceived usefulness.
Let's take the word "know", for example.
The most common uses of know are probably in a sentence such as:
"I don't know"
"I know, don't remind me"

Then less common
"I know kung fu"
"I know algebra"

Then less common, perhaps a sentence where someone claims he knows goats in a more biblical sense.

How does a new learner distinguish the difference in meaning between knowing kung fu and biblically knowing goats?

Assuming I've guessed the order of use correctly, knowing goats should be at the bottom of the list, indicating to a new learner that while this is technically correct, it may not be the best sentence to put in your Anki deck.

It should also prevent direct duplicates or near duplicates from showing up next to each other. For example "I know!" and "I know." or "I know the Jacksons" and "I know the Smiths". There are obviously differences between these sentences but a voting system could allow the community to decide between seeing lots of near duplicates on the first page then having to search further for a variety of uses or simply voting up (the presumably wider variety of) sentences they found most useful.

I don't understand why a voting system needs an upper threshold. More votes don't make a sentence more correct; they merely indicate its usefulness to people. We could allow sentences to drop below zero, as an indicator that the sentence is not only not useful but perhaps as a flag that something is wrong with it. Then it would be easily searchable and fixed, and it could be reset to zero upon manual review, but I am not married to the concept of a downvote system.

We could also grant votes (for example maybe 10) for manually reviewed sentences that are correct, I know there is already functionality for this sort of feature but this would be affecting the ranking of search results rather than just the visibility.
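As a rough illustration of the proposal above, here is a minimal sketch of ranking search results by upvotes, with a fixed bonus for sentences that passed manual review. The `Sentence` class, its field names and the `rank` function are hypothetical and do not correspond to Tatoeba's actual data model; the bonus of 10 is the example figure from the post.

```python
from dataclasses import dataclass

REVIEW_BONUS = 10  # example figure from the post; an assumption, not a tuned value


@dataclass
class Sentence:
    text: str
    upvotes: int = 0           # "found this useful" clicks (upvotes only)
    reviewed_ok: bool = False  # True if a manual review marked the sentence as correct


def score(sentence: Sentence) -> int:
    """Upvotes plus a fixed bonus for manually reviewed sentences."""
    return sentence.upvotes + (REVIEW_BONUS if sentence.reviewed_ok else 0)


def rank(results: list[Sentence]) -> list[Sentence]:
    """Order search results from highest to lowest score.

    sorted() is stable, so sentences with equal scores keep their original order.
    """
    return sorted(results, key=score, reverse=True)


if __name__ == "__main__":
    results = [
        Sentence("I know goats.", upvotes=1),
        Sentence("I don't know.", upvotes=40),
        Sentence("I know kung fu.", upvotes=5, reviewed_ok=True),
    ]
    for s in rank(results):
        print(score(s), s.text)
```

Nothing is filtered out in this sketch; votes only change the order, so a sentence with no votes still appears in the results.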

Your cons
- Voting system needs threshold
Why? Is this solved by only having upvotes?

- Potential for corpus sabotage
Is this solved by only having upvotes?

- Overlap with existing functionality
True, but this should aid integration with proofreading by allowing for triage.

- We already have Corpus maintainers
Triage makes their job easier.

- The majority of contributions are correct and should be treated as such.
"Knowing goats" and "knowing kung fu" are both correct sentences but I'm sure one is more useful than the other.

Aiji, March 27, 2020 4:49

You have good points, but most of them show a very common bias. You want Tatoeba to fit a personal use case. Let me explain.
From what you wrote, I could extract one fundamental problem: Help beginner learners of a language to more easily identify what sentences are more common / useful, or on a more general scale, what level of usefulness is provided by a sentence.

Now, that's indeed a very good problem. However, it rests on an erroneous assumption. Tatoeba is not a tool that aims to teach you a language. Oh, it can be used as such, surely. But that is not its fundamental mission. Its fundamental mission is to provide good corpora of sentences (there's a problem with the meaning of "good", but we can discuss that another time). What you describe is a tool (for language learners), while Tatoeba is a source (of data). The tool uses the source in a twisted way to fit its needs, but it cannot ask the source to bend to fit its vision.

Of course, I don't say that the problem you mentioned should be ignored, far from that. But the statement of the problem(s) to be solved shouldn't be biased by "it's easier for learners". Otherwise, you will get people saying: please do this to help my Natural Language Processing algorithm, please implement that because I could use it in the Japanese class I'm teaching.
Now, again, I don't say that those problems should be ignored. However, after listening to these problems, we should try to extract the fundamental issue, free of all biases. I think that is what Trang expressed when she described how we design features.

As a simple illustration, let's take your three pros and let me give exaggerated, simple cons (for the sake of argument):

Pros
- ranking of search results
- - Triage for popular sentences
--> This would bring a vicious (not virtuous) circle of "what's popular gets more popular because it's popular". Tatoeba is not about popularity. Every contribution is considered equal as long as it respects the quality standard. Then, of course, we could make "popularity" optional, but in the end that would only bring more work than if the problem had been tackled another way, a bias-free way, from the beginning.
- - Most common uses are likely to be ranked first
--> Same problem as above. This implies a bias that shouldn't exist. Also, think about American / Australian / British / etc. or French / Canadian / Senegalese / etc. Saying something is most commonly used because more users come from a particular region seems pretty unfair.
- - Near duplicates are likely to be filtered by the community
--> How so? We cannot consider one good enough and the other not. And what if I do want near duplicates? They have value in themselves, even if they are a bother in some situations.

The last point (near duplicates) actually best illustrates the point of view I try to defend in my post(s): if there is a problem inherent to Tatoeba, Tatoeba should try to solve it in a way that is completely independent of any particular use (language learning, NLP algorithm, translation tool, etc.).

Most of your other points could be answered in a similar way. In particular, "usefulness" is a difficult concept to handle, and in your post itself I can see a potential contradiction between "the community would vote for what they think best" and "A is more useful than B".


And to summarize my ideas, let me quickly answer one of your posts below:
- Tags would not help you learn the language or distinguish what is for beginners and what is not, because that is not what they are made for (the functionality suffers from some flaws, but hopefully work will be done soon to improve it).
- If you think the search doesn't allow you to find relevant results (many of us think so), please explain to us why, and we could try improving the search functionality together.

AlanF_US, March 27, 2020 17:16 (edited March 27, 2020 17:19)

I'm always surprised by the assumption that Tatoeba is primarily geared toward beginning language learners, since in my view, it's not particularly well suited to their needs. Beginning learners need guidance, a path to follow. There are much better places to find that elsewhere. I've always considered Tatoeba far more useful to learners who have had some time to get to know the basics of a language and now want to see examples of words or grammatical features in use.

In line with what Thanuir said, it doesn't take long to know what the basic meanings of "know" are, at which point sentences that contain them become much less useful to the learner than sentences with more advanced meanings. Understanding the biblical sense of "know" is not really a necessity for intermediate or even advanced learners, which means it's not a great example of a verb with a gradation of meanings, but since the example has already been used, I'll stick with it.

It doesn't take much familiarity with a language for someone to figure out that "know" is being used in an unfamiliar context, and therefore, the sentence may be of limited applicability. For instance, perhaps the sentence was this:

"For lo, the ewe goat was comely in appearance, hence the shepherd knew her twice upon the hill."

Even if I did not know English well, the presence of rare words like "lo" and "comely" would suggest to me that this was not an ordinary colloquial sentence and therefore, I would not assume that it's a typical example of an utterance I should expect to produce. Nor would it cause me to throw out what I've already learned about more standard usages of the word and conclude that knowing is something one only does to an attractive goat on a hill.

Another point I wanted to raise is that voting for the usefulness of sentences would be an extremely tedious exercise. It would be hard to convince most people to do it, and for good reason, since, as has been said, the vast majority of sentences on Tatoeba are already useful. Therefore, you'd get a small number of people voting, and hence a biased vote. I would try to convince people to put their effort into more constructive areas.

Thanuir, March 27, 2020 6:31

Of the uses you mentioned, only "knowing goats" is new and interesting to me. The database is not only for beginning language learners.

TRANG, March 26, 2020 13:39

> Has there been any progress made on implementing a voting system
> for reliability.

If you are asking about progress since the blog article was published: yes, there has been progress. In 2015, we introduced a feature to review sentences: https://github.com/Tatoeba/tatoeba2/pull/738

It was initially called "Collections" but we recently renamed it to "Reviews". Apart from the name change, this feature has not evolved at all since its introduction. It was introduced as "experimental" and still is today.

> Is the problem that no one is willing to work on the problem from a
> programming sense?

Well, there's a bit of that, but we are not just lacking developers.

There has been a shift in how we design features. It used to be that people would suggest things to change on Tatoeba, and if we felt it was a good idea, we would implement what they suggested. Over time, we learned that this is not a good practice. From that realization, we started trying to first understand the problems, and then design and implement solutions.

So this idea of having some sort of voting system is just a solution. But a solution to which problem exactly? Is the problem really a problem? Does the solution really solve the problem? Maybe, maybe not. We actually don't have clarity on that.

> I would be suprised if I was the first to feel this way in the last 10 years,
> So i would like to know has anyone attempted this before me?

Besides my implementation of the reviews feature, no one else attempted anything. But I would be more than happy if you could assist in pushing this feature out of its "experimental" status.

Before that, though, you said that this voting system is the feature you would most like to see in Tatoeba. Could you elaborate on why that is? What is the problem/frustration that you are facing when using Tatoeba that you think a voting system would solve?

sacredceltic, March 26, 2020 20:19

>So this idea of having some sort of voting system is just a solution. But a solution to which problem exactly? Is the problem really a problem? Does the solution really solve the problem? Maybe, maybe not. We actually don't have clarity on that.

That's exactly it: a solution without a problem. Because then the question still remains: who's going to decide what is right and what is wrong? The supposed "wisdom of the crowd"? Which crowd? Educated crowds or uneducated ones?
If the majority rules, then we will have to accept that the uneducated crowd rules. Is that OK?

So what?!

hamsolo474, 7 days ago (March 27, 2020, 2:01)

Perhaps I wasn't clear in my original post. I agree that the "uneducated crowd" shouldn't dictate what is correct or incorrect, but they could vote on what they found useful and choose not to vote on what they did not find useful. I don't think there is a need for a downvote feature; however, an upvote feature could provide better ranking for search results.

The problems as I see them are:
1. Presently we are getting sentences faster than we are tagging them. Triage sentences for the advanced contributors to tag so that the sentences most popular with the community are ensured to be correct.
2. Uncommon uses are right next to common uses and for beginners it's hard to know which is which as there is no indication of usefulness of the phrase. (See my comment to Aiji above, specifically the bit about the goats.)
3. When I search for common words I find lots of duplicates and near duplicates of words. For example, "I know the Jacksons", "I know the Smiths". Filtering duplicates and near duplicates.
4. Inefficient layout of search results. A voting system would show people the things other people found particularly useful first.

Would you have a problem with a button that said:
"1234 people found this useful. Did you?"

sacredceltic, 7 days ago (March 26, 2020, 20:21)

This post has been hidden because it violates the rules. Only administrators and the author can read it.

sacredceltic, 7 days ago (March 26, 2020, 20:22)

This post has been hidden because it violates the rules. Only administrators and the author can read it.

hamsolo474, 7 days ago (March 27, 2020, 2:11)

Honestly, I'm learning Chinese and many of the characters have multiple meanings, heavily based on context. When I search for an example sentence, I don't know which sentences are common uses of the word or phrase that I'm interested in and which sentences are uncommon, but still correct, uses of it.

I see the problems as:
1. Presently we are getting sentences faster than we are tagging them. Triage sentences for the advanced contributors to tag so that the sentences most popular with the community are ensured to be correct.
2. Uncommon uses are right next to common uses and for beginners it's hard to know which is which as there is no indication of usefulness of the phrase. (See my comment to Aiji above, specifically the bit about the goats.)
3. When I search for common words I find lots of duplicates and near duplicates of words. For example, "I know the Jacksons", "I know the Smiths". Filtering duplicates and near duplicates.
4. Inefficient layout of search results. A voting system would show people the things other people found particularly useful first.

I'll download it and take a look.

sacredceltic, 6 days ago (March 27, 2020, 6:49)

> so that the sentences most popular with the community are ensured to be correct.

Err... precisely NO. A majority of the population makes the same mistakes over and over. That’s why education was invented...
“Most popular” = most wrong, in most cases.

sacredceltic, 6 days ago (March 27, 2020, 6:53)

> A voting system would show people the things other people found particularly useful first.

So what is actually correct would actually disappear from sight...
I see your point.

Yorwba, 6 days ago (March 27, 2020, 20:44)

> I'm learning Chinese and many of the characters have multiple meanings, heavily based on context. When I search for an example sentence I don't know which sentences are common uses of the word or phrase that I'm interested in and which sentences are uncommon, but still correct uses of the word or phrase I'm interested in.

Could you give some examples of Chinese words you've searched where the search results didn't make clear to you which uses were the common ones?

Maybe you wanted to give an English example to make your problem clearer to non-Chinese speakers, but currently there don't seem to be any sentences involving biblically knowing goats https://tatoeba.org/cmn/sentenc...rom=eng&to=und (I half expected someone to have rectified that as a result of this discussion.) so it's not a very good example.

With a specific instance of the problem to look at, finding a solution should be easier. Maybe that solution will involve some kind of voting, maybe we can come up with something else.

AlanF_US, 6 days ago (March 27, 2020, 21:06)

> I half expected someone to have rectified that as a result of this discussion.

Fixed. See:
https://tatoeba.org/eng/sentences/show/8639180

TRANG, 3 days ago (March 30, 2020, 22:25)

Thanks for explaining your problems, hamsolo.

One thing I can say is that the various problems you mentioned are unlikely to be solved with one single solution. I'll go through them one by one and I will have to interrogate you a bit more on some points, if you don't mind.

> When I search for an example sentence I don't know which sentences are
> common uses of the word or phrase that I'm interested in and which
> sentences are uncommon, but still correct uses of the word or phrase
> I'm interested in.

I'll ask you the same thing as Yorwba on this one. Could you give some examples of Chinese words you've searched where the search results didn't make clear to you which uses were the common ones?

> 1. Presently we are getting sentences faster than we are tagging them.
> Triage sentences for the advanced contributors to tag so that the sentences
> most popular with the community are ensured to be correct.

It is true that Tatoeba doesn't provide any way to find sentences based on popularity, and if your preferred way to contribute would be to proofread the most popular sentences, then we wouldn't be able to fulfill your needs for the time being. I'm wondering what your definition of popularity is, though.

We already have a feature that measures popularity to some extent: the favorites (users can favorite a sentence by clicking on the heart icon in the sentence menu).

My questions are:
- Does this "favorite" feature correspond to your definition of popularity, or do we have a different definition of what is popular?
- If this does not measure popularity the way you would like it to be measured, then how exactly would you measure popularity?
- And what difference would it make for you to proofread the most favorited sentences compared to the most popular sentences?

> 2. Uncommon uses are right next to common uses and for beginners it's
> hard to know which is which as there is no indication of usefulness of the
> phrase. (See my comment to Aiji above, specifically the bit about the goats.)

I'm not sure I clearly understood your problem from your example about knowing goats.

In the end, my interpretation is that, as an English speaker learning Chinese at a beginner level, you often have trouble figuring out which sentences would be the most useful to add to your Anki deck when you browse or search Tatoeba.

If that is a correct interpretation of your situation, then perhaps you could explain to us what your workflow is for using Tatoeba to build your Anki deck?

> 3. When I search for common words I find lots of duplicates and near
> duplicates of words. For example, "I know the Jacksons",
> "I know the Smiths". Filtering duplicates and near duplicates.

On the issue of finding lots of near-duplicates, I recommend you set the sort option to "Random" rather than "Relevance" when you search sentences. It can happen that two near-duplicates appear on the same page, but common words usually have 1000+ results. For common words, it would be extremely unlucky for you to have two near duplicates on the same page.
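For anyone who still wants to collapse near-duplicates after downloading a set of results, here is a minimal sketch in Python. It is not how Tatoeba's search works; the function name `near_duplicate_groups` and the rule "two sentences are near-duplicates if they differ in exactly one word" are assumptions made for illustration only.

```python
from collections import defaultdict


def near_duplicate_groups(sentences):
    """Group sentences that become identical once a single word is masked.

    Hypothetical helper: "near-duplicate" here means "same length in words
    and differing in exactly one word position", which is only one possible
    definition.
    """
    buckets = defaultdict(set)
    for sentence in sentences:
        # Ignore trailing punctuation so "Smiths." and "Smiths" match.
        words = sentence.rstrip(".!?").split()
        for i in range(len(words)):
            # Replace one word with a placeholder to form a template.
            template = tuple(words[:i] + ["_"] + words[i + 1:])
            buckets[template].add(sentence)
    # Keep only templates shared by at least two distinct sentences.
    return [sorted(group) for group in buckets.values() if len(group) > 1]


if __name__ == "__main__":
    sample = ["I know the Jacksons.", "I know the Smiths.", "I know algebra."]
    for group in near_duplicate_groups(sample):
        print(group)
```

With the three sample sentences, the two "I know the ..." sentences end up in one group and "I know algebra." is left alone. A learner could then keep one sentence per group.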

> 4. Inefficient layout of search results. A voting system would show
> people the things other people found particularly useful first.

Assuming we use a voting system to measure usefulness of search results, I think we would need to associate each vote to a specific search. A sentence cannot be universally more useful than another. Maybe the sentence "I know algebra" would be useless for someone who searched "know" but would be useful for someone who searched "algebra".

But I think upvoting for useful sentences would be very inefficient compared to reporting bad search results. You would need millions of votes and you wouldn't really be sure that those votes will help. On the other hand, just one person reporting to us that a certain sentence was not useful for a certain search could help us make actual improvements.
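To make the idea of tying feedback to a specific search concrete, here is a rough Python sketch. Nothing like this exists in Tatoeba as far as this thread says; the class `SearchFeedback`, its methods and the sentence IDs are all hypothetical, and the scoring rule (useful votes minus "not useful" reports) is just one simple choice.

```python
from collections import Counter


class SearchFeedback:
    """Hypothetical store of feedback keyed by (search query, sentence ID)."""

    def __init__(self):
        self._useful = Counter()
        self._not_useful = Counter()

    def record(self, query, sentence_id, useful):
        """Record one piece of feedback for a sentence shown for a query."""
        key = (query.lower(), sentence_id)
        if useful:
            self._useful[key] += 1
        else:
            self._not_useful[key] += 1

    def score(self, query, sentence_id):
        """Useful votes minus 'not useful' reports, for this query only."""
        key = (query.lower(), sentence_id)
        return self._useful[key] - self._not_useful[key]

    def rank(self, query, sentence_ids):
        """Order sentence IDs by their score for this query, best first."""
        return sorted(sentence_ids,
                      key=lambda sid: self.score(query, sid),
                      reverse=True)


if __name__ == "__main__":
    feedback = SearchFeedback()
    # Sentence 1 stands for "I know algebra" (made-up ID).
    feedback.record("algebra", 1, useful=True)
    feedback.record("know", 1, useful=False)
    print(feedback.rank("know", [1, 2]))
    print(feedback.rank("algebra", [1, 2]))
```

The same sentence gets a different score depending on the query, which is the point made above: feedback on "I know algebra" for the search "know" says nothing about its usefulness for the search "algebra".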

I feel this 4th problem is in the end the same problem as your 2nd problem. The way sentences are ordered feels inefficient for your task of building an Anki deck.

But if your use case here isn't about trying to build an Anki deck, then it would be helpful to know what the other contexts are in which you have experienced inefficient search results. What did you search exactly and for what purpose did you need to search this? Were you trying to understand the lyrics of a song? Were you trying to write a sentence in Chinese to a Chinese acquaintance?

hamsolo474, 3 days ago (March 31, 2020, 4:36)

Q: I'll ask you the same thing as Yorwba on this one. Could you give some examples of Chinese words you've searched where the search results didn't make clear to you which uses were the common ones?

A: 后来. Yorwba and I recently found out that we interpreted this word in two different ways. In my dictionary and in the Chinese grammar wiki, this is a word that means "afterwards". This is my understanding of it. However, Yorwba told me that 后来 did mean "afterwards" but had additional connotations, making it mean something closer to "afterwards it was surprisingly revealed", and pointed me to a sentence page on Tatoeba where there were admittedly sentences with that connotation.

Fortunately, I had studied this phrase on other websites such as the Chinese grammar wiki, as well as discussed it with my native Chinese girlfriend, and I knew that while this is a possible connotation of the word, it is certainly not the most common meaning.

However, my biggest problem is that I have only studied 后来 and a few hundred other words to the extent where I can be confident in using them and knowing that others will understand them, leaving more than 90% of the other words in Mandarin full of potential ambiguity. So while I have yet to learn something from Tatoeba that is wrong, use it and be corrected, remember where I learned that use from, go back to the source and request that it be corrected, I acknowledge that it is a definite possibility.

Q: - Does this "favorite" feature correspond to your definition of popularity, or do we have a different definition of what is popular?

A: This is likely a good solution to my problem. I will have to play around with it for a while.

Q: If this does not measure popularity the way you wished it was measured, then how exactly would you measure popularity?

A: A visible number next to each sentence saying how many people thought it was a useful sentence. Perhaps we could even show votes from native speakers and from second-language speakers separately.

Q: And what difference would it make for you to proofread the most favorited sentences compared to the most popular sentences?

A: At this point I'm not sure and I'll have to play with the favourite feature a bit.

Q: On the issue of finding lots of near-duplicates, I recommend you set the sort option to "Random" rather than "Relevance" when you search sentences. It can happen that two near-duplicates appear on the same page, but common words usually have 1000+ results. For common words, it would be extremely unlucky for you to have two near duplicates on the same page.

A: This is a good solution.

Q: But I think upvoting for useful sentences would be very inefficient compared to reporting bad search results. You would need millions of votes and you wouldn't really be sure that those votes will help. On the other hand, just one person reporting to us that a certain sentence was not useful for a certain search could help us make actual improvements.

A: As an outsider/newbie, I honestly can't tell whether relying on human maintainers to do this is efficient, and I understand that a sentence/keyword search engine is different from the Google search engine, which looks for sentences/keywords on web pages. However, Google beat Yahoo (which at the time had humans manually review and categorise pages) by relying on an algorithm that was constantly fed new data about the relevance of its results. (They saw which results were the most popular for a given search and gave them priority ranking.) Doing what Google did is likely beyond me, but the first step is understanding what is popular.


Q: What did you search exactly and for what purpose did you need to search this? Were you trying to understand the lyrics of a song? Were you trying to write a sentence in Chinese to a Chinese acquaintance?

A: As I learn new words or phrases in Chinese, I try to make sentences with them, practice using them, etc., and Chinese grammar has a few hard-to-predict differences compared to English grammar. For example, in English we would say that red is almost always an adjective: a red ball, red hair, red paint, a red car. In Chinese, colours are nouns, even when used to describe something like a red ball, red hair, red paint or a red car. This changes the words you can use with them. I personally have trouble remembering whether each word I learn is an adjective, noun, verb, etc. Even in English I just remember context, can produce a few sentences and figure it out, but the fact that "run" or "climb" are verbs is not saved in my memory the way it would be written on the page of a dictionary. I just know how to use them and understand the rules that define the grammar (in English at least). Perhaps it is a flawed approach, but this is also how I am trying to learn Chinese: learn enough sentences and attempt to gain an intuitive understanding of collocations, the same way I do in English.

In short, nothing in particular. I'm just trying to build my mental list of example sentences so I can produce new sentences in the near future. Chinese is hard.

Hybrid, 4 days ago (March 29, 2020, 17:46)

I hope that everyone is doing well despite the virus. Stay strong and far away from each other! 😊

Ricardo14, 4 days ago (March 29, 2020, 21:32)

Thanks, Hybrid! The same for you!

Hybrid, 3 days ago (March 31, 2020, 0:15)

Thank you.

Balamax, 4 days ago (March 29, 2020, 23:05)

Try to stay inside the Solar system. Otherwise the interstellar police will have reasonable questions for you. :)

Hybrid, 3 days ago (March 31, 2020, 0:16)

Thank you. Is going to Pluto allowed or is that outside of the Solar System?

Balamax, 3 days ago (March 31, 2020, 0:27)

It takes about five hours for sunlight to reach Pluto.

gillux, 7 days ago (March 27, 2020, 6:30)

I am publishing a new UX test: https://en.wiki.tatoeba.org/art...show/ux-test-4 This time I’ve performed the test on somebody who is very familiar with Tatoeba already, so I don’t know if I can really call it a UX test. That said, it contains relevant feedback, including about the use of Tatoeba in a teaching environment.

rumpelstilzchen, 6 days ago (March 27, 2020, 18:05)

Thanks for another UX test.

Just a quick comment:
> Having the list as a DOC file would be useful to R. because he can freely edit the text. Currently, he has to copy and paste sentences from the CSV file into a Word document to edit them the way he wants.

I don't use Word (or any word processor) so this may be a strange question, but why can't Word open/import the csv file? This is a simple plain text file.

Guybrush88, 6 days ago (March 27, 2020, 21:25; edited March 27, 2020, 21:33)

Sometimes I tried to open the CSV file with the exported sentences in LibreOffice (the import feature was still working, and I used the exported file to grab the sentences I wanted to mass translate more quickly), and the software always told me that the file was too big and that not everything was shown. Using gedit (on Linux) and Notepad++ (on Windows) worked for my purpose, since they showed all the sentences.

Yorwba, 6 days ago (March 27, 2020, 21:58)

Putting all the exported sentences into a single .DOC file would probably make Word choke, but the excerpt from the UX test report is specifically about exporting lists, where file size is less likely to be a problem.

gillux, 5 days ago (March 28, 2020, 7:29)

I think nothing’s preventing Word from opening the CSV file as plain text. However, I assume the CSV file extension is associated with Excel, so opening it with Word is rather counter-intuitive. I imagine users have to right click → "open with", and then find Word in whatever selection box pops up. Compare this with simply double-clicking on the file.

As pointed out in the test, most users are not familiar with the CSV format to begin with, so they don’t know whether they should open it with Word, Excel or whatever.

To put it another way: of course, if you have the knowledge and the skills you can do whatever you want with whatever file format.

rumpelstilzchen, 5 days ago (March 28, 2020, 12:30)

> As pointed out in the test, most users are not familiar with the CSV format to begin with, so they don’t know whether they should open it with Word, Excel or whatever.

So should we add some info text about how to open the file in a word processor? Or change the file extension to TXT?
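For users who are comfortable running a script, a third option is to convert the export themselves. The following Python sketch is only an illustration under assumptions: it supposes the exported list is tab-separated with the sentence text in the second column, which may not match the actual export layout, and the function name `export_to_plain_text` is made up.

```python
import csv
import io


def export_to_plain_text(csv_text, text_column=1, delimiter="\t"):
    """Return one sentence per line from a delimited export.

    Assumptions (not confirmed by the thread): fields are separated by
    tabs and the sentence text sits in column index 1.
    """
    reader = csv.reader(io.StringIO(csv_text),
                        delimiter=delimiter,
                        quoting=csv.QUOTE_NONE)
    # Skip rows that are too short to contain the text column.
    lines = [row[text_column] for row in reader if len(row) > text_column]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "1\tHello.\n2\tHow are you?\n"
    print(export_to_plain_text(sample), end="")
```

The result is plain text that any word processor can open directly, without the user having to know what CSV is.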

CK, 5 days ago (March 28, 2020, 8:56)

** Tatoeba.org Native Speakers **

http://bit.ly/nativespeakers

Find out who the native speakers are in the languages that you are studying.

This has been updated.