Wall (5,770 threads)
Before asking a question, make sure you have read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussion. Please read our rules against bad behavior.
** Stats - 2020-04-04 - Native Speakers with Contributions **
Or include the 3-letter code for your native language like this for more links.
Here is the same data without the translation links, but with links to the members' profiles instead.
This uses the DataTables jQuery plugin, so you can easily search members' usernames.
** Stats: Members Manually Editing Transcriptions **
** 2020 Daily Contributions Counts by Language **
Date Range of the Data: 2020-01-01 00:00:00 UTC to 2020-04-03 23:59:59 UTC
Here is a variation on the above.
This lists the language that had the most contributions of each day between 2020-01-01 and 2020-04-03.
Hi there !
Where I am currently, my connection is random and not guaranteed. Also, I will not necessarily be able to respond to comments from those who ask me for corrections. The confinement does not allow me to go to repair my computer and I keep my mobile for emergency communications. I hope the best for everyone.
To be relevant, a corpus of example sentences should cover the words that are of most interest in each language.
However, even in English, Tatoeba is still far from achieving this. For example, the word 'compliance' is one of the top 50 most searched for words on the Linguee online dictionary but only has 6 sentences on Tatoeba. These sentences are only translated into seven languages and none of them into languages as important as Spanish, Portuguese or French. At the same time, 'dog' is only the 9357th most searched word on Linguee but has more than 5000 sentences on Tatoeba.
So it seems that we contributors spend too much time creating and translating sentences that will never be read by non-contributing Tatoeba users. On the other hand, we probably don't pay enough attention to the most requested words, such as 'relevant', 'scope', 'enhance' or 'furthermore'.
Perhaps a system that better balances the supply and demand of sentences could be set up. Starting from the most searched words on Linguee or Tatoeba, we could suggest the most underused words to contributors who would like to maximize their impact on the corpus enhancement. For example, for a given language, we could highlight the 100 most requested words that do not have at least 20 sentences translated in the ten most important languages of the corpus.
Has a similar proposal ever been debated? Do you think it would be worth implementing?
Data source : https://www.linguee.com/english...ish/1-200.html
I think that it would be quite worthwhile to find a way to cover frequently requested but underused words. However, I believe that it would be best to do this through coordinated communication rather than by trying to change the site's software. It takes a long time to agree upon and implement software changes. Furthermore, if this initiative is successful, it will remove some of its own basis for existence, another reason we wouldn't want it "baked into" our software.
If you want to easily try searching for these 200 words and phrases, I've set up a page for you to easily do so.
You can get additional links on the page by entering your own native language's code at the end of the following URL, instead of "fra".
By scraping tatoeba.org, I identified the 20 most highly searched words on Linguee that are underused on Tatoeba (fewer than 20 sentences including the given word).
linguee_rank word nb_sentences
17 scope 19
23 compliance 6
44 leverage 12
54 furthermore 11
85 default 14
103 retention 2
107 facilitate 10
141 procurement 16
150 pending 13
205 stakeholder 2
207 vendor 16
259 invoice 17
266 alignment 11
268 incentive 15
277 framework 17
281 align 11
282 equity 7
290 gauge 19
294 venue 7
295 mitigate 12
Maybe CK could build a table from this data so that users can easily add new sentences and translations?
If it gains traction, we could update this list on a weekly basis.
Note that it's possible to get all sentences as a single file from the downloads page https://tatoeba.org/eng/downloads and thanks to rumpelstilzchen's work it's now also easy to only download sentences in a certain language.
"Scraping tatoeba.org" sounds like you wrote a script, so using that file might be a more comfortable way to collect frequency statistics.
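As a rough sketch of the file-based approach, the following counts how many sentences contain each word as an exact, case-insensitive token. It assumes the export is tab-separated with three columns (id, language code, text); the helper name `count_sentences_with_words` and the sample rows are made up for illustration. Unlike the site's search, it does no stemming.

```python
import re

def count_sentences_with_words(lines, lang, words):
    """Count, per word, the sentences in `lang` that contain it as a whole token."""
    wanted = {w.lower() for w in words}
    counts = {w: 0 for w in wanted}
    for line in lines:
        # Assumed layout: id <TAB> language code <TAB> sentence text
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or parts[1] != lang:
            continue
        # Split the sentence into lowercase runs of letters (no stemming)
        tokens = set(re.findall(r"[^\W\d_]+", parts[2].lower()))
        for w in wanted & tokens:
            counts[w] += 1
    return counts

# Invented sample rows standing in for the downloaded file
sample = [
    "1\teng\tCompliance with the rules is mandatory.\n",
    "2\teng\tThe scope of the project is unclear.\n",
    "3\tfra\tLa compliance est un anglicisme.\n",
    "4\teng\tWe check compliance every year.\n",
]
result = count_sentences_with_words(sample, "eng", ["compliance", "scope", "vendor"])
```

With the real file, `lines` could simply be the open file object, so the whole export never has to sit in memory.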
Good point. On the other hand, by scraping the data, you benefit from the Tatoeba advanced search based on text stemming right away.
You'll notice that "scope" actually has more than 19 sentences.
You forgot to include the orphans.
Orphans are excluded automatically on the assumption that they are less trustworthy than owned sentences, even though some are likely better than some non-native owned sentences.
Orphans are included in the exported data.
Would a native English speaker like to adopt the rest of these orphans too, perhaps after editing them?
A few things to consider:
(1) While words with many different senses, like "set", need a wide variety of sentences, even ten sentences is pretty good for getting an idea of how to use a word like "mitigate", as long as the sentences are not near-duplicates of each other. Of course, there are a lot of near-duplicates in our sentences, and the number keeps increasing.
(2) The stemmer misses some correspondences. Thus, a search for "retention" misses "retain", and a search for "retain" finds 36 hits.
(1) It is true that it is difficult to define a threshold suitable for both monosemic and polysemic words. It may be preferable to start with a low threshold at 10 and then raise it to 20 in a second round.
(2) If the stemmer is imperfect, it is probably better to use the exact match. This is the option chosen on this page: https://tatoeba.org/eng/Vocabul..._sentences/eng . Another advantage is that it makes it easier to process the data in the script.
With these parameters, we get the following list for English:
23 compliance 7
44 leverage 9
103 retention 1
107 facilitate 7
141 procurement 1
205 stakeholder 1
206 comprise 9
228 evaluation 9
266 alignment 4
268 incentive 9
281 align 4
282 equity 8
294 venue 5
295 mitigate 2
299 liability 9
314 preliminary 6
321 hub 9
337 offset 2
343 amend 4
344 retrieve 9
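The selection behind a list like this can be sketched in a few lines. This is only an illustration of the thresholding step, not the script that produced the table: the function name `underused` is hypothetical, the counts for 'compliance', 'leverage' and 'retention' are taken from the list above, and the count for 'scope' is a placeholder.

```python
def underused(ranked_words, sentence_counts, threshold=10, limit=20):
    """Return (rank, word, count) for the first `limit` words below `threshold`."""
    picked = []
    for rank, word in ranked_words:
        n = sentence_counts.get(word, 0)  # words never seen count as 0
        if n < threshold:
            picked.append((rank, word, n))
            if len(picked) == limit:
                break
    return picked

# Ranks as in the tables above; the 'scope' count is a placeholder value
ranked = [(17, "scope"), (23, "compliance"), (44, "leverage"), (103, "retention")]
counts = {"scope": 25, "compliance": 7, "leverage": 9, "retention": 1}
shortlist = underused(ranked, counts, threshold=10)
```

Raising `threshold` to 20 in a second round, as suggested, only requires changing the argument.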
I think Linguee and Tatoeba are too different as projects to be compared directly.
You are talking about "words that are of most interest". But to who? It depends on the person, on your mother tongue and culture, on what you are looking for etc. Does having more people querying a word really make that word more important? (open question)
You are talking about "languages as important as Spanish, Portuguese or French", but what makes them so important? Again, it depends on the person, on your mother tongue and culture, on what you are looking for.
Linguee is good at answering the demand of the market of translations, but that is a specific and restricted view on languages. For example, Linguee will help me a lot if I have to translate legal papers, but it won’t help me if my job is to create subtitles. Tatoeba has a broader approach that makes it, by design, less efficient on specific fields.
If I am a translator working with English, Spanish or French, I’d rather use Linguee than Tatoeba to help me with a translation job. In contrast, if I am a beginner learning Esperanto, or even Portuguese, I’d rather use Tatoeba. Or if I’m looking for sentence pairs I can legally re-use in my online courses or learning app, I’d rather use Tatoeba.
Note that I am not trying to denigrate your comment. To the contrary, I think you are raising important questions. I just want to show that you (as well as everyone, including me) have a bias about what is "relevant" or "important" or "should be on Tatoeba".
You are right, Linguee is different from Tatoeba. If I used Linguee's rankings, it's for lack of a better one. Of course, if Tatoeba was mainly used to translate Estonian subtitles into Esperanto, then more Estonian sentences would have to be created. ^^
By the way, it would also be interesting to know which words are the most searched for on Tatoeba. Is such data compiled somewhere?
In any case, business vocabulary is not well covered by the contributors, and this is detrimental to the completeness of the corpus. I am pretty confident that for a developer "looking for sentence pairs I can legally re-use in my online courses or learning app", exhaustiveness, or at least broad diversity, is not a detail.
We have logs of all queries made to the search engine up to one year ago, but it’s just raw data and it’s not publicly available. I can anonymize this data and put it online if you want to play around with it.
Having access to this data would be great. It would enable me to update my table with even more pertinent data.
I compiled the search query logs into a CSV file: https://downloads.tatoeba.org/s...ueries.csv.bz2
Note that for technical reasons each query is reported twice, so you want to divide numbers by two.
I didn’t include empty searches. I also only included queries on page 1 (that is to say, browsing page 2, 3, 4 etc. of the search results is not counted as actual queries).
Here are all those queries counted.
You'll have to divide the number by 2 yourself.
https://all.imfast.io/queries_c...umber_by_2.zip (13 MB file)
I sorted them by counts, then by language.
count + tab + language_code + tab + query
Also, here are files with only English.
Only the English queries, with the actual counts (already divided by 2)
https://all.imfast.io/eng-queries.zip (1.7 MB)
Only the English queries without spaces, with the actual counts (already divided by 2)
https://all.imfast.io/eng-queri...out-spaces.zip (4 MB)
Note that some lines with "eng" were not actually English.
I also converted all queries to lowercase before making the counts.
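The counting steps described above (skip empty searches, lowercase, count, halve, sort) might look roughly like this. The layout of the raw log is not shown in the thread, so this sketch takes already-parsed `(language_code, query)` pairs; the function name `count_queries` and the sample entries are invented, and sorting by descending count is an assumption.

```python
from collections import Counter

def count_queries(entries):
    """Return 'count<TAB>language_code<TAB>query' lines from (lang, query) pairs."""
    # Lowercase before counting; skip empty or whitespace-only searches
    counts = Counter((lang, query.lower()) for lang, query in entries if query.strip())
    # Each query is logged twice, so halve the raw counts
    rows = [(n // 2, lang, query) for (lang, query), n in counts.items()]
    # Sort by count (highest first, an assumption), then by language, then by query
    rows.sort(key=lambda r: (-r[0], r[1], r[2]))
    return [f"{n}\t{lang}\t{query}" for n, lang, query in rows]

# Invented sample entries, each real search appearing twice
entries = [
    ("eng", "Hello"), ("eng", "Hello"),
    ("eng", "hello"), ("eng", "hello"),
    ("fra", "bonjour"), ("fra", "bonjour"),
    ("eng", ""),
]
lines = count_queries(entries)
```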
Thank you, gillux, for responding so quickly. I have started to analyze the data and I can already confirm that the ranking of the English queries is very different from the Linguee one. Here is the top 10:
There is still a lot of data cleaning to be done to be able to find the queries that don't get sentences on Tatoeba, but I think we will come up with something usable. I'll keep you posted.
From an English learner's point of view, lists such as the NGSL are likely much more useful.
Here is an easy way to browse sentences from the Tatoeba Corpus in 2014.
At the time I built this set of pages, List 907 had a little over 300,000 sentences and 123,278 of these had audio.
There are now over 750,000 sentences on List 907 and over 490,000 English sentences with audio. My plan is to rebuild these pages after I get to 500,000 English sentences with audio.
A different target audience, probably. The NGSL probably starts with the most common words (which are easy to learn from anywhere), whereas the source mentioned in the opening post seems to cover words whose meaning is unclear to people. Presumably, the people interested in those have already got past the problems of beginner words.
I'm not exactly sure what you're saying since machine translation doesn't always correctly translate things, but perhaps you missed several of the other lists on that site, and the fact that my 2014 set of pages listed some of the upper-level word families numbered-sequentially after the main general list word families.
NEW ACADEMIC WORD LIST
TOEIC WORD LIST
BUSINESS SERVICE LIST
Briefly, for those who don't want to read the whole NGSL website...
A person mastering the 2,368 "word families" (not just single words) should be able to understand up to 92% of what they read.
A word family is something like this.
friend, friends, friendly, unfriendly, friendless, friendship
If you also can master the additional 960 word families on the NAWL, you should be able to handle college textbooks, academic journals, etc. About 92% coverage, since you will also need to learn vocabulary specific to your field of study.
The Business Service List is similar to the NAWL, except it's aimed at business English. An additional 1,700 "word families" over the 2,368 on the NGSL should give you about 97% coverage.
The main purpose of such lists is to help people studying English focus on the basic "word families" that are most frequently used, so people can get to the point where they can understand a very high percentage as quickly as possible.
(This is just a brief explanation. Read the website for more details.)
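To make the coverage figures concrete, here is a minimal sketch of how such a percentage can be measured for a given text. It is not how the NGSL authors computed their numbers; the function name `coverage` is hypothetical, and apart from the "friend" family quoted above, the families in the example are made up.

```python
import re

def coverage(text, families):
    """Fraction of word tokens in `text` that belong to any known word family."""
    # Flatten all families into one set of known word forms
    known = {form for forms in families.values() for form in forms}
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in known for t in tokens) / len(tokens)

families = {
    "friend": {"friend", "friends", "friendly", "unfriendly", "friendless", "friendship"},
    "be": {"be", "is", "are", "was", "were"},  # illustrative family
    "my": {"my"},                              # illustrative family
}
# Four of the five tokens are covered; "unusually" is not in any family
ratio = coverage("My friends are unusually friendly.", families)
```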
I see. I glanced at your site and it mostly showed words from the general list. The other lists do seem to be there too, but their vocabulary also looks fairly general, which is presumably their purpose.
And of course, you are welcome to add more sentences based on your idea of what words should be better covered on Tatoeba. ^^
That's what I started doing.
With this suggestion, I'm just trying to contribute by publishing landmarks for those who lack inspiration and would like to make sure that they create or translate sentences that people want to read.
Just two points:
- You could add vocabulary items for words that you want more represented. Right now, the vocabulary feature is difficult to use efficiently (but it will change!) but that way you can directly follow the number of sentences for each of the words you're interested in and see what sentences exist. While your limit was 20, the limit for "sentences wanted" is 10, meaning that words appearing in more than ten sentences will not appear on the "wanted" page.
- Please do not add sentences from Linguee directly ^^ (except those whose source is compatible with Tatoeba)
I'm glad you mention this feature and learn that it will be improved soon. I'm already using it and I like it a lot. While browsing it for different languages, I noticed that it could perhaps be improved in two ways.
- Sometimes the requested expressions are formulated in such a way that the request cannot be fulfilled: spelling mistakes, use of the equal sign to specify the meaning of the word... If requests had a limited lifetime, these problematic expressions would expire naturally.
- I also noticed that for some languages, this feature is not used much. Maybe the lists could be filled automatically with underused expressions that were searched for on Tatoeba at least x times during the last y days by different users. This option would have the advantage of better connecting ordinary users with the corpus contributors, and it would be applicable to all languages.
Thanks to your feedback, I have created a new tool that can be found at https://tatominer.imfast.io.
It will help contributors who want to diversify Tatoeba's vocabulary while paying special attention to expressions that are often searched for by users of bilingual dictionaries.
This is nice. Would it be possible to provide a drop-down to select the language? Most people will probably only be interested in one language at a time. While it's possible to sort by language and then page to the place where sentences in the desired language start, it's a cumbersome process.
There is a search box on the top-right corner of the table that enables you to filter content. For example, typing 'eng' will filter all English sentences.
That works, but it's not the behavior I would have expected. For one thing, searching for "eng" also brings up non-English words containing "eng".
That's the default for the datatables jQuery plugin.
If you want to find English sentences, use the sort on the language column and then browse for "eng.".
I was just experimenting with this plugin a couple of days ago to display words from the New Academic Word List (NAWL).
I moved it to here.
New Academic Word List (NAWL) with Definitions in English and Japanese
And, also created this one.
New General Service Word List (NGSL) with Definitions in English and Japanese
> That's the default for the datatables jQuery plugin.
Whether or not it's the default for the underlying software, expected behavior for a UI is for a search field with typed text to search for text that can vary freely, while a drop-down is used to search for constrained text. Not only does using the same search field for both violate my expectation, but as I said, it mixes up the results. If I search for "eng", I find, in addition to the English words, the French word "engendrer". Similarly, a search for "fra" brings up a Portuguese and a German term that happen to contain the string "fra". It's only a matter of luck that this particular vocabulary list and list of language identifiers don't overlap more.
Naturally, the tool can be used in its current form, and I appreciate the fact that Ibdx set it up for us. I'm just suggesting that separating the specification of the language from the specification of the entry would make the tool more usable.
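The difference between the two behaviors can be shown with a small sketch. This is plain Python illustrating the logic only, not the DataTables API; the function names and the sample rows are invented, apart from "engendrer", which is the example mentioned above.

```python
# Invented sample rows: a language code and a vocabulary entry
rows = [
    {"lang": "eng", "word": "compliance"},
    {"lang": "fra", "word": "engendrer"},
    {"lang": "eng", "word": "scope"},
]

def global_search(rows, text):
    """Free-text search: keep rows where any column contains `text`."""
    return [r for r in rows if any(text in value for value in r.values())]

def by_language(rows, lang):
    """Constrained filter: keep rows whose language code equals `lang`."""
    return [r for r in rows if r["lang"] == lang]

mixed = global_search(rows, "eng")   # also matches the French "engendrer"
english = by_language(rows, "eng")   # only the two English rows
```

A drop-down would correspond to `by_language`, leaving the text field free for searching the entries themselves.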
I may have a look at it from time to time. How often do you plan to update the data? (If you plan to at all.)
That's a good tool to get some inspiration sometimes.
[EDIT] Also note that some of your entries are incorrect, but that's not an error on your side. For example, there are obvious ones like "linguee" or "so" and the ones erroneous because of linguee, like "oeuvre" that should be "œuvre"
Updating this web page is not really an issue for me since the process is fully automated. As both Tatoeba download page and Linguee rankings are updated weekly, a new version every Saturday seems doable.
As for the few errors, you rightly noticed that they come from the Linguee data. Sadly, I don’t see any easy way to spot them programmatically for now.
I added sentences for 50 of the words.
Thank you, the quality of your sentences is of great benefit to Tatoeba.
** Stats: 2020-04-04-and-2019-04-04.png **
I would like to have some example sentences using the words 'good-man' and 'good-wife'. Based on context the meaning might differ from that of 'a good man/wife'.
1. I added them to my vocabulary. However, the search engine seems to consider 'good man' and 'good-man' as identical expressions. Is there a way to add 'good-man' to my vocabulary?
Remark: In Finnish, when writing a compound word, if the first word ends and the last word begins with the same vowel, we write it thusly: 'ala-aste', 'linja-auto'. Writing 'ala aste' would mean two separate words (and typically be a mistake), though there might be cases where it would be correct language and would have a different meaning. As such, it would be nice if the search engine did not confuse these kinds of expressions, which mean different things.
2. I would appreciate examples of 'good-wife' and 'good-man' in the corpus. Examples illustrating whether these are the same as 'a good wife' (or man) would be particularly appreciated.
As you discovered, the search engine currently doesn’t make any difference between 'good-man' and 'good man'. If the hyphen were to be treated like a normal character, it means sentences with 'good-man' wouldn’t show up when searching for 'good,' 'man' or any word sharing the same stem.
That said, our search engine (Manticore) has a feature I believe is exactly what we need: blended characters . This feature would allow a sentence containing 'good-man' to be found by searching for 'good-man' as well as 'good' or 'man'.
Now my question is: should we treat the hyphen character as a blended character in all languages by default, or only in Finnish? I feel like other languages such as French or English could benefit from it, but I wonder if it could cause any harm in other languages I don’t know.
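To illustrate what treating the hyphen as a blended character would mean for matching, here is a toy indexer. It only mimics the idea described above and says nothing about how Manticore implements or configures the feature; the function name `index_terms` and its `blend_chars` parameter are hypothetical.

```python
import re

def index_terms(text, blend_chars="-"):
    """Return the set of terms a toy index would store for `text`."""
    terms = set()
    blend = re.escape(blend_chars)
    # A chunk is a run of word characters and blended characters
    for chunk in re.findall(rf"[\w{blend}]+", text.lower()):
        chunk = chunk.strip(blend_chars)
        if not chunk:
            continue
        terms.add(chunk)  # the whole form, e.g. "good-man"
        # ...plus each part on its own, e.g. "good" and "man"
        terms.update(p for p in re.split(rf"[{blend}]+", chunk) if p)
    return terms

hyphenated = index_terms("He is a good-man.")
spaced = index_terms("He is a good man.")
```

With this indexing, a search for 'good-man' would match only the hyphenated sentence, while 'good' or 'man' would still match both.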
By the way, in Finnish, isn’t the colon character also used in a similar way, when you want to decline an abbreviation or something? I’m talking about this: https://en.wikipedia.org/wiki/C...ffix_separator
The Wikipedia article seems correct. For the definitive source, see https://www.kielikello.fi/-/kaksoispiste- . Note that colon is not used to combine separate words. (Colon also has other uses such as with quotes.)
Example of the genitive:
metri – metrin
m – m:n
Norwegian (bokmål) uses some kind of dash to the same effect, so that the definite singular of tv is tv-en.
The same kind of dash is used to bind together words in some compound words, it looks like: https://no.wikipedia.org/wiki/Bindestrek
I think that blended characters would not help me here, but it might otherwise be a fine idea. I am not sufficiently knowledgeable about Manticore to have a strong opinion here.
On an unrelated note: Gmail indicated the email notification of your response as suspect. It was sent from the address firstname.lastname@example.org .
from: noreply <email@example.com>
date: 3 Apr 2020, 09:20
subject: Tatoeba - gillux has replied to you on the Wall
mailed-by: gmail.com
signed-by: gmail.com
security: Standard encryption (TLS)
>On an unrelated note: Gmail indicated the email notification of your response as suspect. It was sent from the address firstname.lastname@example.org .
That happened with me too.
For those who wonder what’s with the contributions from FixHashesCommand, first I would like to apologize for polluting the "latest contributions" table. This is just a one-time maintenance operation that affected 2960 sentences.
It’s the result of rumpelstilzchen’s work to fix duplicate merging issues with some sentences. FixHashesCommand is a bot that edited the sentences in order to normalize the text that otherwise would have been stored in different but visually identical ways.
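As an illustration of how two visually identical sentences can end up with different hashes, and how normalizing first avoids that, consider the sketch below. The thread does not say which normalization form or hash function Tatoeba uses, so NFC and SHA-1 here are assumptions, and `text_hash` is a hypothetical helper.

```python
import hashlib
import unicodedata

def text_hash(text, normalize=True):
    """Hash a sentence, optionally normalizing it to NFC first (assumed form)."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

composed = "caf\u00e9"      # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

raw_match = text_hash(composed, normalize=False) == text_hash(decomposed, normalize=False)
normalized_match = text_hash(composed) == text_hash(decomposed)
```

Without normalization the two strings hash differently, so a duplicate check based on the hash would miss them.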
The dev team doesn't stop for anything! Thanks, everyone!
The content of this message goes against our rules and has therefore been hidden. It is visible only to administrators and to the author of the message.
Since we have specified in our profiles that we wish to write our sentences under the CC0 1.0 license, for the most part at least, why do we still have to specify each time that a sentence is indeed under the CC0 1.0 license and not under the CC BY 2.0 FR license?
As a result, I feel that I do double work when I contribute to Tatoeba, since translations are assumed to be under the CC BY 2.0 FR license unless specified otherwise.
Maybe it's not anybody's fault, even if you are an administrator, but it is one of two things: either the Kabyles who continue to do double work like me are masochists, or somebody wants to dissuade people from writing under the CC0 1.0 license, which is to say the Kabyles are masochists and agree to pay the price for having supported their flag. In any case, no one can speak of fair play. Thus, the masochist AmarMecheri continues, no matter what, to defend his endangered language.
I really apologize if I wrote that because I get it wrong all the time and get confused every time.
The answer is unfortunately still pretty much the same as last year:
> I understand your frustration. To answer the question of "why",
> the reason is that nobody has worked on improving this yet.
We are working on involving more developers, but we're still overly understaffed for the huge amount of work that it takes to maintain and improve Tatoeba.
Thank you for answering so obligingly when I was perhaps somewhat "impatient and irritated". It's a pity that I'm not versed in computing, otherwise I would have helped to solve the problem, which remains unresolved. I will wait patiently, then. Fair winds to all!
So Tom_Facts was suspended. Kinda sad, isn't it? Obviously an alt account made to post joke sentences, which may technically be against the rules, but those sentences were all linguistically sound, just didn't make ordinary sense, for example this one #8639644. It was a nice set of jokes that in a way hearkened back to "Tom and Mary in the land of sentences" from the olden days. I am very much not happy that those funny sentences are rendered illegitimate and closed for further translations. Could an admin please unblock them, since they are grammatically correct and understandable, regardless of not making sense in the real world.
It was not for offensive reasons, nor because some don't make sense, but for legal reasons.
See my comment here:
In case this situation is saddening more people, then please know that there is a technical solution: https://github.com/Tatoeba/tatoeba2/issues/1659
But somebody has to volunteer to implement it.
Until that is implemented, we unfortunately cannot accept contributions when they were very obviously mass-copied from a certain source, and that source is not compatible with CC BY.
Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/ and see if they would be willing to allow reuse of their content.
> Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/ and see if they would be willing to allow reuse of their content.
That's what I'm going to do. It's just a joke after all, figures they wouldn't be stuck-up about it. I think preserving the sentences in their original form, with Chuck Norris as the protagonist, would be better though, so if this goes well, I suppose what we do is you set them free (of Tom_Facts' "ownership"), I adopt them and mass edit Tom to Chuck Norris and edit my translations accordingly and ask deniko to edit his and I think no one else translated those sentences other than the PIN code one. Bit tedious but there are only 192 sentences, it wouldn't take that long.
You would have to convince them to officially add their licensing terms on the website. If you find a contact point via email, please include email@example.com in CC.
I would not be as confident as you that they would agree to allow reuse of their content under a CC BY compatible license. In the end it depends on what their ad revenue is.
Making their content CC BY compatible means that anyone can technically and legally just pump their content and create a website that directly competes with them. As a result, there will be a risk that fewer and fewer people visit their website. If they make little to no money from ads, they won't care about this risk. But if they make a significant amount of money, I would be extremely surprised if they agreed to take such a risk.
In that case, if we keep 'Tom', wouldn't it help solve the issue if the site admins agreed to accept derivative content and not direct copying? That's different, since a hypothetical "not Chuck Norris facts" site obviously couldn't compete, as people know it's supposed to be about Chuck Norris and would think something's fishy.
I don't see why it has to be an official request, really... I mean, we're just normal people having fun, not sleazy lawyers. It's my personal initiative/request and I've sent it as such. If I'm granted any permission I'll have the email for proof, no problem.
> In that case, if we keep 'Tom', wouldn't it help solve the issue if the
> site admins agreed to accept derivative content and not direct copying.
No, it wouldn't be compatible with CC BY. We wouldn't be able to keep these sentences as part of the corpus that we distribute because of this scenario:
- They allow derivatives of their jokes.
- We also allow derivatives of our sentences (that's the nature of CC BY).
- Someone copies sentences from Tatoeba into their website and changes every sentence with "Tom" to "Chuck Norris".
- Someone has then indirectly copied sentences from the Chuck Norris facts website.
> I don't see why it has to be an official request, really...
Because it would be irresponsible and disrespectful to ignore intellectual property.
> - They allow derivatives of their jokes.
> - We also allow derivatives of our sentences (that's the nature of CC BY).
> - Someone copies sentences from Tatoeba into their website and changes every sentence with "Tom" to "Chuck Norris".
> - Someone has then indirectly copied sentences from the Chuck Norris facts website.
Are you expecting that to happen, really.
You didn't use to be so concerned with lawyerese I reckon, sad times if you have to fear someone could cry havoc over this, because nobody in their right mind would, but sick minds specialise at giving sound minds a headache.
> Because it would be irresponsible and disrespectful to ignore intellectual property.
That's no answer as to why there would be any need for officialdom. I'm communicating as myself, a private person not a legal entity. All I need is to make sure the admin(s) of that site approve of adding their sentences here and have no mind to object to it.
Chuck Norris jokes weren't invented by that website, too, it's the other way. They've added new jokes over the years, obviously, but most of those seem to have been submitted by others. How much of that copyright actually holds, hmm.
> You didn't use to be so concerned with lawyerese I reckon
I am not concerned when the sample of copied content is so minimal that we can still make the assumption that the contributor came up with the sentences on their own.
I have however always been concerned whenever we could very clearly identify the original source of the copied content.
> That's no answer as to why there would be any need for officialdom.
I will try to explain more clearly.
As long as *all* our sentences are exported into a dataset that we distribute under CC BY, it would be irresponsible and disrespectful to ignore intellectual property. We, Tatoeba, have some liability with regard to what is inside this dataset.
Yes, it is crowdsourced and it is not possible for us to monitor every single sentence. But that doesn't take away from us the responsibility to try our best to be compliant with the laws. If we can't do it, we have to stop releasing our content under CC BY. But I'm not going to sacrifice open data just for Chuck Norris jokes.
Personally I'm not going to be nitpicky about a handful of non-CC BY sentences. But there's a point where it starts to be a bit too much. Having close to 200 sentences gathered under the same account makes it very, very obvious what the source is. And when it becomes very obvious what the source is, it is too much.
Now again, the main blocking point is that *all* our sentences are exported into a dataset that we distribute under CC BY. If we could somehow exclude some sentences from our exports, that would solve the problem. Well it turns out that sentences set as "unapproved" are not exported. So that has been our temporary solution for keeping non CC BY sentences in Tatoeba.
But this poses another problem: we will have sentences in red even though they are correct. To solve that, we have a technical solution: https://github.com/Tatoeba/tatoeba2/issues/1659.
Once that is implemented, we can much more safely allow people to copy sentences from other sources because we can easily remove obvious non CC BY content from our CC BY dataset.
The content will no longer be labelled as "CC BY" data, it will just be on our website labelled as "no license" or "unknown license" or whatever else that makes it clear it's not for reuse (or if one chooses to reuse, it will be at their own risk).
We will still encourage everyone to cite the sources, to give attribution where it's due. We will also still remove content from Tatoeba if the original authors ask us to do so. But we don't have to worry about license compatibility.
Also, please be aware that there is intellectual work in creating a collection of sentences, even if each sentence of the collection is individually free of intellectual property.
For instance you can create a list of "1000 most common sentences". If you take each sentence individually, you can't argue that you own these sentences, millions of people have used them before you. But if someone was to take your exact list and publish in their language learning website as "1000 sentences for beginners", then they basically ripped off your work. Because it is intellectual work to come up with criteria for selecting what is more common.
In the case of Chuck Norris facts, they built a website, they have people accepting or rejecting the submissions, they set up a whole infrastructure to collect and share these jokes. It *is* work, it is intellectual value. They cannot claim that each joke belongs to them, but they can claim that the collection of jokes belongs to them. Not only that, but they have ads on their website, so it is also money. Personally, when money is involved, I don't want to make any sort of optimistic assumption.
I hope this clarifies things.
Thanks for taking the time to present a detailed account. It's not like I had zero understanding, but having an explanation laid out is quite helpful; nice of you to write one up.
This analogy doesn't seem to apply to Tatoeba as it is since sentences are added individually though? I mean, I *could* add 1000 common sentences from such a list as you describe, and this couldn't be copyright violation since nobody owns any of those sentences individually. It seems there's no violation as long as I don't combine them into a list similar to the original one; only then am I arguably using someone else's work.
I have to agree that adding those sentences under an alt account set up specifically for that purpose makes it too obvious. But if a regularly contributing account/user, such as me, added common sentences from a certain source along with other sentences so their existence as a relatively large set wouldn't be visible, it would be realistically no problemo, hmm? Since it only becomes a real problem when you can easily trace those sentences and conceive of them as a single list, rather than stumble upon individual items that do not suggest a bigger list exists. That is, if those sentences aren't of the sort that makes you think right away they might come from a single author, which Internet memes are not, as anyone may want to create more phrases with the same jocular pattern.
Though I realize you're probably posing your question hypothetically, it sounds to me like "Who cares whether it's wrong as long as we can come up with a scheme for getting away with it?" Aside from the ethical problem, it seems like a bad idea to rely on our speculations, as non-lawyers, about what might or might not get us in trouble with the law. In any case, I don't think Tatoeba is so hard up for sentences that we need to steal them.
> Who cares whether it's wrong as long as we can come up with a scheme for getting away with it?
You mean legally wrong. Common jokes shouldn't be copyrightable if we're talking morally, but oh well.
Let me add a pinch of explanation because I think it's important. The short answer is: no, it won't work. It could even worsen things, because a lawyer could accuse you of being fully aware of your copyright infringement and trying to hide it (aggravating circumstance). It's like saying "I launder my dirty 2 million, but only 2 000 at a time. The authorities won't catch me so it's safe."
The scenario is simple. Suppose I own the copyright on contents of some sort, here, sentences. I build a script to search the web for potential copyright infringement (if I have a lot of money, I pay for such a script). Above a certain threshold (chosen more or less arbitrarily), it is decided that "potential" becomes "very likely". The two common ways to proceed from here are the following:
- I'm relatively a nice person so I contact you, explain the problem and ask you to kindly deal with the situation.
- I don't care at all, and demand the removal of the copyrighted content, otherwise I'll send you my lawyer.
That's how it works on YouTube, for example. Copyright owners won't care who the content creators are; they'll strike their videos, period. You're just a science communicator who makes instructive videos? Well, too bad you used 17 seconds of Black Eyed Peas. Talking about life in the U.S.? Well, don't use the Mario main theme.
Sometimes the content creator will make a phone call, find a reasonable person and reach a happy outcome (or they belong to a powerful network that can negotiate...), sometimes they won't.
And don't think that popular stuff, folklore, or something that everybody knows is safe. That's how copyright scammers work on YouTube. They claim ownership of any piece of content that is not copyrighted to be able to strike videos and get the corresponding share of the money. Sometimes they even try to claim copyright over content that was officially released for free use...
Of course, there's no need to be paranoid. But being aware that some people are merciless helps one stay cautious. Since Tatoeba asks people to add their own sentences, it's normal to expect that people will be cautious about adding copyrighted (or possibly copyrighted) content.
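The detection scenario described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the exact-match comparison after normalization, and the 0.5 threshold are all hypothetical, and a real detection tool would likely use fuzzier matching.

```python
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def overlap_ratio(protected: Iterable[str], candidate: Iterable[str]) -> float:
    """Share of the protected collection that also appears on the candidate site."""
    protected_set = {normalize(s) for s in protected}
    if not protected_set:
        return 0.0
    candidate_set = {normalize(s) for s in candidate}
    return len(protected_set & candidate_set) / len(protected_set)


def classify(ratio: float, threshold: float = 0.5) -> str:
    """Above the (arbitrary, hypothetical) threshold, 'potential' becomes 'very likely'."""
    return "very likely" if ratio >= threshold else "potential"


# Toy data: 3 of the 4 protected sentences were copied and mixed in
# with 100 unrelated sentences on the candidate site.
protected = ["Joke one.", "Joke two.", "Joke three.", "Joke four."]
candidate = ["joke one.", "Joke  two.", "Joke three."] + [
    f"Unrelated sentence {i}." for i in range(100)
]
ratio = overlap_ratio(protected, candidate)
verdict = classify(ratio)
```

Note that the ratio is measured against the protected collection, not against the size of the candidate site, so spreading the copied sentences among many unrelated ones does not lower it.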
> This analogy doesn't seem to apply to Tatoeba as it is since sentences are added individually though? I mean, I *could* add 1000 common sentences from such a list as you describe, and this couldn't be copyright violation since nobody owns any of those sentences individually. It seems there's no violation as long as I don't combine them into a list similar to the original one; only then am I arguably using someone else's work.
Have you ever heard about "Database right"?: https://en.wikipedia.org/wiki/Database_right
As for changing back "Tom" to "Chuck Norris", I would recommend against it. Keeping Tom is okay; these are jokes that have been customized for Tatoeba.
On a side note, I would like to point out that before Tom, there was Christopher Columbus:
> I would like to point out that before Tom, there was Christopher Columbus:
I don't think this is true.
In English, the first "Columbus" sentence is #35544. The first "Tom" sentence was #1780. There were 34 other "Tom" sentences with numbers under 35544.
There are 652 sentences with Tom under #467000 and 18 "Columbus" sentences.
The first "Christopher Columbus" sentence is #536592. There are 671 "Tom" sentences with numbers under this.
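Counts like the ones above can be reproduced with a few lines of Python. This is a sketch only: the `(id, text)` pairs below are made-up toy data, not real Tatoeba sentences, and `count_below` is a hypothetical helper name.

```python
import re
from typing import Iterable, Tuple


def count_below(sentences: Iterable[Tuple[int, str]], word: str, max_id: int) -> int:
    """Count sentences with an ID below max_id that contain `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return sum(
        1 for sentence_id, text in sentences
        if sentence_id < max_id and pattern.search(text)
    )


# Toy data for illustration only.
toy_sentences = [
    (10, "Tom is here."),
    (20, "Tomorrow is Monday."),   # "Tomorrow" must not count as "Tom"
    (30, "Columbus sailed west."),
    (40, "Tom met Columbus."),
    (50, "Tom left."),
]
tom_before_40 = count_below(toy_sentences, "Tom", 40)
columbus_before_50 = count_below(toy_sentences, "Columbus", 50)
```

The word-boundary pattern is there so that words such as "Tomorrow" are not counted as "Tom".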
I didn't mean that there was no sentence with Tom before. I meant that when it comes to making jokes with a protagonist who has superhuman abilities, this trend appeared in Tatoeba with "Christopher Columbus" before "Tom".
> Alternatively, someone could try to negotiate with the owners of https://chucknorrisfacts.net/
These jokes about Chuck Norris are basically modern folklore, I feel like it would be super weird for someone to claim any copyright on them. I'm sure chucknorrisfacts.net is not the only website that collects this folklore. Are you sure they were copied from there and they're not part of Tom_Facts's own collection? The jokes are literally everywhere.
I remember reading jokes about him in this style long before chucknorrisfacts.net was created (year 2017, according to whois lookup).
A few might be found here under a CC license: https://commons.wikimedia.org/w...related_images
Just because the picture of a sentence is CC licensed, doesn't make the sentence itself CC licensed.
This file seems to be CC BY-licensed: https://commons.wikimedia.org/w...indow_sign.jpg
The title of the file contains the sentence. (The picture does, too.) The title of the file is a part of the file. Hence, the title seems to be CC-licensed, too. Or am I missing something?
Reusing sentences that can be seen in pictures that were published under CC BY can give us a safety net, but it relies on the assumption that whoever uploaded these pictures was knowledgeable enough about intellectual property.
If @Tom_Facts (or anyone) can justify that there is a legit, creative or intellectually demanding process behind finding and converting Chuck Norris facts into Tom facts, then (to me at least) it would be a better defense than saying "I extracted these jokes from CC BY pictures".
If the process is "I'm browsing random Chuck Norris facts on some website(s) and I take those that I like, replace the name with Tom and add them to Tatoeba", the intellectual added value is... too minimal.
I'm not familiar with French copyright law; the best Finnish source I found is this: http://www.kysy.fi/kysymys/voik...kijanoikeuksia
It doesn't mention any court cases on the threshold of originality for jokes. For many jokes the copyright has expired because of age, and many are part of general folklore. "Pikku Kalle" jokes and "A Finn, a Swede and a Norwegian" jokes probably belong to these categories, but "Chuck Norris" jokes may be too recent.
(This is one example of how copyright as a concept harms and hampers creative, useful and partly also scientific work, just as monopolies are generally harmful. Copyright should be abolished or at least greatly weakened. It is harmful to culture and science.)
The problem isn't about whether a single joke is protected by copyright or not (it most likely isn't, like the majority of the sentences on Tatoeba, IMHO).
The problem is scraping a lot of jokes/sentences from another database whose license we don't know, because databases can be considered creative works by themselves, independent of whether their items are creative works or just facts (i.e. protected by copyright or not).
Now the question is whether compiling a list of jokes is considered a creative work. I'm pretty sure it is in Europe, which has rather strict database laws (see the Wikipedia article I've mentioned in another message). But even in the US, which as far as I know doesn't have such strict laws, it seems likely, as the following passage shows:
"An example of a database that is protected as a compilation would be a database of selected quotations from U.S. Presidents. The individual quotations themselves may or may not be subject to copyright protection. However, the selection of the quotations involves enough original, creative expression that it is protected by copyright. Therefore, a database of quotations will be protected by copyright as a compilation even though some of the quotations are not protected." (from https://www.bitlaw.com/copyright/database.html )
So as long as nobody can reliably prove that the owner of chucknorrisfacts.net is ok with scraping many or all of the jokes, I think it's better to stop adding them.
I agree that the collection as a whole, or in significant part, may be protected by copyright, and that using it without permission is impolite.
First, please take the time to fully read my reply to Ooneykcall:
> I remember reading jokes about him in this style long before
> chucknorrisfacts.net was created (year 2017, according to whois lookup).
Note that the domain chucknorrisfacts.net may have been registered in 2017, but the website itself has existed since 2008 (according to their footer). It was probably under another domain before 2017.
If the "2008" indicated in the footer is a lie and they have existed only since 2017, then they did a great job at replicating website design from 15 years ago...
> Are you sure they were copied from there and they're not part of Tom_Facts's
> own collection?
@Tom_Facts did not explicitly tell me they copied from chucknorrisfacts.net and I do not possess secret government agency surveillance tools, nor do I possess mind reading powers. So no, I'm not sure.
But when I check some of the sentences, I'm getting results like these: https://imgur.com/a/zzLQJQL. I can't help but be very suspicious that a large part of those jokes are copied from chucknorrisfacts.net, as it happens to be the common denominator in these results.
I'm very well aware that these jokes are everywhere. I know many of these jokes, I laughed at many of these jokes, and I wished just as much as many people here that we didn't have to worry, ever, about copyright or licensing or intellectual property.
But that won't change the fact that inserting these jokes on a large scale poses a legal risk for Tatoeba. The more that are added, the more the risk grows.
If some content has been copied and re-copied, it wouldn't necessarily and magically become CC BY compatible. So even if @Tom_Facts didn't copy directly from chucknorrisfacts.net, but from a website that itself copied from chucknorrisfacts.net, then it can still be a problem.
And I will insist again: the main problem is not that these jokes are published on Tatoeba, the main problem is that they are *incorrectly licensed under CC BY*.
Is there any platform or blog that publishes Chuck Norris facts under a Creative Commons license? No. There's none. So we cannot be thinking "Well, everyone else uses these jokes, why can't we?". We have different legal constraints, that's why we can't.
Now you are lucky, Andreas (aka. @rumpelstilzchen) has volunteered to tackle the issue https://github.com/Tatoeba/tatoeba2/issues/1659. But until it is ready and deployed, please (everyone), don't copy more jokes and memes into Tatoeba. Don't translate them if you see any that might be legally shady. Ask people to stop if you see anyone doing it. Just wait till we have the proper features in place, or just create your own jokes instead. I have no time to play the cop so I count on everyone's cooperation. Thanks.
What's New on Tatoeba? - Your weekly recap °10
A small week it was. Some discussions here, some updates of code there, everybody needs a rest sometimes. Stay tuned next week for nice new features!
ON THE WALL
※ hamsolo474 started a thread leading to discussing various topics, such as the quality of a sentence and good search results https://tatoeba.org/fra/wall/show_message/34604
※ gillux performed a new UX test, this time on a user very familiar with Tatoeba. You can also help us by performing this kind of test and writing a summary :) https://tatoeba.org/fra/wall/show_message/34618
※ Ooneykcall started a thread that led to a discussion of copyright and licensing https://tatoeba.org/fra/wall/show_message/34630
※ Ibdx discussed adding more sentences containing words whose meanings are often looked for https://tatoeba.org/fra/wall/show_message/34645
CONTRIBUTIONS AND LANGUAGES
※ 15 689 sentences added this week. You can check daily activity on this page https://tatoeba.org/eng/contrib...ivity_timeline
※ This week, two languages were added, bringing the number of languages on Tatoeba to 355! Thanks to Ricardo14 and gillux for coordinating this.
On zorgzikhnit's request, Bislama has been added https://en.wikipedia.org/wiki/Bislama
On MarijnKp's request, Saterland Frisian has been added https://en.wikipedia.org/wiki/Saterland_Frisian
※ Some of our members helped translate the website (credited using Transifex usernames). In arbitrary order:
gorkaazk, elenacristina260, herrsilen, SAmiri, Les90, yanis.batura, Aiji, gillux, arh, MarijnKp, RyckRichards, fjay69, Gulo_Luscus, robin0van0der0vliet, shekitten, Mohsin_Ali, Yorwba, easononizuka, maxine22zhang, Silja, Thanuir, Guybrush88, 58karel, michel.smts2, robin0van0der0vliet, sabretou, small_snow
If you'd like to help with the development of Tatoeba, report issues, or are just curious, have a look at the GitHub repository: https://github.com/Tatoeba/tatoeba2
If you want to help us translate the website to your language, you can join us on Transifex: https://www.transifex.com/tatoe...ite/dashboard/ and check this article on the wiki https://en.wiki.tatoeba.org/art...ce-translation
Fun fact: Mary Poppins’ “supercalifragilisticexpialidocious” actually appears in some dictionaries.
Last week recap: https://tatoeba.org/fra/wall/show_message/34592
See this recap on the blog: https://blog.tatoeba.org/2020/0...kly-recap.html