** Is it legal to use CC-BY sentences on Tatoeba.org? **
There seems to be a discrepancy on this page.
Trang says, "Anything that basically doesn't say "You can do absolutely whatever you want with this" is NOT compatible with CC-BY."
However, later she says, "Anything that is under CC-BY is compatible with CC-BY. "
My interpretation is that since people use our data under a CC-BY license, we can't use other people's CC-BY material, because those who use our data can't also do a CC-BY for the other source. Trang's first statement seems to indicate this, since people who release material under CC-BY require that they be given credit for the material and do not grant the right to "do absolutely whatever you want with this."
I've updated the article.
The line you quoted was more of a simplified guideline for people who are not too familiar with licenses. It was also written back in 2011 when we overall had much less experience with licenses. It is obviously not a precise legal statement.
In general, you should avoid making interpretations out of blog posts when it comes to licenses. You should instead read the license text and make your own interpretation based on that text, as it is the original source.
I think you're mistaken and that you should ask a lawyer first if you are going to encourage members to take someone else's CC-BY material and put it into the Tatoeba Corpus, which is distributed under its own CC-BY license, which only requires attribution to tatoeba.org.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
CC-BY is literally designed specifically to work like this. Reusing CC-BY content and making it also CC-BY is ideal. There are more restrictive licenses (not permitted on Tatoeba) that REQUIRE this.
Some licensors choose the BY license, which requires attribution to the creator as the only condition to reuse of the material.
How can someone using the Tatoeba Corpus and properly crediting the Tatoeba Project, know that they also need to give credit to a third party as well? If you don't specifically give credit to the third party, then you would be violating the third party's CC-BY license, I think.
Yes, someone publishing Tatoeba data would have to figure out how to credit the third parties (if they do not want to breach the terms of CC licenses), too, which would be quite challenging.
As shekitten said, CC BY is designed for reuse. No one in their right mind would choose a CC BY license if they didn't want their content to be reused somewhere else. So it is nonsense to say that you cannot reuse CC BY content into another CC BY content.
What you are pointing out is that we may not be doing the attribution properly. So let's break it down:
1) You must give appropriate credit
That is done by adding a comment on the sentence with a link to the original source. I believe that is appropriate enough, but I suppose the best way to be sure is to contact the author to confirm. And indeed, we may not have been very diligent on that, so we could add it as a guideline that when copying from other CC BY content, one should always try to contact the author to ask about attribution. Now if the author doesn't reply, I think we're still very, very safe sticking to adding a comment with a link to the original source.
2) provide a link to the license
This is done indirectly: since we provide a link to the source, the source will have the link to the license. In the comment on Tatoeba, I think mentioning the license name and version is fine and there is no need to additionally put a link to the license itself.
3) and indicate if changes were made
This is done with the logs: whenever someone edits the sentence, the logs indicate when and how the sentence has been modified.
With all of that, I think we're okay.
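To make the three points above concrete, here is a minimal sketch of what a complete attribution comment could contain. The `ExternalSource` record and the `attribution_comment` function are hypothetical names invented for illustration; nothing like this exists in Tatoeba itself, and the sketch includes a direct license link even though that may not be strictly necessary.

```python
from dataclasses import dataclass


@dataclass
class ExternalSource:
    """Hypothetical record describing where a copied sentence came from."""
    author: str
    source_url: str
    license_name: str  # e.g. "CC BY 4.0"
    license_url: str   # direct link to the license text
    modified: bool     # True if the sentence was changed after copying


def attribution_comment(src: ExternalSource) -> str:
    """Build a comment covering credit, license, and change notice."""
    changes = "modified from the original" if src.modified else "none"
    parts = [
        f"Source: {src.source_url} (by {src.author})",     # 1) credit
        f"License: {src.license_name} ({src.license_url})",  # 2) license
        f"Changes: {changes}",                               # 3) changes
    ]
    return "\n".join(parts)


# Example with placeholder values (example.org is not a real source).
example = ExternalSource(
    author="Jane Doe",
    source_url="https://example.org/sentences/1",
    license_name="CC BY 4.0",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    modified=False,
)
print(attribution_comment(example))
```

Whether a comment like this counts as "appropriate credit" is still a matter of interpretation; the sketch only shows that all three items fit in one short comment.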
Now I understand very well that one flaw of CC BY is that if you want to be 100% sure that you're doing attribution properly, it can be very tedious work, because indeed, you would have to drag along all the attributions from previous reuses. And that is actually why we agreed to introduce CC0 when Common Voice approached us. We know how painful that is with CC BY, and we know that CC0 would alleviate this pain. With CC0, there is no need to worry about this whole trail of attribution when content is reused in content that is then again reused, and again reused.
In our case with CC BY though, we are following some common sense and we assume that someone who shares their work under CC BY is okay with indirect attribution. Meaning that if I create CC BY content and you reuse my CC BY content into your own CC BY content and shekitten reuses your CC BY content, I'm okay if shekitten only gives attribution to you and not to me. Because indirectly, I'm still being given attribution (through you).
I think it is a fairly reasonable assumption. But if for some reason, we are copying from someone who is not okay with this concept of indirect attribution, then we can figure out something. We can readapt the way we give credit by adding some warning on the Downloads page about the people who are not okay with indirect attribution so that projects that reuse our content will know that they need to mention these people. But again, we don't have to reject copied sentences from external CC BY sources right off the bat because it's really borderline paranoia to do so.
The link to the license should be direct. (The target website might vanish.) But luckily there already is a direct link to CC-BY license on the sentence page. I have no idea what happens if the original was licensed under a non-French version of the license; is the French license sufficient or should one also link to the original license?
Here is more information on what is proper attribution: https://wiki.creativecommons.or...mparison_chart
In particular, the legality is not a matter of what the author intended or wanted to accomplish, but rather of the license.
In any case, only the original source needs to be attributed, not any intermediate sources.
> is the French license sufficient or should one also link to the original license?
The French license would not be enough. Each version of CC BY should be considered as a different license, even if they are very similar.
For the context, this topic was brought up because of this sentence:
So going from this example, if we want to be absolutely strict about attribution, then we would have to ask shekitten to also post a link to the CC BY 4.0 license (not just mention the license name). And we may have to do other things in order to be 99.999999% safe legally speaking.
> In any case, only the original source needs to be attributed
It's very clear that when possible the original source needs to be attributed. But when content gets mixed and remixed, it can be difficult and confusing to find out who is the very first author. And in such cases, I think we are still safe if we are only attributing to the intermediate source. No one is going to sue Tatoeba because we didn't give them attribution directly, but instead gave attribution to someone who reused their content. They will most likely just let us know and we can update the information when we find out that we were referencing an intermediate source.
My whole paragraph about "indirect attribution" was mostly to argue that there is no imminent danger in unknowingly referencing an intermediate source and therefore we do not need to reject every sentence copied from other CC BY sources (concretely, it doesn't make sense to mark shekitten's Láadan sentences in red).
CK still thinks that no matter what, it is wrong to let contributors copy into Tatoeba sentences from other CC BY sources and that I risk being sued for it. In other words, that we should completely forbid people from copying sentences from other CC BY content.
I think that enforcing such a rule would be unreasonable. I know there is a risk and I know we are not handling the whole legal aspect perfectly, but that's normal considering that we have grown on a very scarce (nearly non-existent) budget and considering that the topic of intellectual property in the internet era is still a fairly new territory.
If anyone would like to help out and investigate the safety of allowing Tatoeba members to copy CC BY sentences and what else we can do at this stage to avoid any risk of lawsuit, I would be infinitely grateful. On my side, I cannot put any more effort into this topic.
I think the issue of how to attribute the sentences - whether to just attribute Tatoeba or to directly attribute the original source - is ultimately up to whoever is making use of Tatoeba data to resolve.
It's indisputable that it is possible to release a CC-BY work (such as Tatoeba) that makes use of other CC-BY works (such as individual sentences), and I'm not the first person who has ever done this. From our end, all we have to do is attribute the original source.
From the POV of the downstream service, that's for them to resolve. I would say the most legal and ethical practice is to attribute both, which is entirely within the realm of possibility on their end and is something they should already be doing.
Personally, I think that there is nothing morally wrong with breaking copyright laws, as they hurt humanity in general and the mission of Tatoeba in particular.
I guess the risk of being sued over the content of Tatoeba is tiny. I think the risk of being sued on a legally sound basis is even tinier, since even copying entire single sentences from a book while breaking the order of sentences is not obviously wrong, AFAIK. (This is not legal advice.)
I read that some computational linguistics researchers, when they want to share a corpus, put the sentences in a random order so that the original work cannot be recovered from it, but their approach is ad hoc and has not been tested in court. I do not remember the source anymore.
I would suggest linking to the source and the license when using CC-BY-licensed content. The effort is not big when one should link to the source anyway and the source should link to the license, so one can simply copy the link from there.
I agree that there's nothing morally wrong with breaking copyright laws, but I still think there is something morally questionable about not attributing a source - particularly in cases where you are making money off of that source.
* Here are 2 statements by shekitten and my comments.
> From the POV of the downstream service, that's for them to resolve ...
I think that when the Tatoeba Project distributes their corpus with the understanding that it's OK to use it if attribution is given to the Tatoeba Project, the implication is that it is free to use with no other restrictions.
> ... but I still think there is something morally questionable about not attributing a source ...
I think when a person releases their material under a CC-BY license that they do it with the expectation that they will receive attribution when their material is reused. I don't think that they expect it to be reused without attribution as suggested in another comment above.
So, regardless of the legal aspect, I would say it's morally wrong to reuse CC-BY material that is going to be redistributed without the required attribution that the person who chose the CC-BY license wants and expects.
* Additional comments.
If it is indeed possible to add CC-BY material to the Tatoeba Corpus and redistribute it under Tatoeba's own CC-BY license, then perhaps TRANG should find all the parallel corpora with CC-BY licenses and import them all into the Tatoeba Corpus, like she did with the public domain Tanaka Corpus.
Also, if this is true, would it mean that anyone who reuses bilingual pairs from http://www.manythings.org/anki in another project would not need to give attribution to the Tatoeba Project anymore?
I think that it would be a lot better and safer for us to not include CC-BY licensed material by others in the Tatoeba Corpus. We have a number of native speakers who can easily add their own material without needing to reuse (steal?) CC-BY material.
There is no "stealing" with copyright breaches, and there is no copyright breach on Tatoeba users' side, here, so please do not use needlessly inflammatory language.
I agree that the current situation is inconvenient for people who would like to republish Tatoeba content. If someone wanted to do that, they would have to figure out a way of identifying which sentences need further attribution, and then either provide that or exclude those sentences.
It would be helpful to those people if the sentences requiring further attribution would be marked somehow, perhaps with a different "license" option: CC-BY license with additional attribution required or something like that as a licensing option, maybe.
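As a rough illustration of that idea, here is a sketch of how someone republishing the data could separate the two groups if such a marker existed. The `external_source` field is purely hypothetical; Tatoeba's exports do not currently carry it, which is exactly the gap being described.

```python
from typing import List, NamedTuple, Optional, Tuple


class Sentence(NamedTuple):
    sentence_id: int
    text: str
    username: str
    # Hypothetical marker: URL of a third-party CC BY source, or None
    # when the contributor wrote the sentence themselves.
    external_source: Optional[str]


def split_by_attribution(
    sentences: List[Sentence],
) -> Tuple[List[Sentence], List[Sentence]]:
    """Return (plain, needs_extra) based on the hypothetical marker."""
    plain = [s for s in sentences if s.external_source is None]
    needs_extra = [s for s in sentences if s.external_source is not None]
    return plain, needs_extra


# Placeholder data for illustration only.
corpus = [
    Sentence(1, "Hello.", "alice", None),
    Sentence(2, "Good morning.", "bob", "https://example.org/lessons"),
]
plain, needs_extra = split_by_attribution(corpus)
print(len(plain), len(needs_extra))
```

A republisher could then either add the extra credit for the second group or leave those sentences out.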
I want to stress that no one said that CC BY material should not be given attribution. It is very, very clear that we should give attribution when reusing CC BY material.
But you are apparently advocating for "viral" attribution and you are also advocating to forbid people from mixing CC BY content just because those who reuse the remix might not give proper attribution. I don't know if you realize that this point of view is also morally questionable and is a creativity killer...
We are going to do our best to be as fair as possible to everyone who is a content creator, but we cannot take measures that are disconnected from reality.
> perhaps TRANG should find all the parallel corpora with CC-BY licenses
> and import them all into the Tatoeba Corpus
I would but I'm not interested in quantity. Tatoeba still has too many flaws and there's really, really a lot of challenges to solve on a software engineering level, on a UI/UX level, on an organizational level... Having more sentences is at the very bottom of my priorities. We are not scalable enough for the corpus to grow much faster than ~2000 sentences per day.
> Also, if this is true, would it mean that anyone who reuses bilingual pairs
> from http://www.manythings.org/anki in another project would not need to
> give attribution to the Tatoeba Project anymore?
Yes, those who reuse the Anki bilingual pairs do not need to give attribution to Tatoeba. These subsets have been processed and reorganized in a different manner than what Tatoeba originally provides. There has been actual work put into reorganizing the data and it's enough work that Tatoeba does not need to be attributed anymore. Giving attribution to manythings.org alone would be completely fine and I would personally find it outrageous if people were forced to also give attribution to Tatoeba.
I still think you're wrong about not needing to credit the person who has released something as CC-BY, if it is then distributed as CC-BY by someone else. I still think that you should not be distributing someone else's CC-BY material to others under your own license, and assuming that it is then OK for others to use that material if they give your website credit, but not give credit to the original person who released material under a CC-BY license. I think this is morally wrong, not following the spirit of CC-BY and a copyright infringement. A copyright owner has the right to control distribution of his/her material. If they choose to distribute their material for just the cost of attribution, their rights are being violated if that is not done.
If that is really what you believe, why do you only give attribution to Tatoeba when you reuse the Tatoeba corpus in your projects instead of giving attribution to every contributor individually?
Or why haven't you protested against the release of the Tatoeba corpus since the beginning?
Each contributor has provided their sentences to Tatoeba under CC BY (or CC0 since early 2019), and Tatoeba is only packaging them into one big corpus.
> If that is really what you believe, why do you only give attribution to Tatoeba when you reuse the Tatoeba corpus in your projects instead of giving attribution to every contributor individually?
I thought that the message on the downloads page meant that developers could use the Tatoeba Corpus if they credited tatoeba.org. Even now, the message on the downloads page implies that.
Actually, most of my projects have a direct link to each sentence's page on tatoeba.org and the username of the owner of the sentence. I think perhaps a couple of projects only include a link to the page on tatoeba.org.
The only project that didn't have that was http://www.manythings.org/anki. I have corrected that today, by inserting one extra field on each line to include attribution. This does add a bit to the file sizes, but shouldn't really bother people too much.
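For anyone wanting to do something similar in their own export, adding an attribution field per line could look roughly like this. This is only a sketch: the tab-separated layout and the wording of the attribution field are assumptions for illustration, not the actual format of the files on manythings.org.

```python
def add_attribution_field(line: str, sentence_id: int, username: str) -> str:
    """Append one attribution field to a tab-separated line.

    The field format "tatoeba.org #<id> (<username>)" is a made-up
    example, not an official or required format.
    """
    fields = line.rstrip("\n").split("\t")
    fields.append(f"tatoeba.org #{sentence_id} ({username})")
    return "\t".join(fields)


# Placeholder sentence pair, ID, and username.
print(add_attribution_field("Hello.\tBonjour.\n", 123, "alice"))
```

Keeping the attribution as the last field means existing tools that only read the first two columns keep working.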
You can see a quick screenshot, so you don't need to download a file.
I have agreed that all the sentences I post on Tatoeba now "belong" to Tatoeba. That said, whatever Tatoeba wants or needs to do with them, I'll have no objection. I truly believe that it's every single user's feeling. :)
Tatoeba is attributing our CC-BY content by our username, and then doing the standard thing that people do with CC-BY work: reusing it with attribution and releasing the whole thing as CC-BY. Using a CC-BY license is a way of giving forwards permission to anyone who wants to reuse your work.
If you really want to be within the letter and arguably the spirit of CC-BY, you should be attributing individual contributors. This is what people are generally supposed to be doing when they reuse content from Wikipedia as well.
It says: "for any text to which you hold the copyright, by submitting it, you agree to license it under the Creative Commons Attribution License 2.0 (fr)."
The whole section about intellectual property describes that.
But the two relevant paragraphs are:
"L’infrastructure technique de Tatoeba utilise par défaut, pour la contribution de phrases textuelles, la licence Creative Commons Attribution 2.0 France (CC-BY 2.0 FR)."
= This is saying that we use CC BY 2.0 FR as the default license.
"Lors de la contribution, sur notre Site Internet, d’une phrase dont vous êtes propriétaire, en votre qualité d’auteur·e, vous attribuez une licence à cette phrase."
= This is saying that when you contribute a sentence that is your own sentence, you are applying a license to this sentence.
If you combine these two paragraphs, the idea is that when someone submits a sentence to Tatoeba, by default, they license it under CC BY 2.0 FR.
Yes, providing attribution is the polite thing to do.
From the English CC-BY 2.0 legal code, https://creativecommons.org/lic...2.0/legalcode. This is just an example; one needs to read the relevant license of what is to be added to Tatoeba if one wants to be sure.
I am quoting or referencing the parts that might be problematic for Tatoeba, or that are otherwise good to know. Not a lawyer, not legal advice, and so on. I am not suggesting any particular way of going forward, here, just trying to figure out what the license exactly says. A native speaker or someone with background in law should go through the text, too.
From part 1, definitions:
"Collective Work" means a work, such as a periodical issue, anthology or encyclopedia, in which the Work in its entirety in unmodified form, along with a number of other contributions, constituting separate and independent works in themselves, are assembled into a collective whole. A work that constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the purposes of this License.
"Derivative Work" means a work based upon the Work or upon the Work and other pre-existing works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which the Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this License. For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License.
Basically, this means that the sentence itself could be part of a collection, but any translations are derivative works, and any edits also create a derivative work. Any larger collection that includes both original CC-BY sentences and their modifications is a derivative work (I think).
This part is 3.b, i.e. rights given to, for example, the Tatoeba project:
"to create and reproduce Derivative Works"
This is 3.d.
"to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly by means of a digital audio transmission Derivative Works. "
Recall that translations are derivative works. I think this implies that every translation of an outside sentence with the CC-BY license should attribute the original source, provide copyright notice (if any), and link to the relevant license. This is currently unfeasible on Tatoeba, since many user interfaces for translating do not suggest the original sentence is under specific attribution requirements.
This part is from under restrictions, 4.a.:
"...You may not offer or impose any terms on the Work that alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties...."
I do not know if Tatoeba is sublicensing the sentences or translations thereof.
I do not know whether all the different CC-BY licenses are compatible enough with CC-BY 2.0 French that we do not "alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder". I suspect it would be better to use the same license as the original.
We also need to add any potential disclaimers of warranties from the source, if any are written there.
Restrictions, 4.b. This is important.
If you distribute, publicly display, [...] the Work or any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work and give the Original Author credit reasonable to the medium or means You are utilizing by conveying the name (or pseudonym if applicable) of the Original Author if supplied; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.
This outlines exactly what information should be included when giving credit to the creator.
First, how to give credit due. It should be "reasonable to the medium or means You are utilizing";
"Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.".
Second, the contents of the credit notice. It should include the name of the Original Author if supplied; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author").
Section 7, termination:
"This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License."