Wall (5,574 threads)
Before asking a question, make sure to read the FAQ.
We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.
** Is it legal to use CC-BY sentences on Tatoeba.org? **
There seems to be a discrepancy on this page.
Trang says, "Anything that basically doesn't say "You can do absolutely whatever you want with this" is NOT compatible with CC-BY."
However, later she says, "Anything that is under CC-BY is compatible with CC-BY. "
My interpretation is that, since people use our data under a CC-BY license, we can't use other people's CC-BY material, because those who use our data can't also give CC-BY credit to the other source. Trang's first statement seems to indicate this, since people who release material under CC-BY require that they be given credit for the material and do not grant the right to "do absolutely whatever you want with this."
I've updated the article.
The line you quoted was more of a simplified guideline for people who are not too familiar with licenses. It was also written back in 2011 when we overall had much less experience with licenses. It is obviously not a precise legal statement.
In general, you should avoid making interpretations out of blog posts when it comes to licenses. You should instead read the license text and make your own interpretation based on that text, as it is the original source.
I think you're mistaken and that you should ask a lawyer first if you are going to encourage members to take someone else's CC-BY material and put it into the Tatoeba Corpus, which is distributed under its own CC-BY license, which only requires attribution to tatoeba.org.
Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
CC-BY is literally designed specifically to work like this. Reusing CC-BY content and making it also CC-BY is ideal. There are more restrictive licenses (not permitted on Tatoeba) that REQUIRE this.
Some licensors choose the BY license, which requires attribution to the creator as the only condition to reuse of the material.
How can someone using the Tatoeba Corpus and properly crediting the Tatoeba Project, know that they also need to give credit to a third party as well? If you don't specifically give credit to the third party, then you would be violating the third party's CC-BY license, I think.
Yes, someone publishing Tatoeba data would have to figure out how to credit the third parties (if they do not want to breach the terms of CC licenses), too, which would be quite challenging.
As shekitten said, CC BY is designed for reuse. No one in their right mind would choose a CC BY license if they didn't want their content to be reused somewhere else. So it is nonsense to say that you cannot reuse CC BY content into another CC BY content.
What you are pointing out is that we may not be doing the attribution properly. So let's break it down:
1) You must give appropriate credit
That is done by adding a comment on the sentence with a link to the original source. I believe that is appropriate enough, but I suppose the best way to be sure is to contact the author to confirm. And indeed, we may not have been very diligent on that, so we could add it as a guideline that when copying from other CC BY content, one should always try to contact the author to ask about attribution. Now if the author doesn't reply, I think we're still very, very safe sticking to adding a comment with a link to the original source.
2) provide a link to the license
This is done indirectly: since we provide a link to the source, the source will have the link to the license. In the comment on Tatoeba, I think mentioning the license name and version is fine and there is no need to additionally put a link to the license itself.
3) and indicate if changes were made
This is done with the logs: whenever someone edits the sentence, the logs indicate when and how the sentence has been modified.
With all of that, I think we're okay.
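To make the three points above concrete, here is a minimal sketch of how such a sentence comment could be put together. It is only an illustration: the `SourceInfo` record, its field names, and the example URL are assumptions made up for this sketch, not anything Tatoeba actually implements.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    """Hypothetical record describing where a copied sentence came from."""
    source_url: str          # link to the original source (point 1)
    license_name: str        # license name and version, e.g. "CC BY 4.0" (point 2)
    author: str = ""         # optional, only if the source names an author
    modified: bool = False   # whether the sentence was changed when copied (point 3)


def attribution_comment(info: SourceInfo) -> str:
    """Build the text of a comment that covers credit, license, and changes."""
    lines = [f"Source: {info.source_url}"]
    if info.author:
        lines.append(f"Author: {info.author}")
    lines.append(f"License: {info.license_name}")
    lines.append("Changes: modified from the original" if info.modified
                 else "Changes: none")
    return "\n".join(lines)


# Example with a made-up source URL.
print(attribution_comment(SourceInfo("https://example.org/sentences/1", "CC BY 4.0")))
```

Later edits would still be covered by the sentence logs, as described above; the comment only records the state at the time of copying.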
Now I understand very well that one flaw of CC BY is that if you want to be 100% sure that you're doing attribution properly, it can be very tedious work, because indeed, you would have to drag along all the attributions from previous reuses. And that is actually why we agreed to introduce CC0 when Common Voice approached us. We know how painful that is with CC BY, and we know that CC0 would alleviate this pain. With CC0, there is no need to worry about this whole trail of attribution when content is reused in content that is then again reused, and again reused.
In our case with CC BY though, we are following some common sense and we assume that someone who shares their work under CC BY is okay with indirect attribution. Meaning that if I create CC BY content and you reuse my CC BY content into your own CC BY content and shekitten reuses your CC BY content, I'm okay if shekitten only gives attribution to you and not to me. Because indirectly, I'm still being given attribution (through you).
I think it is a fairly reasonable assumption. But if for some reason, we are copying from someone who is not okay with this concept of indirect attribution, then we can figure out something. We can readapt the way we give credit by adding some warning on the Downloads page about the people who are not okay with indirect attribution so that projects that reuse our content will know that they need to mention these people. But again, we don't have to reject copied sentences from external CC BY sources right off the bat because it's really borderline paranoia to do so.
The link to the license should be direct. (The target website might vanish.) But luckily there already is a direct link to CC-BY license on the sentence page. I have no idea what happens if the original was licensed under a non-French version of the license; is the French license sufficient or should one also link to the original license?
Here is more information on what is proper attribution: https://wiki.creativecommons.or...mparison_chart
In particular, the legality is not a matter of what the author intended or wanted to accomplish, but rather of the license.
In any case, only the original source needs to be attributed, not any intermediate sources.
> is the French license sufficient or should one also link to the original license?
The French license would not be enough. Each version of CC BY should be considered as a different license, even if they are very similar.
For the context, this topic was brought up because of this sentence:
So going from this example, if we want to be absolutely strict about attribution, then we would have to ask shekitten to also post a link to the CC BY 4.0 license (not just mention the license name). And we may have to do other things in order to be 99.999999% safe legally speaking.
> In any case, only the original source needs to be attributed
It's very clear that when possible the original source needs to be attributed. But when content gets mixed and remixed, it can be difficult and confusing to find out who is the very first author. And in such cases, I think we are still safe if we are only attributing to the intermediate source. No one is going to sue Tatoeba because we didn't give them attribution directly, but instead gave attribution to someone who reused their content. They will most likely just let us know and we can update the information when we find out that we were referencing an intermediate source.
My whole paragraph about "indirect attribution" was mostly to argue that there is no imminent danger in unknowingly referencing an intermediate source and that we therefore do not need to reject every sentence copied from other CC BY sources (concretely, it doesn't make sense to mark shekitten's Láadan sentences in red).
CK still thinks that no matter what, it is wrong to let contributors copy into Tatoeba sentences from other CC BY sources and that I risk being sued for it. In other words, that we should completely forbid people from copying sentences from other CC BY content.
I think that enforcing such a rule would be unreasonable. I know there is a risk and I know we are not handling the whole legal aspect perfectly, but that's normal considering that we have grown on a very scarce (nearly non-existent) budget and considering that the topic of intellectual property in the internet era is still a fairly new territory.
If anyone would like to help out and investigate the safety of allowing Tatoeba members to copy CC BY sentences and what else we can do at this stage to avoid any risk of lawsuit, I would be infinitely grateful. On my side, I cannot put any more effort into this topic.
I think the issue of how to attribute the sentences - whether to just attribute Tatoeba or to directly attribute the original source - is ultimately up to whoever is making use of Tatoeba data to resolve.
It's indisputable that it is possible to release a CC-BY work (such as Tatoeba) that makes use of other CC-BY works (such as individual sentences), and I'm not the first person who has ever done this. From our end, all we have to do is attribute the original source.
From the POV of the downstream service, that's for them to resolve. I would say the most legal and ethical practice is to attribute both, which is entirely within the realm of possibility on their end and is something they should already be doing.
Personally, I think that there is nothing morally wrong with breaking copyright laws, as they hurt humanity in general and the mission of Tatoeba in particular.
I guess the risk to be sued due to the content of Tatoeba is tiny. I think the risk to be sued on a legally sound basis is even tinier, since even copying entire single sentences from a book while breaking the order of sentences is not obviously wrong, AFAIK. (This is not legal advice.)
I read that some computational linguistics researchers, when they want to share a corpus, put the sentences in a random order so that the original work can not be recovered from there, but their approach is ad hoc and has not been tested in court. I do not remember the source anymore.
I would suggest linking to the source and the license when using CC-BY-licensed content. The effort is not big when one should link to the source anyway and the source should link to the license, so one can simply copy the link from there.
I agree that there's nothing morally wrong with breaking copyright laws, but I still think there is something morally questionable about not attributing a source - particularly in cases where you are making money off of that source.
* Here are 2 statements by shekitten and my comments.
> From the POV of the downstream service, that's for them to resolve ...
I think that when the Tatoeba Project distributes their corpus with the understanding that it's OK to use it if attribution is given to the Tatoeba Project, the implication is that it is free to use with no other restrictions.
> ... but I still think there is something morally questionable about not attributing a source ...
I think when a person releases their material under a CC-BY license that they do it with the expectation that they will receive attribution when their material is reused. I don't think that they expect it to be reused without attribution as suggested in another comment above.
So, regardless of the legal aspect, I would say it's morally wrong to reuse CC-BY material that is going to be redistributed without the required attribution that the person who chose the CC-BY license wants and expects.
* Additional comments.
If it is indeed possible to add CC-BY material to the Tatoeba Corpus and redistribute it under Tatoeba's own CC-BY license, then perhaps TRANG should find all the parallel corpora with CC-BY licenses and import them all into the Tatoeba Corpus, like she did with the public domain Tanaka Corpus.
Also, if this is true, would it mean that anyone who reuses bilingual pairs from http://www.manythings.org/anki in another project would not need to give attribution to the Tatoeba Project anymore?
I think that it would be a lot better and safer for us to not include CC-BY licensed material by others in the Tatoeba Corpus. We have a number of native speakers who can easily add their own material without needing to reuse (steal?) CC-BY material.
There is no "stealing" with copyright breaches, and there is no copyright breach on Tatoeba users' side, here, so please do not use needlessly inflammatory language.
I agree that the current situation is inconvenient for people who would like to republish Tatoeba content. If someone wanted to do that, they would have to figure out a way of identifying which sentences need further attribution, and then either provide that or exclude those sentences.
It would be helpful to those people if the sentences requiring further attribution would be marked somehow, perhaps with a different "license" option: CC-BY license with additional attribution required or something like that as a licensing option, maybe.
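As a rough sketch of what such a marking would allow downstream, the snippet below splits a list of sentences into those that only need the usual Tatoeba credit and those that carry an extra attribution requirement. The `extra_attribution` field is purely hypothetical; no such field exists in the Tatoeba exports as far as this thread says.

```python
from typing import Iterable, List, Optional, Tuple, TypedDict


class Sentence(TypedDict):
    id: int
    text: str
    # Hypothetical marker: the third-party source to credit, or None.
    extra_attribution: Optional[str]


def split_by_attribution(
    sentences: Iterable[Sentence],
) -> Tuple[List[Sentence], List[Sentence]]:
    """Return (plain, needs_credit) so a republisher can credit or exclude."""
    plain: List[Sentence] = []
    needs_credit: List[Sentence] = []
    for sentence in sentences:
        if sentence.get("extra_attribution"):
            needs_credit.append(sentence)
        else:
            plain.append(sentence)
    return plain, needs_credit
```

A republisher could then either publish the second list together with its credits or leave it out entirely, which are the two options mentioned above.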
I want to stress that no one said that CC BY material should not be given attribution. It is very, very clear that we should give attribution when reusing CC BY material.
But you are apparently advocating for "viral" attribution and you are also advocating to forbid people from mixing CC BY content just because those who reuse the remix might not give proper attribution. I don't know if you realize that this point of view is also morally questionable and is a creativity killer...
We are going to do our best to be as fair as possible to everyone who is a content creator, but we cannot take measures that are disconnected from reality.
> perhaps TRANG should find all the parallel corpora with CC-BY licenses
> and import them all into the Tatoeba Corpus
I would but I'm not interested in quantity. Tatoeba still has too many flaws and there's really, really a lot of challenges to solve on a software engineering level, on a UI/UX level, on an organizational level... Having more sentences is at the very bottom of my priorities. We are not scalable enough for the corpus to grow much faster than ~2000 sentences per day.
> Also, if this is true, would it mean that anyone who reuses bilingual pairs
> from http://www.manythings.org/anki in another project would not need to
> give attribution to the Tatoeba Project anymore?
Yes, those who reuse the Anki bilingual pairs do not need to give attribution to Tatoeba. These subsets have been processed and reorganized in a different manner than what Tatoeba originally provides. There has been actual work put into reorganizing the data and it's enough work that Tatoeba does not need to be attributed anymore. Giving attribution to manythings.org alone would be completely fine and I would personally find it outrageous if people were forced to also give attribution to Tatoeba.
I still think you're wrong about not needing to credit the person who has released something as CC-BY, if it is then distributed as CC-BY by someone else. I still think that you should not be distributing someone else's CC-BY material to others under your own license, and assuming that it is then OK for others to use that material if they give your website credit, but not give credit to the original person who released material under a CC-BY license. I think this is morally wrong, not in the spirit of CC-BY, and a copyright infringement. A copyright owner has the right to control distribution of his/her material. If they choose to distribute their material for just the cost of attribution, their rights are being violated if that is not done.
If that is really what you believe, why do you only give attribution to Tatoeba when you reuse the Tatoeba corpus in your projects instead of giving attribution to every contributor individually?
Or why haven't you protested against the release of the Tatoeba corpus since the beginning?
Each contributor has provided their sentences to Tatoeba under CC BY (or CC0 since early 2019), and Tatoeba is only packaging them into one big corpus.
> If that is really what you believe, why do you only give attribution to Tatoeba when you reuse the Tatoeba corpus in your projects instead of giving attribution to every contributor individually?
I thought that the message on the downloads page meant that developers could use the Tatoeba Corpus if they credited tatoeba.org. Even now, the message on the downloads page implies that.
Actually, most of my projects have a direct link to each sentence's page on tatoeba.org and the username of the owner of sentence. I think perhaps a couple of projects only include a link to the page on tatoeba.org.
The only project that didn't have that was http://www.manythings.org/anki. I have corrected that today, by inserting one extra field on each line to include attribution. This does add a bit to the file sizes, but shouldn't really bother people too much.
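For anyone wanting to do something similar in their own project, appending an attribution field to each tab-separated line could look roughly like the sketch below. The exact wording of the credit string, the function name, and the sample sentence ID and username are assumptions for illustration; the real files may use a different layout.

```python
def add_attribution_field(pair_line: str, sentence_id: int, username: str) -> str:
    """Append one extra tab-separated field crediting the sentence owner.

    The credit format here is an assumption for this sketch, not
    necessarily the format used in the actual download files.
    """
    text = pair_line.rstrip("\n")
    credit = f"CC-BY 2.0 (France) Attribution: tatoeba.org #{sentence_id} ({username})"
    return f"{text}\t{credit}"


# Example with a made-up sentence ID and username.
print(add_attribution_field("Hello.\tBonjour.", 12345, "example_user"))
```

Because the credit is just one more field at the end of the line, existing importers that only read the first two fields should keep working.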
You can see a quick screenshot, so you don't need to download a file.
I have agreed that all the sentences I post on Tatoeba now "belong" to Tatoeba. That said, whatever Tatoeba wants or needs to do with them, I'll have no objection. I truly believe that every single user feels the same. :)
Tatoeba is attributing our CC-BY content by our username, and then doing the standard thing that people do with CC-BY work: reusing it with attribution and releasing the whole thing as CC-BY. Using a CC-BY license is a way of giving forwards permission to anyone who wants to reuse your work.
If you really want to be within the letter and arguably the spirit of CC-BY, you should be attributing individual contributors. This is what people are generally supposed to be doing when they reuse content from Wikipedia as well.
It says: "for any text to which you hold the copyright, by submitting it, you agree to license it under the Creative Commons Attribution License 2.0 (fr)."
The whole section about intellectual property describes that.
But the two relevant paragraphs are:
"L’infrastructure technique de Tatoeba utilise par défaut, pour la contribution de phrases textuelles, la licence Creative Commons Attribution 2.0 France (CC-BY 2.0 FR)."
= This is saying that we use CC BY 2.0 FR as the default license.
"Lors de la contribution, sur notre Site Internet, d’une phrase dont vous êtes propriétaire, en votre qualité d’auteur·e, vous attribuez une licence à cette phrase."
= This is saying that when you contribute a sentence that is your own sentence, you are applying a license to this sentence.
If you combine these two paragraphs, the idea is that when someone submits a sentence to Tatoeba, by default, they license it under CC BY 2.0 FR.
Yes, providing attribution is the polite thing to do.
From the English CC-BY 2.0 legal code, https://creativecommons.org/lic...2.0/legalcode. This is just an example; one needs to read the relevant license of what is to be added to Tatoeba if one wants to be sure.
I am quoting or referencing the parts that might be problematic for Tatoeba, or that are otherwise good to know. Not a lawyer, not legal advice, and so on. I am not suggesting any particular way of going forward, here, just trying to figure out what the license exactly says. A native speaker or someone with background in law should go through the text, too.
From part 1, definitions:
"Collective Work" means a work, such as a periodical issue, anthology or encyclopedia, in which the Work in its entirety in unmodified form, along with a number of other contributions, constituting separate and independent works in themselves, are assembled into a collective whole. A work that constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the purposes of this License.
"Derivative Work" means a work based upon the Work or upon the Work and other pre-existing works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which the Work may be recast, transformed, or adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this License. For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this License.
Basically, this means that the sentence itself could be part of a collection, but any translations are derivative works, and any edits also create a derivative work. Any larger collection that includes both original CC-BY sentences and their modifications is a derivative work (I think).
This part is 3.b, i.e. rights given to, for example, the Tatoeba project:
"to create and reproduce Derivative Works"
This is 3.d.
"to distribute copies or phonorecords of, display publicly, perform publicly, and perform publicly by means of a digital audio transmission Derivative Works. "
Recall that translations are derivative works. I think this implies that every translation of an outside sentence with the CC-BY license should attribute the original source, provide the copyright notice (if any), and link to the relevant license. This is currently unfeasible on Tatoeba, since many user interfaces for translating do not indicate that the original sentence is under specific attribution requirements.
This part is from under restrictions, 4.a.:
"...You may not offer or impose any terms on the Work that alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties...."
I do not know if Tatoeba is sublicensing the sentences or translations thereof.
I do not know whether all the different CC-BY licenses are compatible enough with CC-BY 2.0 French that we do not "alter or restrict the terms of this License or the recipients' exercise of the rights granted hereunder". I suspect it would be better to use the same license as the original.
We also need to add any potential disclaimers of warranties from the source, if any are written there.
Restrictions, 4.b. This is important.
If you distribute, publicly display, [...] the Work or any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work and give the Original Author credit reasonable to the medium or means You are utilizing by conveying the name (or pseudonym if applicable) of the Original Author if supplied; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.
This outlines exactly what information should be included when giving credit to the creator.
First, how to give credit due. It should be "reasonable to the medium or means You are utilizing";
"Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.".
Second, the contents of the credit notice. It should include "the name of the Original Author if supplied; the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author")".
Section 7, termination:
"This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License."
[not needed anymore- removed by CK]
Replying to this thread rather than starting a new one because the topic is related.
The "top five language stats" seem to occupy too much space. The fields for every language seem to be too high, resulting in too much line spacing. The font seems to be too large. This is what it looks like for me: https://imgur.com/a/mIuNdpq.
I wonder if it wouldn't be better to either remove it completely or make it (much) smaller and display more languages instead. The numbers of the top five languages aren't really so interesting and don't show a lot of diversity. Making the languages displayed random would be another option.
[not needed anymore- removed by CK]
Good suggestions worth considering. That would indeed be more interesting. 🙂
I would like it if the Top 5 languages listed the top 5 contributed-to languages of the day, or maybe the week, as opposed to an unchanging all-time list.
> The "top five language stats" seem to occupy too much space.
This is part of the transition to the responsive UI. You will see that in general, anything with a list of clickable items will take more space. This was already done for:
- the list of tags
- the links on the sidebar on the profile page
My guess is that the space it takes is not really the problem itself, but rather that the information displayed is too irrelevant for you.
> I wonder if it wouldn't be better to either remove it completely
There was an issue about it that I already wanted to solve several weeks ago:
I aborted my plans because the whole Kabyle discussion was a bit too overwhelming and I had no time to really think carefully about what this block can be replaced with.
I had two ideas in mind:
- Displaying the same stats that are displayed on the homepage for non-authenticated users.
- Displaying the top 5 languages, but limited to the languages that the user has added in their profile.
Removing the whole block was also something I considered, but doing that would remove the possibility to access stats of all languages. The link to https://tatoeba.org/eng/stats/s...es_by_language is only available from the stats block and I wasn't sure where is the best place to put it if this block was to be removed.
I immediately noticed it for the links in the sidebar on my profile page. I was hoping it was only temporary because I use that sidebar a lot and it now requires me to scroll up and down a lot because of the immense spacing. I didn't think that could be intentional. I would probably have kept silent if I hadn't noticed that design choice spreading, in this case to the top five languages.
I think all suggestions made for the latter so far would be improvements to what there is now.
On the other hand, the field where you write comments seems to have shrunk. There I would find something bigger more comfortable. It felt good the way it was before. 🙂
The increase in spacing is intentional, because we will be reusing design patterns from Material Design (https://material.io/) as much as possible. In Material Design, there is more emphasis on space because it takes the mobile experience into account. If you are browsing from a mobile phone, you need more space between items in order to be able to tap on the desired item.
Having a "compact mode" would definitely be a possibility in the future, but as of now, we are still designing with the default paddings and margins that come with the AngularJS Material framework (https://material.angularjs.org/).
I take note for the comment form. It has indeed shrunk but that one was not intentional.
I think English sentences on Tatoeba should also be checked by corpus maintainers who are native English speakers.
Right now, about 90% of this task is done by non-native speakers.
I've been seeing Alan doing a lot of the work. More helpers would be very welcome.
• Everyone can use the comment section to suggest corrections.
• Everyone can use the rating system.
• Advanced contributors can put @change tags.
• Corpus maintainers can apply necessary changes for inactive members.
[not needed anymore- removed by CK]
Do you comment, rate or tag the ones that you do not think are okay?
If the wrong ones or the ones supposed to be wrong are tagged, it's not really enough, I think.
I still can't be sure that the rest is fine since not all sentences are checked and fixed.
There are some languages in which each sentence written these days is checked.
But it doesn't work in every language.
(E.g. all sentences in Danish written by non-native speakers are tagged. Well, those sentences are much rarer than those in English.)
Of course I saw them.
Perhaps this tag can be confusing for others; one might think that those sentences are correct and the other ones aren't.
(I know what that tag is, I've asked you about two, three years ago.)
Thank you, Pfirsichbaeumchen. I just want to add the following:
(1) If you think there's an error in a sentence, but you're not a native speaker of the language it's written in, you can always add the "@check" or "@needs native check" tag as an alternative to "@change".
(2) Whether or not you leave a tag, you should leave a comment. These are more likely to be seen than tags.
Yes, he does, I can see it as well (and sometimes Objectivesee does it too).
And also Patgfisher and cueayotl did it.
(And from what I can see so far, even my own tagged sentences don't get checked.)
I think, as in some other languages, it would be nice if the English sentences written by non-native speakers were tagged with OK (or with @change) so that other users also know whether the sentence is OK.
In many cases, English sentences by non-native speakers are ignored, even if they are wrong.
For statistics, there currently are (rounded to one significant digit):
* 50 000 orphan English sentences: https://tatoeba.org/deu/Activit..._sentences/eng
* 80 English sentences with the @change tag: https://tatoeba.org/deu/Tags/sh...th_tag/561/eng
* 6 English sentences with the @check tag: https://tatoeba.org/deu/tags/sh...th_tag/841/eng
* 50 English sentences with the @needs native check tag: https://tatoeba.org/deu/Tags/sh...h_tag/1207/eng
The amount of orphan sentences is huge, but the other task queues are in very good shape, I would say.
Thanks for the analysis, Thanuir. I believe the number of orphan sentences in English will always be high because:
- the backlog is so high (50,000 would require a year's worth of effort at a rate of about 137 per day, which would mean hours of work every day)
- the task of dealing with them is laborious and involves so many judgment calls
- the benefits are relatively small
- there's no good way of sorting out the ones that have been looked at from the ones that haven't
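A quick back-of-the-envelope check of the backlog figure in the list above, as a small sketch. The backlog of 50,000 is the rounded number from the statistics post; the function names are made up for this example.

```python
import math


def rate_to_clear(backlog: int, days: int) -> float:
    """Sentences per day needed to clear the backlog in the given time."""
    return backlog / days


def days_to_clear(backlog: int, per_day: int) -> int:
    """Whole days needed to clear the backlog at a fixed daily rate."""
    return math.ceil(backlog / per_day)


backlog = 50_000  # rounded orphan-sentence count quoted in this thread

print(round(rate_to_clear(backlog, 365)))  # sentences per day for one year
print(days_to_clear(backlog, 136))         # days needed at 136 per day
```

Either way the result is in line with the estimate above: clearing the backlog within a year means handling well over a hundred sentences every single day.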
There are many orphan sentences that are grammatically correct but that I would not want to adopt, for one reason or another:
- they're a little unnatural or just old-fashioned
- they reflect a sentiment that I don't agree with
Having said that, I do adopt orphan sentences from time to time. I just don't consider adoption of those sentences as a high priority compared to other tasks that I could be doing, such as marking or fixing incorrect sentences.
My two cents, to add some more info to this analysis: there are 19 orphan sentences containing audio, which might be useful to English learners (there are users who use audio as their main element for translating native sentences, afaik): https://tatoeba.org/ita/sentenc...sort=relevance
Is anybody who contributes audio recordings using a mobile device for recording (tablet or phone)?
If yes, could you please give me some links to your sentences and if possible add a little note whether the internal microphone of the device was used or an external accessory, like ANALOG for the headphone jack or DIGITAL with lightning or even USB...
And a little note about the location of recording would be helpful, too. Like indoors, outdoors, living room, toilet etc. - just to get an idea how much of the room acoustics got captured by the device.
Thanks in advance!
P.S.: Of course all other contributions regarding desktop/laptop setups, recording equipment and recording/editing software are welcome, too.
I would also be interested to know if someone is recording from a mobile phone. Last I checked, the software suggested in the wiki worked only on computers. If you are using a phone, I would appreciate a quick guide on how to do it, in addition to what mramosch already asked. Thanks!
I would also be interested to know :)
We could then update our wiki page to include instructions on how to record from phone or tablet.
Sadly we do not provide support for recording from mobile phones. But I agree that it would be a great addition.
I don't think that anyone should add a thousand sentences one after another with the same word(s), on the same theme. That makes Tatoeba fucking boring.
Now, I can hardly find a sentence that would be worth translating.
A bit more creativity please or else I'll fall asleep reading the tenth sentence.
I think the workable path is the one you have already started on: translating the Hungarian sentences from time to time, which also makes the corpus a bit more diverse. It may take a little more time, but it is worth it.
I often see that sentences with rarer words were not even written by native speakers.
Yes, but that is just l'art pour l'art.
I believe that by choosing the translate mode you get mostly filtered sentences from English native speakers or you can translate directly from contributors who you already know and whose sentences you like.
I just try to ignore certain kind of sentences, especially about religion.
I usually translate the last 2-3-4 pages of a language and I find about two sentences of them interesting to translate. I translate from everybody, I don't distinguish between contributors; everyone can add interesting, rare etc. sentences I can learn from.
I don't have any problem with sentences about religion, they can be interesting. Religion is a controversial and deep theme.
* Search for long sentences (not translated to any language or to Hungarian in particular).
* Search for short sentences (as above).
* Search for random sentences.
* Translate from different languages by using the previous methods.
* Search for an interesting tag and translate everything there.
You can also play around with the search settings - audio only, native only, everything including orphan sentences, etc.
The sentences on Tatoeba are ageless, so there is little reason to only check the recent ones.
Also, in general, let people add the sentences they add, unless they are obviously harmful. Tatoeba is a volunteer project, so the right way to fix things is to add, translate and organize the content you find interesting, while letting others do as they wish.
I mostly use a search like this: https://tatoeba.org/deu/sentenc...=&sort=random. Maybe you will find that more interesting.
If you want creativity, you could try your hand at these sentences: https://tatoeba.org/deu/sentenc...=&sort=random.
Just save the links in your profile.
* Tatoeba Top 30 Languages Interactive Graphs *
Tatoeba Top 30 Languages Interactive Graphs have been updated:
Is it likely that if I add English vocabulary items to the list any native English speaker will write sentences to them?
I'm asking because I'm planning on adding some as I'm reading a book and I'm collecting unknown words from it. But I don't want to bother with that if it's to no avail.
Also I saw a few words in Cyrillic writing among the English words.
Some of my English words have been added, some have not. I do not know whether this has been due to them being vocabulary words or by blind luck. Many English contributions use quite basic vocabulary, but more varied sentences do exist in the corpus.
I usually add unknown words (when reading something on a computer) and then once in a while go through my vocabulary and remove the words I understand, leaving the ones I do not, and checking any sentences with those in case they help understanding.
There are quite a lot of English vocabulary items, so in any case it will take quite some time before any particular one is addressed with multiple sentences, unless one gets lucky.
Thanks for replying! :)
I guess I'll add a few more then and see what will happen.
Mark sentences as "not directly linked".
Reason / Why does it help?
Short: Working together on finding directly linkable sentences gets enabled.
If someone tries to find indirectly linked sentences, that could be directly linked, he/she can use this search:
He/She reads through all 65 pages of sentences and figures out that none of the indirectly linked sentences is directly linkable.
Now a second user wants to do the same: trying to directly link indirectly linked sentences. This second user will also have to read through all 65 pages of sentences, only to find out that there is nothing to do.
Bug report for linking sentences:
(1) The following link shows Japanese sentences with indirect German links: https://tatoeba.org/eng/sentenc...io=&sort=words
(2) Link an indirectly linked sentence.
(3) <sentence gets linked>
Expected: <Apart from the fact that the sentence is now linked, nothing should have changed.>
Actual: <All other languages for that sentence also get shown>
Search is finding sentences that do not contain the specified word. This was reported 24 days ago (see link 1 below), and @gillux made a change that he thought might have fixed the problem, but apparently it didn't. For instance, a general search for "Tom" in English (see link 2 below) brings up these sentences:
These sentences are owned by a variety of people, and the logs don't reveal anything that suggests that they ever contained the word "Tom". Nor do all the sentences contain a tag, or audio. However, they are all short. If I set the sort order to random or to longest sentences first, I don't see any false hits.
Could it be that the search engine has seen so many sentences with "Tom" that it now hallucinates them even in sentences that don't contain the word?
I reported this problem on GitHub as well:
Link 1: https://tatoeba.org/eng/wall/sh...#message_32265
Link 2: https://tatoeba.org/eng/sentenc...io=&sort=words
I had some similar issues within a day.
Yeah, seems like the same problem resurfaced.
"Where is the butter?"
I just searched again for "darn" as reported by brauchinet in the previous thread:
There are 12 results and 2 are incorrect:
#1202147 - I'm about to die.
#690834 - There are no comments for now.
Do you remember if those were the incorrect sentences when you also checked for this search?
I'm sure that these were not the incorrect results I found the first time.
I noticed that when I use words from wrong search results I get wrong results again.
For example, deniko's "what the fuck" -> "Where is the butter?"
Many errors and even "Get the fuck out!"
I don’t know if this is of any help:
The “darn” search finds 12 results. I did a search using the downloadable data.
The ones not found by the search engine (that is: wrongly displayed) are:
#1359478 Well I'll be darned!
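For anyone who wants to repeat this kind of cross-check against the downloadable data, here is a minimal sketch. It assumes the export is a tab-separated file with the columns sentence ID, language code and text, and it does plain whole-word matching with no stemming, so its results can legitimately differ from the site's search. The function name is illustrative, and the sample rows are just the three sentences quoted in this thread, assumed to be tagged "eng".

```python
import csv
import re


def find_word(tsv_lines, word, lang="eng"):
    """Return (id, text) pairs whose text contains `word` as a whole word.

    Assumes each row is: id <TAB> language code <TAB> sentence text.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    reader = csv.reader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed or empty rows
        sentence_id, sentence_lang, text = row[0], row[1], row[2]
        if sentence_lang == lang and pattern.search(text):
            hits.append((int(sentence_id), text))
    return hits


# Sample rows built from sentences quoted in this thread.
SAMPLE = [
    "1202147\teng\tI'm about to die.",
    "690834\teng\tThere are no comments for now.",
    "1359478\teng\tWell I'll be darned!",
]

print(find_word(SAMPLE, "die"))
print(find_word(SAMPLE, "darn"))
```

With a real export you would pass an open file object instead of `SAMPLE`. Note that whole-word matching for "darn" skips "darned", which a stemming search engine may well treat as a hit.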
To everyone, please report here any strange search results that you find. Posting the URL of the search is enough.
I really have no idea what is causing this right now. We need to find clues on how to reproduce it systematically.
The search below produces a wrong result, since the searched term is nonexistent in the result.
Trang, I think that your complete reindexing ( https://github.com/Tatoeba/tato...ment-524564520 ) solved the problem. When I search for such words as "Tom", "darn", "histrionic", "away", or "butter", I only see good results now. Thank you!
Just a warning: this may solve the issue only for the short term. We have not identified the cause yet.
At least we know that the issue does not happen in the indexation of the main indexes. But something might be going wrong in the delta indexes, or during the merge of the delta indexes into the main indexes...
We will have to see if it happens again.
Sorry to say it has happened again:
Most of (all of?) the wrong results are recently added sentences.
Search: The audience gasped
A driver was blocking the intersection.
This city suffers from gridlock.
and so on.
Thanks for reporting.
I updated the GitHub issue:
Still haven't identified the cause...
One more observation:
take a wrong search result (for example: totally lost):
#8130253 Tom read a book with his son.
go back 5 English sentences (5 times "previous" with language=eng)
and you get the correct result:
#8130246 Tom was totally lost.
I tried quite a few, it always worked.
(Well, it only works with the most recent sentences - it doesn't with older ones such as:
Kiev is the capital of Ukraine.)
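The observation above can be written down as a small check. This is only a sketch of the pattern described here, not an explanation of the bug: the list of English sentence IDs is made up except for its two endpoints (#8130246 and #8130253), and `step_back` is a hypothetical helper.

```python
def step_back(english_ids, start_id, steps=5):
    """Return the ID reached by going `steps` English sentences back.

    Mirrors pressing "previous" `steps` times with language=eng.
    Returns None if there are not enough earlier sentences.
    """
    ordered = sorted(english_ids)
    position = ordered.index(start_id)  # raises ValueError if start_id is absent
    target = position - steps
    return ordered[target] if target >= 0 else None


# Hypothetical ID list: only the first and last IDs come from the thread;
# the four in between are invented so that the gap is five English sentences.
ENGLISH_IDS = [8130246, 8130247, 8130249, 8130250, 8130252, 8130253]

print(step_back(ENGLISH_IDS, 8130253))  # 8130246
```

Run against the real ID list of a language, a check like this could show whether every wrong hit sits exactly five sentences after a correct one.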
Just curious: How did you figure that out?
Starting from one incorrect result sentence (eg. Tom was totally lost) I got a chain of sentences:
#8130253 Tom read a book with his son
#8130259 He would always break his promise
#8130357 Tom was confused by what had happened
#8130381 Tom thought the attraction was mutual
Obviously the sentence numbers are increasing, but, to my disappointment, not by equal steps. This brought me to the idea that maybe only English sentences count.
That instance follows the "brauchinet rule": pressing the "previous" button five times with the language set to "eng" gets you back to a sentence that contains the search word (in this case, #8130378 ).
I see. Thanks.
This search seems to have been solved by itself. I do not see "Tom reviewed his notes" in the results...
I added translations for that sentence after reporting the issue on the Wall. When I rechecked the search link after several minutes, the irrelevant sentence was gone. Could they be connected? Maybe translating a misindexed sentence triggered something.
The below search finds no result
But there is actually a sentence containing these words:
I suspect one cause:
At the exact time 2019-08-25 17:03 eleven sentences have been simultaneously added:
Perhaps the system cannot correctly assign numbers to the sentences if too many of them are added at the same time.
This is because "?" is treated as a one-letter wildcard.
like in: https://tatoeba.org/eng/sentenc...rom=por&to=und
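Since a trailing "?" is read as a wildcard rather than as punctuation, one workaround is to strip final question and exclamation marks from the search string before searching. A minimal sketch (`clean_query` is an illustrative helper, not part of Tatoeba):

```python
def clean_query(query: str) -> str:
    """Remove trailing '?' and '!' (and surrounding whitespace) from a query."""
    return query.strip().rstrip("!?").rstrip()


print(clean_query("Where is the butter?"))  # Where is the butter
print(clean_query("Get out!! "))            # Get out
```

Punctuation inside the string, such as the apostrophe in "I'm", is deliberately left alone here.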
Exactly. On this wiki page:
you will find the following:
"Leave punctuation out of your search string. Most punctuation will be ignored, but a final exclamation mark (!) or question mark (?) will actually interfere with the search."