Ah I see, then it was probably one of their sentences which I saw. Is the idea that the corpus maintainer for a given language sets the guidelines for that language?
The only negative side I can see is that, when speaking, I wouldn't normally translate given names depending on the language I'm speaking. So, for instance, I wouldn't refer to you as Richard in English. And to use your example, the one German Maria I know still goes by "Maria" in English-speaking countries.
The positive sides are that it's a good way of introducing variety in the sentences, and of teaching names in other languages. I can also see names being translated in the context of example sentences given in language textbooks, and I have read some translated novels in which the names were translated too.
Are there any guidelines regarding the translation of given names? I found some sentences where e.g. Luke in English was translated to Luc in French. It seemed a bit odd to me.
Having worked for a few companies (large and small) that do work in machine learning, I can say that the situation for them is often a little different. I am not a lawyer myself, but the following is based on my first-hand experience.
If a researcher/engineer at one of those companies wants to do some work which uses a dataset, the company's legal team must review the dataset's license. Legal teams love public domain data, because it makes their job easy. Even very liberal licenses that impose requirements (such as the Attribution part of CC BY) or which are written in foreign languages with foreign legal systems in mind (such as CC BY 2.0 FR) make the lawyers' work harder, so approval will take longer or may just be denied, especially if the researcher wants to use a bunch of different datasets, each with its own license.
Now, for major languages like French or Japanese, I as a researcher can make a business case for getting the lawyers to spend the extra time reviewing 10 licenses in 10 different jurisdictions: the company has lots of customers speaking those languages, and so better support for them will generate extra revenue. Things are different for a minority language, where better support might make a few people happy and help preserve a culture but is unlikely to make the company any money whatsoever. Chances are that, unless the dataset is in the public domain or under a license that has already been tested in the local legal system, getting approval will be much harder.
In an ideal world companies would be OK with jumping through a few extra hoops to support a minority language, but in reality only a few places will be willing to do that. This is why I feel wide CC0 support is key – not because I want to make corporations' lives easier per se, but for the sake of communities which speak endangered languages and want to ensure their work is as widely accessible as possible.
Having said that, regardless of whether something like this gets implemented in the future or not, I'd like to say a big thank you to all the Tatoeba admins, community, and developers for the great work they've put into this project. It's pretty amazing what you've all achieved.
> Or you want a license that basically says "My derivative work will be automatically re-licensed to the most permissive license possible if the original work is re-licensed to a more permissive license". I don't think such a license has been created.
Yeah that's pretty much it. I think that can be achieved by simply stating that you release your contribution (and only your contribution – not the derivative work as a whole) under CC0, but I may be wrong as I am not a lawyer.
> But I guess you could publish somewhere that you wish to have all your contributions released under CC0 when possible
I already have a sentence on my profile to that effect, yes. In fact I've noticed a few other users with similar statements.
> Hopefully we will implement the possibility to switch the license of translations in a not too far future, and hopefully you'll still be around by then.
Nice! Until then, I'll make sure to look both ways whenever I cross the street.
Oh I'm not suggesting the translation as a whole be released under CC0 regardless of the underlying sentence's license. I don't think that would stand, as you say. I'm suggesting something that is subtly different.
Example scenario: You write a sentence, and I translate it.
What I'm proposing is to make a distinction between:
(1) your underlying sentence, which is your own (copyright-protected) expressive creation;
(2) and my translation, which is a combination of my own expressive creation and yours.
I am further suggesting that you allow users to license their own expressive creations under CC0, if they so desire.
This might seem silly because how can one possibly take my translation and "separate" my expressive creation from yours? Obviously users of the translation will still need to abide by the license imposed by you. However, I can think of at least two scenarios where allowing a distinction between the two separate expressive creations would be useful:
(1) Imagine you license an original sentence under CC BY, I translate it, but I don't actually care about being credited myself for the translation. I should then be able to state that I do not wish to impose any further restrictions on the translation, other than the ones which already exist on the underlying work. This would mean that users of the translation would only have to abide by the CC BY license of the underlying work. Compare this to the current situation, where translators are essentially forced to apply extra restrictions to the translations they contribute (in the form of an extra CC BY license), on top of the conditions that already exist on the underlying sentence.
So basically, the current situation is:
TRANG's sentence released under CC BY + JeanM's contribution released under CC BY = JeanM's translation of TRANG's sentence, released under TRANG's and JeanM's CC BY licenses *simultaneously*.
And what I think would be quite neat is to make this possible:
TRANG's sentence released under CC BY + JeanM's contribution released under CC0 = JeanM's translation of TRANG's sentence, released under TRANG's CC BY license *only*.
(2) Imagine you contribute loads of original sentences under CC BY, and I translate all of them. At a future point in time, you decide to relicense the original sentences under CC0. I would actually have been fine with releasing the translations under CC0 – but because of the interface, I actually had to apply an *additional* CC BY license to the translations. Then I get run over by a bus / I disappear from the face of the Internet, and so the translations are stuck with the CC BY license. Had I been allowed to state "my own expressive creations are released under CC0" then, once you relicensed your sentences under CC0, the translations would also have been automatically relicensed under the more permissive CC0.
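The two scenarios above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how Tatoeba works: the `Contribution` class and the `required_attributions` function are hypothetical names, and treating a license as nothing more than "does this author need to be credited?" is a simplification rather than a legal analysis.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Contribution:
    """One expressive layer of a sentence (hypothetical model)."""
    author: str
    license: str  # assumed to be either "CC BY" or "CC0"


def required_attributions(layers):
    """Return the authors a reuser must credit, in layer order.

    Simplifying assumption: CC BY means "credit this author",
    CC0 means "no obligation from this layer".
    """
    return [layer.author for layer in layers if layer.license == "CC BY"]


source = Contribution("TRANG", "CC BY")

# Current situation: the translator's layer is forced to be CC BY.
current = [source, Contribution("JeanM", "CC BY")]

# Proposed: the translator releases only their own layer under CC0.
proposed = [source, Contribution("JeanM", "CC0")]

# Scenario (2): the source author later relicenses under CC0.
relicensed = [replace(source, license="CC0"), Contribution("JeanM", "CC0")]

print(required_attributions(current))     # both authors must be credited
print(required_attributions(proposed))    # only the source author
print(required_attributions(relicensed))  # nobody: effectively CC0
```

Under this model the translation never becomes more permissive than its source, and it follows the source automatically if the source is later relicensed.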
Apologies for the wordiness, and I hope I managed to explain myself more clearly.
My own personal motivation for releasing as much as possible under CC0 is that I am a speaker of an endangered language, and I want to impose as few burdens as possible on potential users of the data I create. I am basically desperate for companies/researchers to use data in my language, and I know that many companies will prefer data under CC0. (Incidentally, that's also the license required for text in Mozilla Common Voice.)
I think it's just a limitation of Tatoeba's interface. If it's a translation of a sentence under CC0, legally you can always release it under CC0, as far as I understand.
https://en.wiki.tatoeba.org/art...contributions#
"While it should logically be possible to use CC0 for the translations or audio of a CC0 sentence, we have not yet implemented this possibility in Tatoeba. We will consider it once we have a larger number of CC0 sentences."
I'd be curious to hear what people think about the following feature suggestion: allow users to release *any* personal contribution under CC0, even if it's e.g. a translation of a sentence that's under CC BY 2.0 FR.
The original license would still apply, since a translation is a derivative work – and a warning should probably be shown. However, should the source sentence ever be released under CC0, then I believe this would mean that the translation could also be automatically switched to the less restrictive CC0 (which I favour for my own contributions).
(Although of course I am not a lawyer and I could be completely wrong about all of this.)
That works, thanks!
I tried to join the team for English, but I can't select "English".
I have seen the "do not insert annotations into sentences" policy here: https://en.wiki.tatoeba.org/art...into-sentences
While I am aware of that, I wonder if adding annotations *on top* of sentences (i.e. separately) would be a partial solution here. Below every sentence there could be an extra field, perhaps only displayed to advanced contributors, that allows marking proper nouns, and maybe even other things such as dates (in the simplest form, picture something like the "highlighter" feature of PDF annotation software). This would not have any of the drawbacks listed on the page linked above, as the annotation would be a completely optional separate field that's hidden by default.
The advantages would be that downstream users of the data (e.g. Memrise-style study deck apps, or translation software) could then attempt to replace proper nouns to add some variety to the data. This is obviously not as trivial as I make it sound, as one would have to contend with phenomena such as inflection, but it's certainly a starting point – and inflection could be dealt with downstream, or partially handled by more sophisticated annotation schemata (which could be used to mark gender, declension, etc.).
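As a rough illustration of the idea, here is a minimal Python sketch of annotations stored separately from the sentence as character offsets, plus a downstream substitution step. Everything in it is an assumption made for the example: the `Span` class, the `substitute` function, the label names, and the sample sentence are all invented, and inflection is ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A stand-off annotation; the sentence text itself is never modified."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical label set, e.g. "PERSON", "PLACE", "DATE"


def substitute(sentence, spans, label, replacement):
    """Return a copy of `sentence` with every `label` span replaced.

    Spans are processed right to left so that earlier offsets stay valid
    after each replacement.
    """
    result = sentence
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if span.label == label:
            result = result[:span.start] + replacement + result[span.end:]
    return result


# Made-up sentence and annotations, for illustration only.
sentence = "Luke went to Paris on Monday."
spans = [Span(0, 4, "PERSON"), Span(13, 18, "PLACE"), Span(22, 28, "DATE")]

varied = substitute(sentence, spans, "PERSON", "Maria")
print(varied)
```

Because the spans live in their own field, the stored sentence stays exactly as the contributor wrote it, and tools that don't care about annotations can ignore them.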
As a researcher, I have found this project's data to be really useful for machine learning. Is there a list, somewhere on the website, of research papers that have used Tatoeba data? I think it would be a fantastic way of showcasing the project's impact.
A good starting point: https://scholar.google.co.uk/sc...t=1,5&as_vis=1