Here you can ask general questions, such as how to use the Tatoeba project, report bugs or strange behavior on the site, or simply chat with other members of the community.
Before asking a question, don't forget to read the FAQ.
Wall (4861 threads)
He's still doing work right now so I'm not gonna disturb him. A whole year of duplicates to clean up...
- The Book of the Dead
There was some activity 2 days ago, 3 days ago, etc.
Nothing yesterday, but I don't think we generate that many duplicates in a day; there could be days without any duplicates at all.
The bulk of the exact duplicates that went through the database have been taken care of, I guess. From now on, Horus will merge sentences in two cases: when a sentence is edited and becomes an exact duplicate of another sentence, or when a wrong language flag is corrected and a sentence with the same text and the correct flag already exists, so that the formerly misflagged sentence becomes an exact duplicate.
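To make the two cases concrete, here is a minimal sketch of what "exact duplicate" could mean in practice. It is not Horus's actual code: the function name, the tuple layout, and the idea that two sentences are duplicates when they share the same language flag and the same text are all assumptions made for illustration.

```python
from collections import defaultdict

def find_exact_duplicates(sentences):
    """Group sentence ids that share the same (language flag, text) pair.

    `sentences` is a list of (id, lang, text) tuples; this layout is hypothetical.
    """
    groups = defaultdict(list)
    for sent_id, lang, text in sentences:
        groups[(lang, text)].append(sent_id)
    # Only groups with more than one id are duplicates.
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}

sentences = [
    (1, "eng", "Hello."),
    (2, "fra", "Hello."),  # wrong flag, so not yet a duplicate of sentence 1
    (3, "eng", "Hello!"),  # different text, not a duplicate
]
before = find_exact_duplicates(sentences)

# Correcting the flag on sentence 2 turns it into an exact duplicate of sentence 1.
sentences[1] = (2, "eng", "Hello.")
after = find_exact_duplicates(sentences)
```

In this sketch nothing is flagged before the correction, and sentences 1 and 2 form a duplicate group afterwards, which is the point at which a merge would happen.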
I can finally share some big news with you. Tatoeba will be receiving a $25,000 grant via the Mozilla Open Source Support (MOSS) program. This was a long process, but it's now finally official :)
A little bit of background story.
Back in October last year, folks from Mozilla got in touch with us to explore possible ways of collaborating. They're working on a project called Common Voice, and with this project they basically want to collect recordings of people's voices. A lot of them.
To achieve this, they need sentences for people to read. Someone told them about Tatoeba... And that's how it started.
But it's not that simple.
One of the requirements of Common Voice is to be able to release their data under CC0 (the Creative Commons version of public domain). Tatoeba's data is CC-BY. Common Voice cannot reuse CC-BY sentences to record audio that they'll publish as CC0. They can only reuse sentences that are in the public domain or CC0.
So there's quite some work to do there, if we want to let Common Voice reuse sentences from Tatoeba. This is what the MOSS grant is for. We cannot change our CC-BY license for the data we've released so far. But we can evolve Tatoeba to handle more licenses than just CC-BY.
I'll explain in more detail later exactly what changes we plan to make. But until then, I would really like to get an idea of where the Tatoeba community stands on this matter.
Would you consider putting part (or all) of your sentences under CC0? Why, or why not? Let me know via this form: https://goo.gl/forms/Nd6FcAoyd1zkfB4I2
Edit: I've been writing "CC-0" this whole time, but it seems CC0 (without the dash) is the correct acronym.
That's probably obvious from your post, so sorry for a dumb question, but do you mean written sentences or voice recordings?
A move from CC-BY to CC0 does not change whether users can use a project for commercial purposes (they can in both cases), or whether they can make licensing for their own project more restrictive than Tatoeba's licensing (they can in both cases).
So basically, it all comes down to the difference between "I'm okay with someone taking my sentences and doing anything they want with them, as long as they say that they originally came from Tatoeba" and "I'm okay with someone taking my sentences and doing anything they want with them, period."
Is that correct?
There may definitely be cases where CC0 is not an option, but I wouldn't say that using CC0 within Tatoeba will never be possible. We'll have to see.
I'll quote what is written in the Creative Commons page for CC0:
"while no tool, not even CC0, can guarantee a complete relinquishment of all copyright and database rights in every jurisdiction, we believe it provides the best and most complete alternative for contributing a work to the public domain given the many complex and diverse copyright and database systems around the world."
> So basically, it all comes down to the difference between "I'm okay with someone
> taking my sentences and doing anything they want with them, as long as they say
> that they originally came from Tatoeba" and "I'm okay with someone taking my
> sentences and doing anything they want with them, period."
> Is that correct?
Yes, that's pretty much it.
Sure I would. I mean, one of the main points of Tatoeba's existence is to be a kind of "multi-sentence dictionary" that would be a great resource for people who need its sentences. In other words, I truly believe we are here to help other people, and we will do so if we allow all of our sentences to be put under CC0. Besides, it's a good way to see them being used somewhere.
For audio contributions, however, it is a little bit different, as it means that my voice can be used by anyone, anywhere, out of context (the first two points already exist, but the last point seems to be a not-so-negligible difference).
Try this search.
It turned out not to be an easy task, because most of the work has to be done manually. If you have any ideas on how to make it easier, please tell me.
Read the "About" part at the top of the page.
Some Recently-contributed English Sentences (CEFR-A1-B2 Color-coded)
Are you planning to develop it further, using grammar criteria as well? Do you attribute only one single level to each word?
May I ask which software you use to sort out the sentences?
Not likely, since that would require a lot more manual labor than just looking up words.
> Do you attribute only one single level to each word?
The level is the lowest level that word has.
For words with multiple meanings (and thus possibly more than one level), this type of vocabulary analysis won't always get it right.
> May I ask which software you use to sort out the sentences?
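A rough sketch of the "lowest level" rule, for anyone curious. This is not the website's actual code: the function name, the level list, and the example word are assumptions made for illustration only.

```python
# CEFR levels from easiest to hardest.
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def word_level(sense_levels):
    """Return the lowest (easiest) CEFR level among a word's senses."""
    return min(sense_levels, key=CEFR_ORDER.index)

# Hypothetical entry: one everyday sense and one more advanced sense.
levels_for_bank = ["B2", "A1"]
assigned = word_level(levels_for_bank)
```

Here the word is tagged with its easiest level even when a sentence uses the harder sense, which is exactly the case where this kind of analysis gets it wrong.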
I used the website listed at the top of the page, and then did some manual editing.
I'm more likely to stick with the multiple Graded Reader Levels, since this divides the vocabulary into smaller steps, so it is likely more useful for many students.
Here are the levels, listed with explanations and an easy way to search.
You can also directly see the lists on tatoeba.org using this search.
You will notice a number of negative numbers in the "since last week" column; these are due to deletions by the duplicate-merging script that Trang ran last week.