Wall (6,616 threads)

Hi. I have a very long list of sentence IDs. Is it possible to automate the process of creating a list? Is 140,000 sentences too much for a list?
https://ufile.io/sgmx84oa

It's possible to automate it, but it would require you to write some custom scripts.
Whether 140,000 sentences is too much depends on what you intend to do with the list. There are bigger lists on Tatoeba already, so storing them is not a problem.

May I ask how you compiled the list of sentence IDs? What's your purpose?
I'm just curious about how people use Tatoeba :-)

Tatoeba has a lot of sentences covering some words, but few sentences covering other words. What I did was take a dictionary and pick ten random sentences for each word in the dictionary. This was an experiment to create a more balanced subcorpus of English. I'm not sure how useful this list is.
Some languages are better behaved than others, in the sense that fewer of their words appear in a form other than the dictionary form. English is pretty well behaved. Portuguese, French, and German not as much. Chinese languages are probably the best behaved.

This is just an idea to help tackle the problem of bored users translating from English.
It's possible to build on it by, for example, giving preference to sentences that belong to certain lists or that have audio.
My simple script is posted below.
https://we.tl/t-9mjAchLkoE
Also, the current script to switch a user's sentences to CC0 only switches original sentences. It should be relatively straightforward to go through the entire list of links looking for pairs where one sentence is CC0 and the other belongs to the user. Those sentences can also be switched. A few runs of the script may be necessary in order to switch all sentences, but at least the script doesn't require any fancy graph algorithms.
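The repeated passes over the links mentioned above could be sketched roughly as follows. This is only an illustration: `licenses`, `owners`, and `links` are hypothetical in-memory structures, not Tatoeba's actual schema, and the license string "CC0 1.0" is an assumption. The sketch follows the rule exactly as stated (one sentence of a linked pair is CC0, the other belongs to the user) and does not check which sentence of a pair is the original.

```javascript
// licenses: Map of sentence ID -> license string (hypothetical values).
// owners:   Map of sentence ID -> username.
// links:    array of [idA, idB] pairs of sentences linked as translations.
// Returns the number of passes over the links that were needed.
function propagateCc0(licenses, owners, links, user, cc0 = "CC0 1.0") {
  let passes = 0;
  let changed = true;
  while (changed) {
    changed = false;
    passes++;
    for (const [a, b] of links) {
      // Check the pair in both directions.
      for (const [from, to] of [[a, b], [b, a]]) {
        if (
          licenses.get(from) === cc0 &&
          licenses.get(to) !== cc0 &&
          owners.get(to) === user
        ) {
          licenses.set(to, cc0);
          changed = true; // another pass may now find more pairs
        }
      }
    }
  }
  return passes;
}
```

The loop stops as soon as a full pass switches nothing, which plays the role of "a few runs of the script" without needing any graph algorithm.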
And also, sentences that don't have any audio could have a button for reading the sentence out loud using the voice synthesizer on the user's device. It should be possible to do that for some languages.

Could you please describe the script in more detail, especially for those people who can't read scripts, or don't want to download a file without knowing more about it?

The script includes every sentence that's written in English, together with its ID. It also includes a list of dictionary words. Each word begins with an empty list of sentence IDs. The script splits each sentence into words and for each word in a sentence, the sentence's ID is added to that word's list of sentence IDs. After that part of the algorithm is done, each word has become associated with the list of the IDs of all the sentences that contain that word. Then ten random sentences for each word are picked and duplicates are eliminated. There's nothing more to it.
The code can be found at the very end of the file. The rest is just data. My implementation is huge and clunky because JavaScript doesn't really allow local pages to open local files, as a security precaution. JavaScript isn't meant to be used for this kind of task.
By the way, most words in the dictionary have few or no sentences.
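For readers who would rather not download the file, the algorithm described above can be sketched in a few lines. This is a minimal reconstruction from the description, not the posted script: the function names, the `{ id, text }` sentence shape, and the way sentences are split into words are all assumptions.

```javascript
// Build a map from each dictionary word to the IDs of the sentences containing it.
// sentences: array of { id, text }; dictionary: array of words (assumed shapes).
function buildIndex(sentences, dictionary) {
  const index = new Map(dictionary.map((word) => [word.toLowerCase(), []]));
  for (const { id, text } of sentences) {
    // Split on anything that is not a letter or an apostrophe (a simplification).
    const words = new Set(
      text.toLowerCase().split(/[^\p{L}']+/u).filter(Boolean)
    );
    for (const word of words) {
      const ids = index.get(word);
      if (ids) ids.push(id); // words outside the dictionary are ignored
    }
  }
  return index;
}

// Pick up to `perWord` random sentence IDs per word; the Set removes duplicates.
function sampleBalanced(index, perWord = 10, random = Math.random) {
  const chosen = new Set();
  for (const ids of index.values()) {
    const pool = [...ids];
    const n = Math.min(perWord, pool.length);
    // Partial Fisher-Yates shuffle: only the first n positions are needed.
    for (let i = 0; i < n; i++) {
      const j = i + Math.floor(random() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      chosen.add(pool[i]);
    }
  }
  return [...chosen];
}
```

Words with fewer than ten sentences simply contribute all the sentences they have, which matches the remark that most dictionary words have few or none.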

Thank you for sharing this issue with the community. I too would like to easily create and update very long lists on Tatoeba. Ideally, a button placed in the list header would allow us to edit its sequence of sentence IDs directly on the site.

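The API endpoint for it is actually extremely simple. You need to have cookies from the site, so you can either transfer those to a program like curl, or just paste the code into the browser console on any page of the site. This seems to work for me:
// Adds one sentence to a list. Run it from the browser console while logged in,
// so that the site's cookies are sent along with the request.
async function addToList(list, sente) {
  const response = await fetch(`/is/sentences_lists/add_sentence_to_list/${sente}/${list}`);
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
}
addToList("170279", "3329778");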

I have a script to add CC0 sentences to a list, and one thing that hasn't been said before is that it's rather slow.
There seems to be a rate limit on the server side where you can't add more than 1 sentence per second.
So the adding in your case will take almost 2 days (140,000 seconds is about 39 hours).
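A throttled loop along these lines would stay under such a limit. It is only a sketch: the one-request-per-second figure is the observation above rather than a documented limit, and `addOne` is a hypothetical callback that sends a single request for one sentence ID.

```javascript
// Rough time estimate: `count` requests at `perSecond` requests per second.
function estimateHours(count, perSecond = 1) {
  return count / perSecond / 3600;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls addOne(id) for each ID in turn, pausing delayMs after every call.
// Returns the IDs whose request failed, so they can be retried later.
async function addAllThrottled(ids, addOne, delayMs = 1000) {
  const failed = [];
  for (const id of ids) {
    try {
      await addOne(id);
    } catch (error) {
      failed.push(id);
    }
    await sleep(delayMs);
  }
  return failed;
}
```

For 140,000 sentences, `estimateHours(140000)` comes out at a little under 39 hours, so the script has to be left running in an open tab or a terminal for that long.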

Do we have a problem?
When you search for random sentences, each member's sentences show up, on average, in proportion to how many sentences that member has. Okay, it's obvious. We already know that.
But!
Some members write English sentences incredibly fast, producing batches of very similar ones, and so slowly but steadily they take over the 'random sentences from everyone' section. Once a member reaches one hundred thousand sentences out of one million, the system will show their sentences 10% of the time. (If I don't translate them and others don't either, the percentage will be even higher when we search for sentences that haven't been translated yet.)
Is it bad?
See for yourself. What I know is: first, we will switch to contributors we like; second, the number of newcomers willing to contribute will decrease when they see the lack of diversity; third, the number of untranslated sentences will grow.
Why are you seeing more and more Tom, Ziri, Sami, Layla sentences?
Simple: either you haven't translated enough of them to stop seeing them, or someone is writing more than you can translate.

For the random sentence on the homepage, we could change it to select a random user first and then select one of their sentences, effectively giving equal weight to everyone, no matter how many sentences they have. But for the random ordering of search results, this would be a bit harder to implement. Maybe we could add an option to the search page to show at most one sentence per user.
Other ways to cope that you haven't mentioned yet:
- translate from languages other than English
- write English sentences yourself
- use Tatominer
- sort sentences by creation date and translate mostly sentences added before July 2011
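The homepage idea above (pick a random user first, then one of their sentences) could look something like this. It is a sketch under assumptions: `sentencesByUser` is a hypothetical in-memory map, whereas the real site would do this with database queries.

```javascript
// sentencesByUser: Map of username -> non-empty array of that user's sentences.
// Every user is equally likely to be chosen, however many sentences they have.
function pickRandomSentence(sentencesByUser, random = Math.random) {
  const users = [...sentencesByUser.keys()];
  if (users.length === 0) return undefined;
  const user = users[Math.floor(random() * users.length)];
  const sentences = sentencesByUser.get(user);
  const sentence = sentences[Math.floor(random() * sentences.length)];
  return { user, sentence };
}
```

With two users, one holding a thousand sentences and the other a single one, each user's sentences would come up about half the time, instead of roughly 99.9% versus 0.1% under plain uniform sampling over sentences.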

I like the suggestion about adding a feature to display at max one (or however many) sentence(s) per user.
We have an option in the advanced search that allows us to search for sentences by a particular user. I think it would also be useful if there was a feature that allowed us to exclude sentences from certain users. If you could enter multiple usernames into a field, separating them with commas like tags, I think that would be really useful and would clear up problems like Cabo's.
I have another example of a case in which this feature would be useful. Sometimes I search for longer sentences to translate for variety, since the majority of sentences on Tatoeba are fairly short. The easiest way to do this, as far as I can tell, is to perform an advanced search with the "sort by" set to "fewest words first" and then to click the "reverse order" button. This shows all the longest sentences first. The downside I encountered is that some contributors (generally speaking, only a handful per language) upload long texts from classical literature or other public domain works. There's nothing wrong with that, of course, but if I want to just translate original (written by user) sentences, it means I have to go through them individually and sort them, which is a little time consuming.
Speaking of original sentences, it would also be cool if we could implement an "original only" check box in the advanced search, similar to the one on the "sentences" page for each user. That way I could search for original sentences by a user that haven't been translated into a given language. Maybe someone has mentioned one or more of these points before, but that's just what I'm thinking. I don't know how easy/difficult it would be to implement these changes.

> I like the suggestion about adding a feature to display at max one (or however many) sentence(s) per user.
I wrote a new issue on GitHub: https://github.com/Tatoeba/tatoeba2/issues/2943
> I think it would also be useful if there was a feature that allowed us to exclude sentences from certain users.
This kind of feature has been requested before. https://github.com/Tatoeba/tatoeba2/issues/2008
> it would also be cool if we could implement an "original only" check box in the advanced search
This has also been requested before. https://github.com/Tatoeba/tatoeba2/issues/2159
> Maybe someone has mentioned one or more of these points before, but that's just what I'm thinking.
When you tell us what you're thinking and it happens to be something that someone else has talked about before, that's a good thing, because it helps us find out which features people would like to have the most.

It seems like a good idea: just check how many times the same user's sentences appear and limit that number.
It would be great to see a mixture of different contributors' translations and sentences.

> If I don't translate them and others don't either, the percentage will be even higher when we search for sentences that haven't been translated yet.
@Cabo, I share your concern. If nothing is done, these "shunned sentences" will indeed be more and more visible on Tatoeba.
To solve this issue, perhaps shunned sentences should be detected and unapproved after a few years without any translation. The shunned sentences could be the sentences added by a contributor over a period of time and significantly less translated than their peers from other contributors.

"To solve this issue, perhaps shunned sentences should be detected and unapproved after a few years without any translation."
Significantly more people are translating from English than, for example, from Finnish.
Who or what will be assigned to detect such sentences?
If I don't see those sentences in the random search because someone with only a hundred sentences wrote them, and others don't see them either, are they then considered shunned after a few years?

I'll try to be a little more specific with an example. Let's imagine that member X added several English original sentences in 2019 and that today, the percentage of these sentences that have been translated is 10 times lower than the average among all English contributors of 2019. If this difference in performance is statistically significant, I think it would be wise to consider the untranslated sentences that member X added in 2019 as "shunned" and to reduce their visibility in favor of more promising sentences.
This system would be automated and would allow the Tatoeba Project to take advantage of the "wisdom of the crowd" to self-regulate and stay relevant over time.
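One possible way to make "statistically significant" concrete is a two-proportion z-test. The post doesn't name a test, so this is just an illustrative choice, and the function names and thresholds (a factor of 10, and z below -2.58, about the 1% level for a two-sided test) are assumptions.

```javascript
// Compares one contributor's translated share with that of all other
// contributors from the same period, using a two-proportion z-test.
function shunnedScore(userTranslated, userTotal, othersTranslated, othersTotal) {
  const pUser = userTranslated / userTotal;
  const pOthers = othersTranslated / othersTotal;
  const pooled =
    (userTranslated + othersTranslated) / (userTotal + othersTotal);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / userTotal + 1 / othersTotal)
  );
  return {
    ratio: pOthers / pUser, // how many times lower the user's share is
    z: (pUser - pOthers) / standardError, // strongly negative = far below peers
  };
}

// Both conditions must hold: a large gap AND enough data to trust it.
function looksShunned(score, minRatio = 10, zThreshold = -2.58) {
  return score.ratio >= minRatio && score.z <= zThreshold;
}
```

The significance check matters for small contributors: someone with three untranslated sentences has a translated share of zero, but that is far too little data to call their sentences shunned.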

Are our contributions used to train artificial intelligence?

Quite possibly, although I don't know of any current projects actively using the data.

Definitely. A search for "tatoeba machine translation" on the arXiv https://search.arxiv.org/?in=&q...%20translation returns 177 papers.

Tatoeba is not an ideal corpus to train machine translation models because it is rather small and not representative of the sentences encountered in real life. However, it is sometimes used to complement other resources.
Besides, as it covers many language pairs and has quite good translations, Tatoeba is often used to evaluate the performance of these machine translation models. The Tatoeba Translation Challenge and the XTREME benchmark are good illustrations of this:
https://github.com/Helsinki-NLP/Tatoeba-Challenge
https://sites.research.google/xtreme

More than one audio file per sentence is now possible.
[#280288] Birds of a feather flock together.
Click the audio play button over and over again to hear various voices.
This example sentence has about 30 audio files, as a test.

Sounds great, all of them, but how do I know that the sentence has more than one audio recording?
//
I'm just bird brained... I see all of the audio recordings.
Super update! :)

I see them all listed on the right side in the sentence view. (I'm using a computer; I don't know what this looks like on, for example, a phone.)

Oh, yes, I see all of them, I just have to scroll down a bit.

Wow, that's a pretty cool update!
Would it be possible to add some information about "self-identified" accent? At least some high-level accent information (so, for English, it could be American, British, or Australian)? And maybe something more detailed ("...born in Newcastle, grew up in Liverpool")?

For those interested, you can hear multiple voices from Common Voice on several proverbs.
https://tatoeba.org/en/audio/of/CVAF
83 audio files for the 5 sentences.

** Stats & Graphs **
Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

** Tatoeba Stream #19 **
There will be a live stream tomorrow (Sunday) at 18:30 UTC.
https://youtu.be/3aRkxkbus_A
As usual, there will be a live update of the Tatoeba website. This one will include the long-awaited feature to handle multiple audio per sentence.
See you around!

** Tatominer **
Thanks to Sarchia, Yorwba, dnnywld, AlanF_US, grechka1, Micsmithel, deniko, Selena777, Amastan, ddnktr, janTuki, gatdet, maaster, carlosalberto, small_snow, cojiluc, Sprakify, nipbud, Auride, marafon, aldar, LeeSooHa, nickyeow, samir_t, sebek5000, vahanm, Snezha_aa, Seael, Ergulis, tormented, danepo, H_Liliom, soweli_Elepanto, Rovo, Cabo, evabarczak, Snezha_a and Balamax for their 310 contributions that helped move the project forward this week. You can find these new sentences at https://tatoeba.org/en/sentence...how/169859/und
Check out top searched words that lack sentences or translations in your language at https://tatominer.netlify.app.

** Tatoeba Stream #18 **
There's a live stream scheduled tomorrow (Sunday) at 18:00 UTC.
https://youtu.be/XKSd_rIlpqc
I'll just be doing routine work on Tatoeba. There's the topic of quality control that I didn't have time to talk about in the last stream (#17), but that's going to be for another time, since there are quite a lot of other things I need to catch up on.
See you around!

** Tatominer **
Thanks to Cangarejo, Yorwba, deniko, cojiluc, ddnktr, Micsmithel, DavidDias, AlanF_US, Luornu, small_snow, carlosalberto, Vincent68, dnnywld, grechka1, aldar, glavsaltulo, Lehon, talou, sebek5000, skanne, Amastan, Sarchia, Snezha_aa, maaster, danepo, Rafik, Russell_Ranae, gatdet, nonong, madjidoumnia, samir_t, marafon, Silja, nipbud and martinod for their 331 contributions that helped move the project forward this week. You can find these new sentences at https://tatoeba.org/en/sentence...how/169859/und
Check out top searched words that lack sentences or translations in your language at https://tatominer.netlify.app.