Wall (6,616 threads)

Tips

Before asking a question, make sure to read the FAQ.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.

Cangarejo, May 28, 2022 at 9:41:47 PM UTC, edited May 28, 2022 at 10:39:59 PM UTC

Hi. I have a very long list of sentence IDs. Is it possible to automate the process of creating a list? Is 140,000 sentences too much for a list?

https://ufile.io/sgmx84oa

Yorwba, May 29, 2022 at 10:32:33 AM UTC

It's possible to automate it, but it would require you to write some custom scripts.

Whether 140,000 sentences is too much depends on what you intend to do with the list. There are bigger lists on Tatoeba already, so storing them is not a problem.

gillux, May 29, 2022 at 1:11:10 PM UTC

May I ask how you compiled the list of sentence IDs? What's your purpose?
I'm just curious about how people use Tatoeba :-)

Cangarejo, May 29, 2022 at 2:33:51 PM UTC

Tatoeba has a lot of sentences covering some words, but few sentences covering other words. What I did was take a dictionary and pick ten random sentences for each word in the dictionary. This was an experiment to create a more balanced subcorpus of English. I'm not sure how useful this list is.

Some languages are more well behaved than others, in the sense that they use fewer words not in dictionary form. English is pretty well behaved. Portuguese, French, and German not as much. Chinese languages are probably the most well behaved.

Cangarejo, May 30, 2022 at 7:59:15 PM UTC, edited May 31, 2022 at 8:22:42 AM UTC

This is just an idea to help tackle the problem of bored users translating from English.

It's possible to build on it by, for example, giving preference to sentences that belong to certain lists or that have audio.

My simple script is posted below.

https://we.tl/t-9mjAchLkoE


Also, the current script to switch a user's sentences to CC0 only switches original sentences. It should be relatively straightforward to go through the entire list of links looking for pairs where one sentence is CC0 and the other belongs to the user. Those sentences can also be switched. A few runs of the script may be necessary in order to switch all sentences, but at least the script doesn't require any fancy graph algorithms.
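A minimal sketch of that repeated pass, in JavaScript. The data shapes (a `Map` of sentence records with `owner` and `license` fields, and an array of ID pairs for the links) and the name `propagateCc0` are assumptions made for illustration; they are not Tatoeba's actual schema or its existing CC0 script. The loop reruns until a full pass changes nothing, which corresponds to the "few runs" mentioned above.

```javascript
// Hypothetical data shapes (assumptions, not Tatoeba's schema):
//   sentences: Map of sentence ID -> { owner, license }
//   links:     array of [idA, idB] pairs (direct translations)
function propagateCc0(sentences, links, user) {
  let passes = 0;
  let changed = true;
  while (changed) {
    changed = false;
    passes += 1;
    for (const [a, b] of links) {
      // Check each link in both directions.
      for (const [from, to] of [[a, b], [b, a]]) {
        const source = sentences.get(from);
        const target = sentences.get(to);
        if (!source || !target) continue;
        const eligible =
          source.license === "CC0" &&
          target.owner === user &&
          target.license !== "CC0";
        if (eligible) {
          target.license = "CC0";
          changed = true; // a newly switched sentence may unlock others
        }
      }
    }
  }
  return passes; // number of passes, including the final one that changed nothing
}
```

Each pass is a plain scan over the links, so no graph traversal is needed; the cost is a few extra scans when chains of translations are long.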


And also, sentences that don't have any audio could have a button for reading the sentence out loud using the voice synthesizer on the user's device. It should be possible to do that for some languages.
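One way such a button could work in a browser is through the Web Speech API (`speechSynthesis` and `SpeechSynthesisUtterance`). The sketch below is only an illustration: the function name `speakSentence` is hypothetical, the two extra parameters exist just so the function can be exercised outside a browser, and the mapping from Tatoeba's language codes to the BCP 47 tags used by voices is assumed to be done by the caller.

```javascript
// Reads `text` aloud if the device has a voice for `lang` (a BCP 47 tag
// such as "en" or "pt-BR"). Returns false when no suitable voice exists.
function speakSentence(
  text,
  lang,
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  if (!synth || !Utterance) return false; // API not available here
  const prefix = lang.toLowerCase().split("-")[0];
  // Note: some browsers return an empty list until "voiceschanged" fires.
  const voice = synth
    .getVoices()
    .find((v) => v.lang.toLowerCase().startsWith(prefix));
  if (!voice) return false; // this device cannot read that language
  const utterance = new Utterance(text);
  utterance.voice = voice;
  utterance.lang = voice.lang;
  synth.speak(utterance);
  return true;
}
```

Returning `false` lets the page hide or disable the button for languages the user's device cannot synthesize.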

AlanF_US, May 31, 2022 at 1:11:24 PM UTC

Could you please describe the script in more detail, especially for those people who can't read scripts, or don't want to download a file without knowing more about it?

Cangarejo, May 31, 2022 at 9:03:04 PM UTC, edited May 31, 2022 at 10:55:23 PM UTC

The script includes every sentence that's written in English, together with its ID. It also includes a list of dictionary words. Each word begins with an empty list of sentence IDs. The script splits each sentence into words and for each word in a sentence, the sentence's ID is added to that word's list of sentence IDs. After that part of the algorithm is done, each word has become associated with the list of the IDs of all the sentences that contain that word. Then ten random sentences for each word are picked and duplicates are eliminated. There's nothing more to it.

The code can be found at the very end of the file. The rest is just data. My implementation is huge and clunky because JavaScript doesn't really allow local pages to open local files, as a security precaution. JavaScript isn't meant to be used for this kind of task.

By the way, most words in the dictionary have few or no sentences.
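For readers who would rather not download the file, the algorithm described above can be sketched in a few lines. This is not the script from the linked file: the function name `buildBalancedList`, the input shapes and the naive tokenizer are all assumptions made for illustration.

```javascript
// sentences:  array of { id, text }
// dictionary: array of words in dictionary form
// Returns a de-duplicated array of sentence IDs, with up to `perWord`
// randomly chosen sentences for each dictionary word.
function buildBalancedList(sentences, dictionary, perWord = 10, random = Math.random) {
  // Each dictionary word starts with an empty list of sentence IDs.
  const index = new Map(dictionary.map((word) => [word.toLowerCase(), []]));

  for (const { id, text } of sentences) {
    // Naive tokenization (an assumption): lowercase, then split on
    // anything that is not a letter or an apostrophe.
    const words = new Set(text.toLowerCase().split(/[^\p{L}']+/u).filter(Boolean));
    for (const word of words) {
      const ids = index.get(word);
      if (ids) ids.push(id);
    }
  }

  const picked = new Set(); // a Set eliminates duplicates across words
  for (const ids of index.values()) {
    // Partial Fisher-Yates shuffle: draw `count` distinct IDs at random.
    const pool = ids.slice();
    const count = Math.min(perWord, pool.length);
    for (let i = 0; i < count; i++) {
      const j = i + Math.floor(random() * (pool.length - i));
      [pool[i], pool[j]] = [pool[j], pool[i]];
      picked.add(pool[i]);
    }
  }
  return [...picked];
}
```

Because only exact dictionary forms are matched, inflected words are missed, which is the "well behaved" issue mentioned earlier in the thread.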

lbdx, June 1, 2022 at 5:51:16 AM UTC

Thank you for sharing this issue with the community. I too would like to easily create and update very long lists on Tatoeba. Ideally, a button placed in the list header would allow us to edit its sequence of sentence IDs directly on the site.

nipbud, June 1, 2022 at 3:51:54 PM UTC

The API endpoint for it is actually extremely simple. You need to have cookies from the site, so you can either transfer those to a program like curl, or just paste the code into the browser console on any page on the site. This seems to work for me:

// Adds one sentence to one list. Run it from the browser console on any
// page of the site while logged in, so that the site's cookies are sent.
async function addToList(listId, sentenceId) {
  await fetch(`/is/sentences_lists/add_sentence_to_list/${sentenceId}/${listId}`);
}

// First argument: list ID; second argument: sentence ID.
addToList("170279", "3329778");

hecko, June 1, 2022 at 8:12:51 PM UTC

i have a script to add cc0 sentences to a list, and one thing that hasn't been said before is that it's rather slow
there seems to be a rate limit on the server side where you can't add more than 1 sentence per second
so the adding in your case will take almost 2 days
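Putting the endpoint quoted above together with the reported one-per-second limit, a batch version might look like the sketch below. It is an illustration, not hecko's or nipbud's script: the name `addAllToList`, the `delayMs` and `fetchFn` options and the failure list are assumptions, and the endpoint path is copied as posted without being verified. At one request per second, 140,000 sentences need about 39 hours.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Adds every ID in `sentenceIds` to the list, one request at a time,
// pausing `delayMs` between requests to stay under the rate limit.
// Returns the IDs whose request did not succeed, so they can be retried.
async function addAllToList(listId, sentenceIds, { delayMs = 1000, fetchFn = fetch } = {}) {
  const failed = [];
  for (const sentenceId of sentenceIds) {
    // Endpoint path as posted earlier in this thread (not verified here).
    const response = await fetchFn(
      `/is/sentences_lists/add_sentence_to_list/${sentenceId}/${listId}`
    );
    if (!response.ok) failed.push(sentenceId);
    await sleep(delayMs);
  }
  return failed;
}
```

Like the single-sentence version, this is meant to be pasted into the browser console while logged in, for example `addAllToList("170279", ["3329778"])`.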

Cabo, May 27, 2022 at 1:42:32 PM UTC

Do we have a problem?
When you ask for a random sentence, each member's sentences come up, on average, in proportion to the share of the corpus that member has written. Okay, it's obvious. We already know that.
But!
Some members write English sentences incredibly fast, producing batches of similar ones, and slowly but steadily these take over the 'random sentences from everyone' section. Once a member reaches one hundred thousand sentences out of one million, the system will show their sentences 10% of the time. (If I don't translate them, and others don't either, the percentage gets even higher when we search for sentences that haven't been translated yet.)
Is it bad?
Judge for yourself. What I expect is: first, we will switch to browsing only the contributors we like; second, fewer newcomers will be willing to contribute once they see the lack of diversity; third, the number of untranslated sentences will climb.

Why are you seeing more and more Tom, Ziri, Sami, Layla sentences?
Simple: either you haven't translated enough of them to stop seeing them, or someone is writing them faster than you can translate.

Yorwba, May 28, 2022 at 3:04:14 PM UTC, edited May 28, 2022 at 3:47:08 PM UTC

For the random sentence on the homepage, we could change it to select a random user first and then select one of their sentences, effectively giving equal weight to everyone, no matter how many sentences they have. But for the random ordering of search results, this would be a bit harder to implement. Maybe we could add an option to the search page to show at most one sentence per user.

Other ways to cope that you haven't mentioned yet:
- translate from languages other than English
- write English sentences yourself
- use Tatominer
- sort sentences by creation date and translate mostly sentences added before July 2011
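The homepage idea in the first paragraph of this post (pick a random user first, then one of that user's sentences) can be sketched as follows. The function name `pickRandomSentence` and the `{ id, user }` record shape are assumptions for illustration; the real site would do this with a database query rather than in memory.

```javascript
// Picks a user uniformly at random, then one of that user's sentences,
// so every contributor gets equal weight regardless of sentence count.
function pickRandomSentence(sentences, random = Math.random) {
  const byUser = new Map();
  for (const sentence of sentences) {
    if (!byUser.has(sentence.user)) byUser.set(sentence.user, []);
    byUser.get(sentence.user).push(sentence);
  }
  const users = [...byUser.keys()];
  if (users.length === 0) return undefined; // nothing to pick from
  const user = users[Math.floor(random() * users.length)];
  const own = byUser.get(user);
  return own[Math.floor(random() * own.length)];
}
```

With nine sentences from one member and one from another, the second member's sentence is chosen half of the time instead of one time in ten.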

ddnktr, May 28, 2022 at 6:50:43 PM UTC

I like the suggestion about adding a feature to display at max one (or however many) sentence(s) per user.

We have an option in the advanced search that allows us to search for sentences by a particular user. I think it would also be useful if there was a feature that allowed us to exclude sentences from certain users. If you could enter multiple usernames into a field, separating them with commas like tags, I think that would be really useful and would clear up problems like Cabo's.

I have another example of a case in which this feature would be useful. Sometimes I search for longer sentences to translate for variety, since the majority of sentences on Tatoeba are fairly short. The easiest way to do this, as far as I can tell, is to perform an advanced search with the "sort by" set to "fewest words first" and then to click the "reverse order" button. This shows all the longest sentences first. The downside I encountered is that some contributors (generally speaking, only a handful per language) upload long texts from classical literature or other public domain works. There's nothing wrong with that, of course, but if I want to just translate original (written by user) sentences, it means I have to go through them individually and sort them, which is a little time consuming.

Speaking of original sentences, it would also be cool if we could implement an "original only" check box in the advanced search, in a similar manner to the function on the "sentences" page for each user. That way I could search for original sentences by a user that haven't been translated into a given language. Maybe someone has mentioned one or more of these points before, but that's just what I'm thinking. I don't know how easy/difficult it would be to implement these changes.

Yorwba, May 29, 2022 at 11:00:34 AM UTC

> I like the suggestion about adding a feature to display at max one (or however many) sentence(s) per user.

I wrote a new issue on GitHub: https://github.com/Tatoeba/tatoeba2/issues/2943

> I think it would also be useful if there was a feature that allowed us to exclude sentences from certain users.

This kind of feature has been requested before. https://github.com/Tatoeba/tatoeba2/issues/2008

> it would also be cool if we could implement an "original only" check box in the advanced search

This has also been requested before. https://github.com/Tatoeba/tatoeba2/issues/2159

> Maybe someone has mentioned one or more of these points before, but that's just what I'm thinking.

When you tell us what you're thinking and it happens to be something that someone else has talked about before, that's a good thing, because it helps us find out which features people would like to have the most.

Cabo, June 1, 2022 at 2:10:16 PM UTC

It seems like a good idea: just check how many times the same user's sentences appear and limit that.
It would be great to see a mixture of different contributors' translations and sentences.

lbdx, June 1, 2022 at 6:16:55 AM UTC

> If I don't translate them, and others don't either, the percentage gets even higher when we search for sentences that haven't been translated yet

@Cabo, I share your concern. If nothing is done, these "shunned sentences" will indeed be more and more visible on Tatoeba.

To solve this issue, perhaps shunned sentences should be detected and unapproved after a few years without any translation. The shunned sentences could be the sentences added by a contributor over a period of time and significantly less translated than their peers from other contributors.

Cabo, June 1, 2022 at 2:05:13 PM UTC, edited June 1, 2022 at 2:06:20 PM UTC

"To solve this issue, perhaps shunned sentences should be detected and unapproved after a few years without any translation."

Significantly more people translate from English than from, for example, Finnish.
Who or what will be assigned to detect such sentences?
If I don't see those sentences in the random search because someone with only a hundred sentences wrote them, and others don't see them either, are they considered shunned after a few years?

lbdx, June 1, 2022 at 4:06:22 PM UTC, edited June 1, 2022 at 5:25:06 PM UTC

I'll try to be a little more specific with an example. Let's imagine that member X added several English original sentences in 2019 and that today, the percentage of these sentences that have been translated is 10 times lower than the average among all English contributors of 2019. If this difference in performance is statistically significant, I think it would be wise to consider the untranslated sentences that member X added in 2019 as "shunned" and to reduce their visibility in favor of more promising sentences.

This system would be automated and would allow the Tatoeba Project to take advantage of the "wisdom of the crowd" to self-regulate and stay relevant over time.
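As a rough illustration of the check described above, the sketch below compares one member's translation rate with that of their peers for the same year. Everything here is an assumption rather than a proposal from the thread: the name `looksShunned`, the input shape, the choice of a one-sided two-proportion z-test, and the 1% threshold (`zCritical = 2.326`).

```javascript
// member, peers: { total, translated } counts for sentences added in the
// same period; `peers` should exclude the member's own sentences.
function looksShunned(member, peers, { ratio = 10, zCritical = 2.326 } = {}) {
  if (member.total === 0 || peers.total === 0) return false;
  const memberRate = member.translated / member.total;
  const peerRate = peers.translated / peers.total;

  // Pooled two-proportion z-test (normal approximation).
  const pooled =
    (member.translated + peers.translated) / (member.total + peers.total);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / member.total + 1 / peers.total)
  );
  if (standardError === 0) return false; // everything or nothing translated

  const z = (memberRate - peerRate) / standardError;
  // Both conditions must hold: at least `ratio` times lower, and
  // significantly lower at the chosen one-sided level.
  return memberRate * ratio <= peerRate && z <= -zCritical;
}
```

The significance condition protects small contributors: a member with three untranslated sentences has a rate of zero, but the sample is too small for the difference to count.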

Eccles17, May 15, 2022 at 2:23:10 AM UTC

Are our contributions used to train artificial intelligence?

DJ_Saidez, May 15, 2022 at 5:40:28 AM UTC, edited May 15, 2022 at 5:40:33 AM UTC

Quite possibly, although I don't know of any current projects actively using the data.

Yorwba, May 15, 2022 at 8:50:30 AM UTC

Definitely. A search for "tatoeba machine translation" on the arXiv https://search.arxiv.org/?in=&q...%20translation returns 177 papers.

lbdx, June 1, 2022 at 6:41:08 AM UTC

Tatoeba is not an ideal corpus to train machine translation models because it is rather small and not representative of the sentences encountered in real life. However, it is sometimes used to complement other resources.

Besides, as it covers many language pairs and has quite good translations, Tatoeba is often used to evaluate the performance of these machine translation models. The Tatoeba Translation Challenge and the XTREME benchmark are good illustrations of this:
https://github.com/Helsinki-NLP/Tatoeba-Challenge
https://sites.research.google/xtreme

CK, May 30, 2022 at 4:47:04 AM UTC, edited May 30, 2022 at 4:54:12 AM UTC

More than one audio file per sentence is now possible.

[#280288] Birds of a feather flock together.

Click the audio play button over and over again to hear various voices.
This example sentence has about 30 audio files, as a test.

Cabo, May 30, 2022 at 8:39:44 AM UTC, edited May 30, 2022 at 9:11:12 AM UTC

Sounds great, all of them, but how do I know that the sentence has more than one audio recording?
//
I'm just bird brained... I see all of the audio recordings.

Super update! :)

Thanuir, May 30, 2022 at 9:04:07 AM UTC

I see all of them listed on the right side of the sentence view. (I'm using a computer; I don't know what this looks like on a phone, for example.)

Cabo, May 30, 2022 at 9:09:19 AM UTC

Oh, yes, I see all of them, I just have to scroll down a bit.

deniko, May 30, 2022 at 11:05:09 AM UTC, edited May 30, 2022 at 11:07:55 AM UTC

Wow, that's a pretty cool update!

Would it be possible to add some information about "self-identified" accent? At least some high-level accent information (so, for English, it could be American, British, or Australian)? And maybe something more detailed ("...born in Newcastle, grew up in Liverpool")?

CK, June 1, 2022 at 3:59:08 AM UTC

For those interested, you can hear multiple voices from Common Voice on several proverbs.

https://tatoeba.org/en/audio/of/CVAF

83 audio files for the 5 sentences.

sharptoothed, May 29, 2022 at 7:15:40 AM UTC

** Stats & Graphs **

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

TRANG, May 28, 2022 at 4:10:20 PM UTC

** Tatoeba Stream #19 **

There will be a live stream tomorrow (Sunday) at 18:30 UTC.

https://youtu.be/3aRkxkbus_A

As usual, there will be a live update of the Tatoeba website. This one will include the long-awaited feature to handle multiple audio files per sentence.

See you around!

lbdx, May 28, 2022 at 7:24:18 AM UTC

** Tatominer **

Thanks to Sarchia, Yorwba, dnnywld, AlanF_US, grechka1, Micsmithel, deniko, Selena777, Amastan, ddnktr, janTuki, gatdet, maaster, carlosalberto, small_snow, cojiluc, Sprakify, nipbud, Auride, marafon, aldar, LeeSooHa, nickyeow, samir_t, sebek5000, vahanm, Snezha_aa, Seael, Ergulis, tormented, danepo, H_Liliom, soweli_Elepanto, Rovo, Cabo, evabarczak, Snezha_a and Balamax for their 310 contributions that helped move the project forward this week. You can find these new sentences at https://tatoeba.org/en/sentence...how/169859/und

Check out top searched words that lack sentences or translations in your language at https://tatominer.netlify.app.

TRANG, May 21, 2022 at 8:07:59 PM UTC

** Tatoeba Stream #18 **

There's a live stream scheduled tomorrow (Sunday) at 18:00 UTC.

https://youtu.be/XKSd_rIlpqc

I'll just be doing routine work on Tatoeba. There's the topic of quality control that I didn't have time to talk about in the last stream (#17) but that's going to be for another time since there's quite a lot of other things I need to catch up on.

See you around!

lbdx, May 21, 2022 at 7:57:45 AM UTC

** Tatominer **

Thanks to Cangarejo, Yorwba, deniko, cojiluc, ddnktr, Micsmithel, DavidDias, AlanF_US, Luornu, small_snow, carlosalberto, Vincent68, dnnywld, grechka1, aldar, glavsaltulo, Lehon, talou, sebek5000, skanne, Amastan, Sarchia, Snezha_aa, maaster, danepo, Rafik, Russell_Ranae, gatdet, nonong, madjidoumnia, samir_t, marafon, Silja, nipbud and martinod for their 331 contributions that helped move the project forward this week. You can find these new sentences at https://tatoeba.org/en/sentence...how/169859/und

Check out top searched words that lack sentences or translations in your language at https://tatominer.netlify.app.