Wall (6,616 threads)
Horus shouldn't have to link to Hungarian sentences.
#708238 wasn't the first case of a wrong link made by Horus.
As I understand it, Horus doesn't create new links; it moves all the links of the sentences being merged. If wrong links arise as a result, that should be because someone else had linked the merged sentences incorrectly, not really because of what Horus does.
How do you say "anti manchas" in Latin?
How do you say "correpasillos" in Latin?
I don't understand how "No one is too old to learn" and "It is never too late to learn" are considered synonymous.
English —> Romanian
Languages in my Profile: English, Turkish, Romanian, Malay
Nu e niciodată prea târziu pentru a învăța.
It is never too late to learn.
It's never too late to learn.
You're never too old to learn.
One is never too old to learn.
No one is too old to learn.
No man is so old he cannot learn.
No one is so old but he can learn.
Nobody is too old to learn.
You will never be too old to learn.
You'll never be too old to learn.
Öğrenmek için asla çok geç değildir.
Öğrenmek için asla geç değildir.
Since this is specifically about a given sentence, I think you should leave the comment on the sentence itself, instead of on the Wall.
[#2794282] Nu e niciodată prea târziu pentru a învăța. (carlosalberto)
I wrote it on the wall because I had complained about a similar issue before for a particular sentence, but was told that if English sentence X is translated into Turkish sentence Y with some ambiguity, then if sentence Y is translated into Romanian sentence Z, even though Z is not a correct translation of X due to the ambiguity introduced by Y, Z doesn't need to be fixed per Tatoeba policy.
As long as this Romanian sentence can be correctly translated by all the directly-linked English sentences, then there is no problem.
@CK — Here's another example of sentences that are grouped together as part of the results of a search, but clearly one or more of the translations were incorrect at some point, causing subsequent translations to be "incorrect" like in the game of "Telephone".
• Languages in my profile: English, Turkish, Romanian, Malay
• Results URL: https://tatoeba.org/en/sentences/show/672252
Here are the same results sorted in order of sentence ID to show the chronology among the sentences:
#596092 — He is running.
#672252 — You run.
#672254 — He runs.
#672264 — She runs.
#870252 — Koşuyor.
#1473297 — Sen koşarsın.
#1872079 — She is running.
#7088820 — He's running.
Koşuyor (which means "he/she is running") was based on 596092.
Sen koşarsın (which means "you run") was based on 672252.
But how did the first two English sentences get linked together? How can "He is running" be ever equivalent to "You run"? Because of this initial mistake, subsequent sentences in other languages were incorrect when compared to the initial sentence, but correct when compared to the incorrect sentence earlier on.
When searching for "You run" from English to Romanian, I get:
which is 100% wrong because "Aleargă" refers to the 3rd person singular (he/she runs/is running) and definitely not the 2nd person singular "you run". When I wrote a comment about this error for that Romanian translation, the author of that translation replied that she translated it from the TURKISH sentence "koşuyor" which is translated from "HE is running" which was incorrectly tied to "YOU run" based on what I wrote above.
Hope I explained it clearly. It's convoluted!
While translation errors happen, the effect of "Chinese whispers" is real.
In 2020 we had a "cloud" of 400 thousand sentences, all linked to each other through other languages. Some people tried to break the cloud apart by finding wrong translations, so it's probably smaller now. I need to run my crawler again somehow to take a look at it.
As an example, there was a path (through direct translations) from sentence "Tom is beautiful" to "You know what I'm saying?". Other sentences in that same cloud, each of them being somehow linked through multiple translations to each other:
That's why if you get A as an indirect translation of B, and it's clearly a wrong translation, most probably... it's just ok. Indirect translations are cool and good candidates for becoming direct translations, but be aware that you can't really rely on them too much if your goal is to find translations of B.
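For anyone curious how such a path or "cloud" can be found, here is a minimal sketch in Python. It assumes the links are available as pairs of sentence IDs. The toy data is just the handful of links mentioned in this thread (the English and Turkish "run" sentences plus the Xhosa sentence #1806428), and the function names are made up for illustration. This is not the crawler mentioned above.

```python
from collections import deque

# Toy data: direct links reported in this thread, as (sentence ID, sentence ID).
# A real run would load the full set of links instead (not shown here).
LINKS = [
    (672252, 1806428),   # "You run."       <-> Xhosa sentence
    (596092, 1806428),   # "He is running." <-> Xhosa sentence
    (596092, 870252),    # "He is running." <-> "Koşuyor."
    (672252, 1473297),   # "You run."       <-> "Sen koşarsın."
]


def build_graph(links):
    """Turn link pairs into an undirected adjacency map."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of direct links, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def cloud(graph, start):
    """All sentences reachable from `start` through any number of direct links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(LINKS)
print(shortest_path(graph, 1473297, 870252))
print(len(cloud(graph, 672252)))
```

Every hop in the path is a legitimate direct translation, yet the two ends ("Sen koşarsın" and "Koşuyor") do not mean the same thing, which is the "Chinese whispers" effect in miniature.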
There is no such thing as indirect translation. I already said it once or more.
The problem is that we keep telling people there are indirect translations. No, they are NOT translations. They may become translations, but they are not connected to the sentence the way its actual translations are. The team uses the term "indirect translations" again and again. What people hear from it is only this: "THEY ARE TRANSLATIONS SOMEHOW." and they think: "YEEEY. I CAN CONNECT THEM!"
They are translations of translations, that's what they are.
I actually prefer the term "an indirect translation" to "a translation of a translation" as more elegant.
I do acknowledge it might be a tad more confusing to those who don't understand how they work and that you shouldn't assume they could be linked without giving it a good think.
Still, Cabo has a point, and recalling my first days using Tatoeba, I think it's very confusing that *non-translations* ("indirect translations" are indeed NOT actual translations whatsoever!) show up among translations. Of course it didn't help that Clozemaster used "indirect translations" for creating exercises, thus reinforcing the impression that (often completely wrong) "indirect translations" are somehow Tatoeba links.
Honestly, I would consider hiding "indirect translations" behind some dropdown or I don't know, maybe give users the choice to show them anyway but not as the default. They are too irrelevant to be presented among the actual translations.
"Honestly, I would consider hiding "indirect translations" behind some dropdown or I don't know, maybe give users the choice to show them anyway but not as the default. They are too irrelevant to be presented among the actual translations."
+1 or +0,5, because it creates big problems for newcomers, though probably not for advanced contributors. But when an advanced contributor decides to link whatever they can find, having to click an extra button every time is a nuisance.
In my setup (the old design), "indirect translations" or "translations of translations" are represented graphically by a gray-scale version of the same arrow that represents the direct translations. IMO, one step on the way would be to use a big question mark instead of an arrow.
In the new design, the same observation holds, although the arrows are replaced by ">" signs. And there is a text prompt: "Show XX more translations", counting both direct and non-direct translations, which should be replaced by something more suitable.
> ... hiding "indirect translations" ... maybe give users the choice to show them anyway but not as the default.
Making the showing of "indirect translations" an opt-in choice, rather than the default, would make things less confusing.
This way, new members and non-members would not be confused by these.
Minimally, perhaps we could make "indirect translations" not show to anyone with less than "advanced contributor" status, since they don't have the option to link these.
[Edited on 2022-06-20 to include the following Github link.]
Make the showing of "indirect translations" an opt-in choice
This is actually how I translated it in the Icelandic UI: "Þýðingar af þýðingum". I feel like the sentence view is where most people see "indirect translation" and get confused.
+ times [number of languages available]
> But how did the first two English sentences get linked together? How can "He is running" be ever equivalent to "You run"? Because of this initial mistake...
These sentences are linked together via this Xhosa sentence: #1806428
I don't know if there's a mistake. I guess not, even Spanish or Italian can have the same verb form for "you" and "he".
So, "You run" and "He is running" are (presumably) both valid translations of the same Xhosa sentence. In this case, it's a bit confusing to call one of these English sentences a "translation of a translation" of the other one, though it's not wrong.
If you don't want to see indirectly linked sentences, you could set "Link" to "direct" in the advanced search. It's set to "any" by default.
Tatoeba is designed to provide indirect translations, even if they can't possibly always be exact. If you think it over, it's hard to imagine a way to achieve exactness, given the inherent ambiguity of languages.
In the current design it's clearly impossible to adjust one's translation to both the current sentence and the original sentence (which might be Japanese, that you can't read). And that wouldn't even be the whole solution.
Even if Tatoeba were only a collection of (English) sentences and their direct translations into other languages, you couldn't be sure that the Turkish translation is exactly equivalent to the Romanian translation.
> But how did the first two English sentences get linked together? How can "He is running" be ever equivalent to "You run"?
You will notice that these are not directly linked. It's not only possible that pronouns can be different when translating, since some languages don't use pronouns, but verb forms can easily be different for indirect translations.
Sounds like a proverb. Proverbs do not need to be exact translations of each other.
** Blind-linking **
A couple of weeks ago, some members raised awareness on issues around “blind-linking”, that is linking to a language that you don’t know at all. For instance, I don’t know Arabic at all, so if I link an existing English sentence to an existing Arabic sentence, that would be blind-linking.
The topic started in this thread:
My personal feeling was that blind-linking hasn’t really caused huge issues. But the sentiment turned out to be different for other members, and it seems that the issue is bigger than what I had perceived.
To be honest, I do not have data to evaluate how bad the situation is but I can definitely see how letting people link between whatever language they want can lead to a higher risk of mistake. My suggestion was that we could implement in the source code of Tatoeba a rule that restricts people to link only in languages listed in their profile.
Obviously, that doesn’t solve it all because one could always add every language to their profile if they wanted to link between all languages. But at least it gives some sort of implicit guideline as to what is a good practice and what is a bad practice. If I want to link an English sentence to an Arabic sentence, I would have to add Arabic to my profile. But if I have no intention to learn anything about Arabic, then it would be weird to add Arabic to my profile. As a result, I might think twice about linking and decide that it’s for the best if I don’t do it until I actually have an interest in learning Arabic.
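To make the proposed rule concrete, here is a rough sketch of the check in Python. It is only an illustration of the idea, not Tatoeba's actual source code. The function name and the example profile are hypothetical, and languages are assumed to be identified by three-letter codes such as `eng` and `ara`.

```python
def can_link(profile_languages, sentence_lang_a, sentence_lang_b):
    """Allow a link only if both sentences are in languages listed in the profile.

    Hypothetical helper: `profile_languages` is whatever the member has
    listed in their profile, regardless of skill level.
    """
    known = set(profile_languages)
    return sentence_lang_a in known and sentence_lang_b in known


# Hypothetical profile without Arabic: an English-Arabic link is refused.
profile = ["eng", "fra"]
print(can_link(profile, "eng", "ara"))

# The known loophole: once Arabic is added to the profile, the link is allowed.
profile.append("ara")
print(can_link(profile, "eng", "ara"))
```

As said above, the check cannot tell whether someone really knows a language. It only makes adding the language to the profile a deliberate step.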
In the thread where this topic started, there were a few people in favor of implementing this rule. I also personally think it makes sense to have this rule in place. But I would like to hear more opinions about this. As far as I’ve checked, there’s a non-negligible number of contributors who link from/to languages that are not in their profile. If you are one of them, I’m particularly interested to hear your thoughts on this. Would this make your life harder for no reason? Do you think it is a bad thing to discourage people from linking in languages that they don’t know at all?
Anyway, until we figure out what to do here, I would kindly ask everyone to avoid linking to languages they don’t know at all :)
I would like to add one question: there are languages that are very close to each other, for example, I've never studied Ladino or Asturiano but I can understand >90% of the sentences in these languages because they're very close to Spanish. Would they count as languages I don't know at all or would it be OK to link them? (I generally only link between the sentences indicated in my profile and Spanish but sometimes I'm tempted to link sentences in these two languages with their translations in more studied languages so that they get more visibility).
If you can understand 90%, then you know the language a bit. So it's OK to link :)
I added Old Norse to my profile for this and noted that's why it's there, even though I can't really write it. I feel like that's sane
At some point I looked at the numerals of various languages and linked sentences counting from one (or zero) to ten to each other. It seemed a fairly safe thing to do, and I was careful with, for example, the grammatical genders of the words for "one".
This was easier because I was able to link languages that were not in my profile. On the other hand, I suspect it didn't add significant value to the corpus either, and forbidding it would not be a great loss.
Anyway, until we figure out what to do here, I would kindly ask everyone to avoid linking to languages they don’t know at all
It may seem too much but I would encourage people to only link the sentences in their native language to the ones in the other languages they know. At least, that's what I try to do myself. It would help us avoid a lot of mistakes.
Horus also linked sentences incorrectly.
> Horus also linked sentences incorrectly.
I suppose this depends on the original links before Horus merged the exact duplicates
observation: the more translations a sentence has, the more useful each translation is due to indirect links (e.g. an english sentence translated into spanish and french helps spanish people learning french)
problem: boring sentences are easier to translate and thus are more likely to have a lot of translations, so those who translate interesting sentences are making their contributions less useful (in the language pair sense)
proposed solution: make short lists of interesting sentences and get people to focus on those, thus mutually boosting the usefulness of the translations
note: i have *no* idea as to whether or not this is actually a good way of doing things, and especially if it's any better than just using tatominer and looking for translated sentences outside the first page of search results
I don't know... maybe we already have that list. It's called favorites.
actually very good point! thank you!
It's kinda subjective what makes an interesting sentence, so making an algorithm for that is hard
Tatominer tries to do something similar, going through words that don't have translations or example sentences
i was thinking of this being a manual thing but now that you bring it up i just *gotta* try automating it
though if i use machine learning for this (and i probably will) i'll have to take care not to include too many sentences with `Tom` in them in the "these are boring" dataset, else it'll latch onto that as the sole indicator of novelty
I mean, how wrong would that be? ;)
Maybe you've just worn yourself out translating those lazy sentences.
You know, the short ones, when you don't even have to sweep your eyes through them, in the first millisecond you know the translation and adding a translation just takes way more time.
If machine learning is the next step, I have a better idea. Let the machine translate those similar-looking, child's-play easy, monotonous, time- and energy-wasting sentences.
a few problems with that unfortunately
for one machines aren't always good at making natural sentences, which is why i try to only use google translate *after* i've already had a stab at doing it myself so that i'm not tempted to accept its version as "good enough"
for two some languages just don't have enough data to be machine-translated, and in fact tatoeba is often used as a data source for them
and for three if we do it on tatoeba and submit them (e.g. for proofreading) we'll have even *more* boring sentences, now in dozens of languages
I agree with Sobz' third point. We should avoid contaminating the corpora of other languages by translating the least interesting sentences.
The best thing to do, in my opinion, is to translate only the sentences that we like the most, so that over time they become a bigger and bigger part of the collection.
We can put those sentences elsewhere. I'm thinking that if we really start translating with machine learning, it would probably be best to create a separate section for them and not advertise those sentences as voluntary contributions.
"What is Tatoeba?
Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members."
This theme comes up again and again and nothing happens.
The ones who told me they hated those sentences also asked what else they could do, and kept translating them until they quit. Well, I don't really understand people. See e.g. al_ex_an_der's sentences. Those are interesting sentences and one can learn from them. No one translates them. Why do you think he quit Tatoeba? And when someone adds a sentence consisting of 1,5 words, contributors devour it like pigs at the swill.
So, I'm totally sceptical.
As I see it, many contributors have started adding only their own sentences, as l'art pour l'art. This may be a consequence of the boring sentences.
Tatoeba is a very good idea, but it seems to me it's going the way the world is going, too, and I can't really see much sense in it.
Maybe we are boring.
My last big try to create useful sentences was this list:
Basic Hungarian - useful for tourists
All of the sentences have English translations, because I tried to find translations to them, and still, many of those English sentences have no other translations.
It's 88 sentences from millions.
> It's 88 sentences from millions.
“My life amounts to no more than one drop in a limitless ocean. Yet what is any ocean, but a multitude of drops?” ― David Mitchell, Cloud Atlas
The third page won't open :(
It opens without showing translations:
I get an Internal Error if I want to see it through search.
Showing translations is also very slow.
Yes, I got it opened but only without translations.
> boring sentences are easier to translate
I don't think this is necessarily true.
If you focus on the frequency of individual words, it does seem like it would have to be the case: more common words are likely both boring and easy to translate, while less common words might be interesting, but also harder to translate.
But sentences are not just individual words, but like bags of them, and you can mix and match the contents to balance difficulty and interestingness, e.g. by using mostly common words to keep the sentence easy to translate and then spicing it up with one slightly rarer word to make it more interesting.
My current setup for artisanal mass production of (hopefully) interesting sentences involves a tatominer-style list of not-too-common words which I take inspiration from, plus a frequency-checking script to see whether I accidentally used any even rarer words, and a length limit to prevent me from going full 18th-century novelist with run-on sentences that are so long your eyeballs faint from exhaustion without making it even half the way to the verb at the end of the sentence. I use that to try out a few candidate sentences, and then I add the one I like best.
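In case it helps anyone, the frequency check and the length limit boil down to something like the sketch below. This is a simplified illustration rather than the actual script: the function name, the thresholds, and the tiny rank table are made up, and the tokenizer assumes a language that separates words with spaces or punctuation.

```python
import re

# Runs of letters, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def check_sentence(sentence, rank_by_word, max_rank=20000, max_words=12):
    """Flag words rarer than `max_rank` and sentences longer than `max_words`.

    `rank_by_word` maps a lowercase word to its frequency rank (1 = most
    common). Words missing from the table are treated as rare.
    """
    words = WORD_RE.findall(sentence.lower())
    rare = []
    for word in words:
        rank = rank_by_word.get(word)
        if (rank is None or rank > max_rank) and word not in rare:
            rare.append(word)
    return {
        "word_count": len(words),
        "too_long": len(words) > max_words,
        "rare_words": rare,
    }


# Tiny made-up rank table, just to show the shape of the data.
ranks = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, "zymurgy": 50000}
print(check_sentence("The cat sat on the mat.", ranks, max_rank=10, max_words=8))
print(check_sentence("The zymurgy cat sat on the quokka.", ranks, max_rank=10, max_words=8))
```

A candidate sentence passes when `too_long` is false and `rare_words` contains at most the one slightly rarer word that was meant to be there.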
I'm not sure whether that actually makes me better at writing sentences that are neither too boring nor too hard to translate, because before I adopted this process, my original sentences were mostly only translated by maaster, and after I adopted this process (like, last week, which is why I'm so eager to talk about it) my original sentences are still mostly only getting translated by maaster. Well, at least I myself am not bored.
As you can tell, I was mostly focusing on creating new sentences. But you're right that adding more translations to already translated sentences has additional benefits due to the indirect translations. So trying to make interesting sentences more visible among the sentences we already have is definitely a good idea. I might put my frequency-based bag-of-words model to the task and see whether it comes up with anything... well, interesting.
I recommend translating the sentences that you want others to translate too, and likewise writing the kind of sentences you want others to translate. It is in the nature of the project that everyone may write whatever sentences they like (within certain limits). It is more useful to focus on encouraging those who do good work, or on offering them tools, than to disapprove of those whose choices you disagree with.
In particular, it is worth remembering that the database is used for many different purposes. Different kinds of sentences are useful for different purposes. For some languages, every sentence, even the simplest, is already a great achievement, for example if the language is critically endangered. A beginning language learner can also benefit from simple sentences. If there seem to be too many of them, that is a filtering question; what is needed then are tools that let users filter out certain kinds of sentences.
This list contains last week's contributions. Was this list compiled automatically, instead of manually? If so, it should be possible to create a similar list containing all the sentences that need translation this week for Tatominer; and if that list is too short, maybe the number of words could be increased just to create a bigger list. Maybe the sentences in that list would be more interesting; although I guess Tatominer only works for some languages.
If you don't want to rely on Tatominer, you could get a list of the 20,000 most common words in the language you want to translate from (most adults know at least that many words in their native language) and pick words at random to translate sentences on Tatoeba, the advantage being that Tatoeba's search engine knows about declension and conjugation.
> Was this list compiled automatically?
Every Saturday, my script identifies added sentences that match the words of Tatominer to automatically update this list and generate the thank you message on the wall.
> a similar list containing all the sentences that need translation this week
If I understand correctly, instead of clicking on each word on Tatominer, you would prefer to browse all the sentences at once. To do this, an easier way would be to add a general link on each Tatominer page like https://tinyurl.com/2r22bjek . I'll see what I can do.
I'm curious to see what such a list would look like, but I don't know if I'll give it much use, because I'm already used to translating 100 sentences a week, one for each word; but maybe Sobsz or other users would be interested in using a list like that.
Also, maybe the thank-you message should include a link to last week's contributions.
Also also, I'd rather not open TinyURL links...
But thanks for the hard work!
> add a general link on each Tatominer page like https://tinyurl.com/2r22bjek
I share Cangarejo's aversion to link shorteners, but I followed the link anyway and I think that's a good idea.
I've mostly covered the words Tatominer suggests for cmn-deu, and the remaining ones happen to occur mostly in a few sentences each that are a bit tricky to translate. I'm sure I accidentally skipped a few I could manage to translate, but checking each word individually is a bit demoralizing.
For some reason, it never occurred to me to just use an OR query. Well, now that you suggested it, I can just do it myself. https://tatoeba.org/en/sentence...roved=no&user= 641 results, I'm sure I'll find something to translate in there. (And I understand now why you used a link shortener.)
Maybe these sentences should be in random order. Thank you for your time.
You can get that list in random order using the following advanced search.
I know. I'm just saying that maybe it should be the default on Tatominer. Thanks.
It's not possible to set lists to display sentences in a random order at this time. That's why I suggested using the advanced search showing results in a random order.
The button labeled "All Phrases" on the Tatominer page doesn't display a list, it generates an advanced query.
personally i don't mind it being sorted by shortest first, it means i can go to a page in the middle and it'll have sentences of roughly the length that i like
In that case, the order should be fewest words first, instead of by "relevance".
> Maybe these sentences should be in random order.
Thanks for your comments, but for now, I'm keeping the relevance sort by default to be consistent with the links in the list.
On the topic of sentences with lots of translations, I notice that the most active user on Tatoeba links his sentences to others written in languages that are not listed in his profile. I'm fairly sure he doesn't know all the languages he links to. He does this linking usually after he has added an additional English translation where an existing English translation is already linked to these languages. I suppose the reasoning is this: as the two English sentences have the same meaning, any sentences linked to the original English sentence must also be linked to his new sentence.
But if I'm right, and he can't read those languages, I wonder if it is really acceptable for him to link his sentences to them. The obvious problem with it is that he will reproduce errors in the corpus where existing sentences are linked incorrectly. I would also say that links between languages made by people who don't understand those languages are inherently untrustworthy, and don't serve the purpose of having an accurate corpus.
Does anyone else link their sentences to languages they can't read? Maybe many of you follow the reasoning above, and you do link in such cases. Maybe I'm just being naive again here.
The user in question is an admin, and he's been doing it for years, so you'd be forgiven for thinking that it's acceptable. If the general opinion here is that it's fine, and you all do it, I suppose I might start doing it, too.
I also think it's a problem that a very active contributor with admin status disregards the guidelines of this community. How do you convince other members to follow them then?
Maybe this contributor could explain the reasons for his behavior to us?
This sounds like a risky practice. For example, English contractions or their absence, as in "I am/I'm", say something about the register of the sentence, or perhaps about which words are stressed. Another language may signal register or emphasis in ways that are further apart from each other than these two, in which case linking only one of the sentences would be appropriate. Detecting such differences is hard even if you know the languages reasonably well, and essentially impossible if you don't know them at all. The same goes for the difference between a period and an exclamation mark.
On the other hand, one can ask how serious this is. Perhaps it is more valuable to get lots of precisely correct links, even if a few slightly more dubious links come along with them? One rarely makes an outright mistake if one is careful, and mistakes happen even when you know both languages.
Perhaps I would be inclined here to trust links and link removals made in good faith: link the sentences that belong together, remove incorrect links when you notice them, and if a real disagreement turns up, start a discussion about it. After all, there is also great value in people daring to fix errors and gaps without constantly asking for permission, probably greater than in eliminating rare and subtle errors.
On the other other hand, if someone's working methods produce a lot of errors, there is reason to step in. But that hasn't been observed in this case, has it?
> If the general opinion here is that it's fine, and you all do it, I suppose
> I might start doing it, too.
Feel free to do it, or at least experiment with it. That is not something I would do myself, but I would say that it's fine as long as this practice isn't creating incorrect links, or at least not a harmful amount.
As far as I'm aware, we never had a strict rule that contributors should only link sentences in the languages that they know.
If it turns out to be a very problematic behavior, we can always change the source code so that people can only link sentences in languages in their profile. We can also track down which links have been created by a contributor who doesn't know the language they are linking to, and we can remove those links if it's really that bad.
I don't think "blind-linking" should be encouraged because it's much easier to break things than to fix them.
Even if they are rare, bad links are often seen many times before they are removed. In the meantime, every time a learner recognizes a bad link, it makes them question the validity of all the other links.
To build confidence, I think we should leave it to those who know best to validate direct translations.
I fully agree.
For the last ten years, I've seen a lot of bad links created that way. I think it's an existing problem that we shouldn't make worse.
Thank you lbdx, Thanuir, TRANG and marafon for your replies.
I won't be doing any blind-linking. I feel that I have no business linking my sentences to languages I don't understand, regardless.
Let's hope there are enough contributors who have the time and competency in the respective languages to root out the errors introduced, and also the courage to make their objections to these translations known in the comments.
Would you say it's a problem that is bad enough that we should completely prevent people from linking to a language that is not listed in their profile?
Is it a problem that happens with people who freshly became advanced contributors or is it an ongoing issue with all levels of contributors?
I guess it wouldn't solve the problem, given that you can list as many languages as you want. But this solution may reduce the harm and I'd definitely vote for it.
Surprisingly, it mostly concerns the old contributors, and even if they have some languages listed in their profile, it doesn't mean they master them well enough to link correctly. The most typical example is sentences with the formal/informal "you" (vous/tu): some contributors link them quite carelessly.
It may seem too much but I would encourage people to only link the sentences in their native language to the ones in the other languages they know. At least, that's what I try to do myself. It would help us avoid a lot of mistakes.
> Would you say it's a problem that is bad enough that we should completely prevent people from linking to a language that is not listed in their profile?
Yes, I would vote for this.
Other people have made good points about the damage that could be done by blind linking. I want to raise another issue, which is that the benefits that anyone can introduce by blind linking are effectively nonexistent.
If someone with zero knowledge of a language can conclude something about a sentence in that language (namely, that sentence A can be linked to both sentence X and near-duplicate sentence Y), then by definition, that person adds no knowledge, and no value, by inserting the A-Y link. All the knowledge was added by the translator who added the A-X link.
In case anyone wants an example to make this less abstract:
English sentence #3736098 ("I thought perhaps you could help us") was added as an original on January 1, 2015. It was linked by a Macedonian speaker to Macedonian sentence #4168849 ("Мислев дека можеби ќе можеш да ни помогнеш") on May 10, 2015, and then by an Indonesian speaker to Indonesian sentence #5199962 on June 27, 2016. The author of the original sentence then linked it to the effectively duplicate sentence #7173983 ("I thought that perhaps you could help us") on September 18, 2018. Finally, two days ago, he linked both the Macedonian and the Indonesian to the new near-duplicate, despite the fact that neither language is listed in his profile.
It's a mystery to me why he would spend his time with contributions that provide zero value rather than translate from Japanese, a language that he DOES know, into English.
In Russian, the translations should be different:
"I thought perhaps you could help us" - "Я думал, возможно ты сможешь нам помочь"
"I thought that perhaps you could help us" - "Я думал, что возможно ты сможешь нам помочь"
Probably, that also goes for some other languages.
>> It's a mystery to me why he would spend his time with contributions that provide zero value >>
Some people use Tatoeba's sentences for their own projects. For example, they might need at least an approximate translation to introduce a new English sentence to speakers of another language. Perhaps that is the reason. For such cases, it is possible to download the whole Tatoeba database and modify it in any way for personal use, without affecting Tatoeba itself. I think Tatoeba's owners should give users more information about that.
I wouldn't think it's all so apodictic.
If sentence A has exactly the same meaning as sentence B, one can safely assume that valid translations of A are also valid translations of B.
There is no obvious flaw in this logic. One could even imagine an algorithm that does the linking when someone adds a same-language "translation".
Of course, invalid translations of A are also invalid for B - this would result in a duplication of errors.
The added value may be infinitesimally small, but above zero nonetheless. Imagine sentence A contains a specific term that is used in Northern Germany. Someone adds the same sentence with the Southern equivalent ... and a Turkish sentence now has both as direct translations.
Grammatical modifications (hum, like adding or leaving out "that" in relative clauses) ... well, basically the same argument, but of course it would be enough to just have examples of both ways in different sentences.
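The automatic linking imagined above could look something like the following minimal sketch. It assumes links are stored as a plain mapping from a sentence id to the set of ids it is linked to; `propagate_links` and the ids in the usage note are made up for illustration and are not how Tatoeba actually stores or creates links.

```python
def propagate_links(links, original, variant):
    """Copy every translation link of `original` onto `variant`.

    `links` maps a sentence id to the set of sentence ids it is linked to
    (an assumed representation). Links are kept symmetric. Wrong links on
    `original` are copied as well, which is the duplication of errors
    mentioned above.
    """
    links.setdefault(original, set())
    links.setdefault(variant, set())
    for translation in list(links[original]):
        if translation != variant:
            links[variant].add(translation)
            links.setdefault(translation, set()).add(variant)
    # Finally link the two same-language sentences to each other.
    links[original].add(variant)
    links[variant].add(original)
    return links
```

For example, starting from `{"de_north": {"tr"}, "tr": {"de_north"}}` and calling `propagate_links(links, "de_north", "de_south")` leaves the Turkish sentence directly linked to both German variants.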
If two languages both allow the word "that" (Finnish "että") to be left out of certain sentences without the meaning clearly changing, then the sentences without "that" should be linked to each other and the sentences with "that" should be linked to each other, but not crosswise.
I know that he is fishing. - Tiedän että hän on kalastamassa.
I know he is fishing. - Tiedän hänen olevan kalastamassa.
That way, each sentence has its more precise translation linked and lacks the less precise one, compared with linking all of them to each other.
That's probably the same argument that Selena has brought up for Russian translations.
It would imply a rule that says: Translate as closely as possible and try to mirror the grammar of your source sentence.
German can do a similar thing with relative clauses, and I have the feeling there is a small difference though I can't tell exactly which, so I wouldn't link the two German possibilities.
If a native speaker of English thinks there is no difference between the English alternatives, I would accept crosswise links as equally good.
I would be surprised if there were no subtle register differences here.
Usually, language in which things are left out and shortened is (slightly or much) more informal. I can't say for English in this particular case, but in academic writing, for example, it is recommended to avoid many contractions and abbreviations.
I was thinking about an algorithm that tries to calculate the novelty of sentences here on Tatoeba. What I came up with was this: the algorithm takes all sentences in a language, divides them into segments, and then counts how often each segment occurs. Then calculating the novelty of a sentence boils down to calculating the average of the number of times each segment in the sentence occurs in the corpus.
What I mean by segment is this: the segments of length 9 of the sentence
"I wasn't very hungry anyway."
include
"I wasn't "
" wasn't v"
"n't very "
and so on. The lower the frequency score, the higher the novelty. Languages that use a larger character set or that have a smaller corpus would need to use a shorter segment length. The longer the length, the higher the number of distinct segments. If the length is too long, the algorithm will run out of memory.
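Here is a minimal sketch of the scoring described above, not the code behind the demos. The function names and the handling of sentences shorter than the segment length are assumptions made for illustration.

```python
from collections import Counter


def segments(sentence, length=9):
    """Return all overlapping character segments of the given length."""
    if len(sentence) < length:
        # Assumption: a sentence shorter than the segment length is one segment.
        return [sentence]
    return [sentence[i:i + length] for i in range(len(sentence) - length + 1)]


def build_counts(corpus, length=9):
    """Count how often each segment occurs across all sentences of one language."""
    counts = Counter()
    for sentence in corpus:
        counts.update(segments(sentence, length))
    return counts


def frequency_score(sentence, counts, length=9):
    """Average corpus frequency of the sentence's segments (lower = more novel)."""
    segs = segments(sentence, length)
    return sum(counts[s] for s in segs) / len(segs)
```

A sentence whose segments never occur in the corpus gets a score of 0, the most novel possible. The `Counter` holds one entry per distinct segment, which is where the memory limit for long segment lengths comes from.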
I've written demos for you to try out.
The algorithm can be used to create lists of potentially more interesting sentences based on the novelty score or to give users some sort of feedback on their sentences.
It's possible to cheat the algorithm though by using uncommon names in sentences. Uncommon names don't add much to the corpus, but they'll lower the score anyway.
This is just an idea. Maybe someone else can improve on it.
Dictionary-based compression algorithms like Lempel-Ziv basically already implement this in a way that's usable even for very large files. I guess you'd just take some FOSS code or implement the dictionary generation algorithm, then stop and count the dictionary size instead of compressing. Sounds kinda fun, I've never looked closely at how they work.
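To make the "count the dictionary size" idea concrete, here is a small sketch using LZ78-style parsing, one member of the Lempel-Ziv family. Choosing LZ78 is an assumption; the post above does not say which variant is meant, and `lz78_phrase_count` is a made-up name.

```python
def lz78_phrase_count(text):
    """Parse `text` LZ78-style and return the number of dictionary phrases.

    Each new phrase is the shortest prefix of the remaining text that is not
    yet in the dictionary. Repetitive text yields fewer phrases than varied
    text of the same length, so the count is a rough measure of novelty.
    """
    dictionary = set()
    phrase = ""
    for ch in text:
        phrase += ch
        if phrase not in dictionary:
            dictionary.add(phrase)
            phrase = ""
    # A trailing phrase that is already in the dictionary adds nothing new.
    return len(dictionary)
```

One way to use it would be to compare the count for a corpus with and without a given sentence appended: the more new phrases the sentence contributes, the more novel it is.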
Another member of Tatoeba (lbdx) has done some similar work :)
Oh... I didn't know about that.
His algorithm appears to work at a word level. I don't know exactly how he does it, but does he take into account that some languages have more conjugations and declensions and string together words, more than others? I had in mind an algorithm that would work at a character level.
> Does he take into account that some languages have more conjugations and declensions and string together words, more than others?
My novelty scoring method depends strongly on the morphological typology of the languages. Therefore, it is only useful when comparing sentences of the same language. But I think it is a convenient tool to monitor the balance of corpora and identify the most creative contributors in a language:
I'm sorry for asking, but does the Tatoeba project stand with Ukraine? (I haven't seen a message about the Russian invasion of Ukraine anywhere.) So I feel like asking. I want to note that most people in Ukraine are bilingual, and Russian is the native tongue for most people in Ukraine (if not for everyone).
We already had an argument about this a couple months ago, and I don't know if there was a choice of taking an official stance on the invasion or not
I know the argument was about replacing the Russian flag with another symbol (an alternative Russian flag that didn't reflect the government), and Trang opted not to because of the linguistic need of having a recognizable symbol for the language (the Russian flag), and suggested waiting a few years to see if Russians would recognize it as a symbol
Tatoeba is not a Ukrainian or Russian project and barely targets politics. Frankly, I'm glad it's like this. I'm glad that I don't feel forced into statements just because I find this project a valuable opportunity to share language content.
Also, I don't think Tatoeba as a community effort should even take responsibility for possible world views of people, as long as there is nothing illegal or threatening.
** Stats & Graphs **
Tatoeba Stats, Graphs & Charts have been updated:
You're welcome! :-)
I would like to know what your attitude is towards dialects or regional languages.
I asked several young people from different countries and almost all of them told me that "dialects are only spoken by old people."
Some dialect speakers avoid dialect words and keep their accent close to the standard language, but regional language speakers seem to have more positive attitudes than dialect speakers, although the first two situations also occur among them.
Do you speak a regional language or a dialect? What are your attitudes toward them?
I grew up with American influenced Spanish, and when I joined here with my Spanish still questionable, I learned a more standard dialect to avoid being chastised for contributing incorrect sentences. I think that the Spanish I learned and that my family spoke/speak is just as valid a communication language as the Spanish taught in schools.
I’m in a slightly similar situation with English where I’m still refining my skills and have some details in my speech that set me apart from speakers of standard English. Maybe it doesn’t follow all conventions, but it gets the job done.
In fact, almost everyone speaks a dialect. Only a few people who were specially trained speak a "standard language". Regional languages are usually spoken by native ethnic groups. At least, that's what I know from my experience.
Dialects and general traditions like that are great, but Tatoeba is obviously more useful to most people when it's standard language. I'd actually like it if there was a way to subgroup languages on here, but people kinda do that with tags already
Standard language is also called a standard dialect.
I've added some sentences using dialects, but I mainly use standard language, partly because I'm in the part of the country where the standard dialect originates from and mainly because it can be understood everywhere clearly.
Sentences written in dialects and colloquial language are just as important as, and perhaps even more important than, sentences in the standard written language. Official, standardized language can be found in many places, whereas learning dialects is harder.
I still write mostly standard-language sentences, but I sometimes make exceptions. For a larger and better-represented language, though, a wealth of different registers would be a real asset.
Virtually no Indonesians speak the formal Bahasa Indonesia, unless it's the news or we are deliberately talking to foreigners. Every day, our speech is mixed with the regional languages. I don't think these are dialects but rather code-mixes. Of course I think all of them are valid and it's fun listening to my friends from the other side of the country speaking with their own mixed features, but those are not suitable for a corpus, at least in the context of national standard language purposes.
In Tatoeba, almost all the sentences I make are in the standard language, though there are a couple sentences with *nation-wide* colloquialisms here and there (should probably be tagged haha). I avoid code-mixed sentences.
Re: regional languages
I speak Javanese and "acquired" a bit of Banjarese living in its area for two years. Like any language, both of them have many dialects and some "main" prestigious dialects. In Tatoeba, I usually use dialect-neutral words, but sometimes I do write a dialectal sentence.
The thing about languages in Indonesia is that the writing/spelling systems are not standardized (or even if they are, common people don't know about that), leading to a chaos of uneducated Indonesianized spellings that aren't even faithful to the languages' morphophonology at times.
This, together with the lack of interest in regional language education, the mindset of "regional language = ethnocentrism!1! Use Indonesian instead!1!1!", and the official use of regional languages only as a "symbol" without usage campaigns, puts a lot of the languages here in danger.
Luckily there has been increasing interest in language preservation over the years, and although we need big changes to show significant progress, small steps are also important, which is my reason to write Javanese (+Javanese script) and Banjarese sentences here too.
tl;dr I'd do my best to help save regional languages because they are cool
Thank you all for your answers. :)