Wall (7,365 threads)
Tips
Before asking a question, make sure you have read the FAQ.
We aim to maintain a friendly atmosphere for civilized discussions. Please read our rules against bad behavior.
I have always had this idea for Tatoeba, but I understand how hard it would be to achieve, and it might not fit the purpose of Tatoeba. Still, I want to share it here:
I wish there were a "dictionary" function for Tatoeba. That doesn't mean adding definitions or etymology to words; it would be a translation dictionary.
Here's how I think it should work:
When you search for "Chinese" via this dictionary function, you might get a few sentences like this:
I am Chinese. - 我是中國人。
I speak Chinese. - 我說中文。
I wish contributors could tag how the word "Chinese" is translated into the target language. For example, you could mark:
I am [Chinese]. - 我是[中國人]。
I speak [Chinese]. - 我說[中文]。
This way, we could get statistics on how the word "Chinese" is translated into the target language. For example, when you look up "Chinese", it could tell you that "Chinese" is most often translated as "中國人", followed by "中文".
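To make the idea concrete, here is a minimal sketch of how such statistics could be computed from bracket-tagged pairs. The bracket notation is the one used in the post; the function names are hypothetical, and pairing the i-th bracket on each side is an assumption, not an existing Tatoeba feature.

```python
import re
from collections import Counter

# A bracketed span marks the aligned word on each side, as in the post.
BRACKET = re.compile(r"\[([^\[\]]+)\]")

def aligned_pairs(source: str, target: str) -> list[tuple[str, str]]:
    """Pair the i-th bracketed span in `source` with the i-th in `target`."""
    return list(zip(BRACKET.findall(source), BRACKET.findall(target)))

def translation_counts(tagged_pairs, word: str) -> Counter:
    """Count how often `word` is aligned with each target-language span."""
    counts = Counter()
    for source, target in tagged_pairs:
        for src_word, tgt_word in aligned_pairs(source, target):
            if src_word == word:
                counts[tgt_word] += 1
    return counts

# The two tagged examples from the post.
tagged = [
    ("I am [Chinese].", "我是[中國人]。"),
    ("I speak [Chinese].", "我說[中文]。"),
]
counts = translation_counts(tagged, "Chinese")
for translation, n in counts.most_common():
    print(n, translation)
```

With a large tagged corpus, `most_common()` would give the ranking described in the post.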
I think Tatoeba is the best place to do this. Wiktionary is too complicated, and it can never provide as many sentences as Tatoeba.
I think there are many obstacles to making this work with a "Tag" system:
- It would require a lot of extra work from contributors.
- For most language pairs, there is very poor word-for-word matching anyway, e.g.: [#3378275] > [#4919050]
I think that, pragmatically, if you understand both the source and target languages, you can already use advanced search to get a good idea of how a word gets translated on Tatoeba (Tatoeba is not necessarily representative):
https://tatoeba.org/en/sentence...rd_count_min=1
Interestingly, for "Chinese" the breakdown seems to be more like:
148 "中文"
63 "汉语"
58 "中国"
42 "中国人"
18 "漢語"
16 "汉字"
16 "中國人"
16 "中國"
14 "漢字"
11 "中"
10 "中餐"
7 "普通话"
4 "国"
2 "华人"
2 "中式"
1 "華"
1 "繁体字"
1 "简体字"
1 "汉"
1 "本地"
1 "國字"
1 "华"
1 "中菜"
1 "中华"
19 "other (mostly implicitly Chinese things)"
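Several rows in this breakdown are simplified/traditional variants of the same word. As a rough sketch, they could be merged like this. The counts are copied from the list above; the variant mapping is hand-written for illustration and is an assumption, not something Tatoeba provides.

```python
from collections import Counter

# Counts copied from the breakdown above: 中文 for comparison, plus the
# rows that have a simplified/traditional counterpart.
raw_counts = {
    "中文": 148, "汉语": 63, "中国": 58, "中国人": 42, "漢語": 18,
    "汉字": 16, "中國人": 16, "中國": 16, "漢字": 14,
}

# Hand-written assumption: traditional form -> simplified form.
TO_SIMPLIFIED = {"漢語": "汉语", "中國人": "中国人", "中國": "中国", "漢字": "汉字"}

merged = Counter()
for word, n in raw_counts.items():
    merged[TO_SIMPLIFIED.get(word, word)] += n

for word, n in merged.most_common():
    print(n, word)
```

With these numbers, 中文 stays first at 148, while 汉语/漢語 together come to 81 and 中国人/中國人 to 58.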
I think for sentences that have poor word-to-word matching, we should simply add a button to mark them as "implied" or something.
For the example of [#3378275] > [#4919050], we cannot match "Chinese" to anything, so we can mark it as "implied". However, we can still create matches like [Chinese characters] > [字] and [characters] > [字]
One of the reasons I am suggesting this idea is that some people may not understand both languages well. So a matching of how words are translated could help them understand the sentence structures more easily.
> I think Tatoeba is the best place to do this. Wiktionary is too complicated,
If Tatoeba implements your suggestions, then it will become more complicated. Possibly even as complicated as Wiktionary.
BTW, you are not required to provide etymology when contributing to Wiktionary. A lot of other information is also optional. You can try to figure out what the minimal bare-bones entry for a Chinese word is and maybe discuss this with the other contributors if something is not clear. Start from here: https://en.wiktionary.org/wiki/...try_guidelines
I guess, one of the challenges is that Wiktionary treats Chinese as a big family of many similar languages or subdialects (Mandarin, Cantonese, Wu and the others), each having its own distinctive features. And this may cause friction, because some people may be in favour of eradicating the differences for the sake of simplification and unification, while the other people may be interested in preserving their local dialect as their precious cultural heritage.
> and it can never provide as many sentences as Tatoeba.
Wiktionary surely has stricter quality requirements for its content, so it indeed can't match Tatoeba's quantity of sentences. That said, Tatoeba's content is also categorized via tags and it's possible to filter out dubious sentences (contributed by non-native speakers, etc.).
I don’t frequently contribute to the English Wiktionary, which seems more streamlined due to its large community of maintainers. However, as a contributor to the Chinese Wiktionary, I find the infrastructure much more challenging. Many templates and modules are incomplete or overly complex; editing a single entry often takes hours and is prone to errors.
Furthermore, Chinese grammar is often a subject of intense debate, frequently leading to edit wars. I’ve also noticed inaccuracies in Korean and Chinese entries, particularly when they are translated from English by contributors who may not be proficient in the target language.
My point is that we could effectively build a translation dictionary within Tatoeba. For instance, when looking up a word, seeing both the keyword and its translation in bold—similar to how some dictionaries highlight equivalents—would greatly benefit learners in understanding sentence structures.
Regarding the preservation of Chinese languages, Tatoeba handles this exceptionally well. By treating Mandarin, Cantonese, Wu, and others as distinct entities, it helps maintain the linguistic integrity and "purity" of each language.
The content of this message goes against our rules and has therefore been hidden. It is only shown to administrators and to the author of the message.
Is it acceptable for a member who is not fluent in a language (but understands it to a good extent) to use Google Translate (or similar) to translate sentences into that language?
For instance, could someone with an A2 knowledge of Polish use GT to translate sentences from English to Polish, proof-reading them for obvious mistakes, and then putting them into the corpus?
I think machine translation is acceptable when you can ensure its quality. Not to mention that a lot of native speakers write bad translations, and those are not even as good as machine translations.
I don't see this adding significant value to the database. I would rather recommend translating from Polish into your own native language, or into another language you know well enough to vouch for your own sentences, and linking the Polish sentences you understand to sentences in other languages that you also understand.
I think it is acceptable. Translators (such as GT or DeepL) nowadays are pretty good. Also, it is not such a problem if that person translates simple sentences or translates into a language that is pretty similar to the languages they know well.
I would rather see people contributing in their own native languages.
Machine translation with AI has become much better than it used to be, so I can understand the temptation to use it.
Using AI to translate from a foreign language you know into your own native language might help you think of wordings you might not have otherwise thought of.
If you, as a native speaker, verify that it sounds natural, and if you know the source language well enough to verify that the meaning is the same, I think AI can be a useful tool.
Verifying that the meaning is the same is highly problematic. That's why I started translating sentences with the "by Arthur Conan Doyle" tag and submitted by a native English speaker. The best part about them is that I can look up context (the text before and after them) by searching them on the Internet. This is much better than many of the other short sentences on Tatoeba that are ambiguous regarding the gender of the speaker or other things.
> I started translating sentences with the "by Arthur Conan Doyle" tag
Oh, and to make things perfect, I would like to have a CC0 license for both the English sentences and my translations. The works of Arthur Conan Doyle are public domain now. Is this possible?
I agree.
Contributing in a language that is not your strongest
https://en.wiki.tatoeba.org/art...ow/non-native#
Thanks for that link, @marafon. I also encourage people to consult the Rules and Guidelines ( https://en.wiki.tatoeba.org/art...ow/guidelines# ). To find it in the future, go to the bottom of any Tatoeba page and click on "Tatoeba Wiki". The first section on that page contains a link to the "Rules and Guidelines" page.
Tatoeba's mission is to serve as a source of high-quality sentences and high-quality translations. This is ensured by having them written, verified, and owned by humans who know the languages well enough to avoid introducing any mistakes, including subtle ones, not just the obvious ones mentioned by @Vortarulo.
If Tatoeba starts also acting as a consumer of sentences that do not come directly, transparently, and legally from humans, we run the very real risk of generating a cycle in which we pass these poor-quality or misappropriated sentences on to the people who get them from us.
Thanks for your answers, everyone, especially the link to the Guidelines, AlanF_US. I had looked for them but didn't find them. This also goes with my impression.
The question wasn't about me, by the way, but about a user who contributes quite a lot here, but apparently with mostly(?) GT-translated sentences. Perhaps I should point them to the Guidelines.
How do you know that the translations are from Google Translate?
> How do you know that the translations are from Google Translate?
It's not rocket science. Sometimes a translation is obviously wrong. And when the original sentence is fed into Google Translate, the bad translation precisely matches the Google Translate output.
> Sometimes a translation is obviously wrong. And when the original sentence is fed into Google Translate, the bad translation precisely matches the Google Translate output.
Yes, that's how I detect AI translations too.
And secondly, when a user adds 10 or more translations in different languages, it's hard to believe that they speak so many languages.
Just as a sidenote: I recently translated a sentence containing "Tatoeba" with DeepL and in the translation "Tatoeba" was replaced by "wiktionary".
First of all, if you see that a user is contributing a bad translation, then regardless of what you think the source of that translation is, you should be leaving a comment on the translation and tagging it for action (for instance, "@change") if you can. If you see that the user habitually contributes bad translations, and contacting the user is either not feasible or has no effect, you should send a private message or an email to a corpus maintainer, an admin, or the admin team.
There are several ways to come to the conclusion that someone is using AI. One is that the person says so outright. Others are along the lines of what @ssvb and @PaulP have mentioned, namely that a bad translation happens to match, say, Google Translate's output. While this is not proof, if the evidence is pretty strong (for instance, if there are a lot of similar incidents), it's also worth mentioning to the user and/or an administrator.
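If you want to put a number on "a lot of similar incidents", a minimal sketch is below. It assumes you have already collected, by hand, the machine output for the same source sentences; it does not call any translation service. The function names and sample strings are hypothetical, and a high match rate is still only an indication, not proof.

```python
import unicodedata

def normalize(text: str) -> str:
    """Normalize Unicode form, case, and whitespace before comparing."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())

def exact_match_rate(contributed: list[str], machine: list[str]) -> float:
    """Share of contributed translations identical to the machine output.

    machine[i] is assumed to be the output collected for the same source
    sentence as contributed[i].
    """
    if len(contributed) != len(machine):
        raise ValueError("lists must be the same length")
    if not contributed:
        return 0.0
    hits = sum(normalize(c) == normalize(m) for c, m in zip(contributed, machine))
    return hits / len(contributed)

# Hypothetical sample data, for illustration only.
contributed = ["This is a test.", "Good morning!", "I like tea."]
machine = ["this is a  test.", "Good morning!", "I enjoy tea."]
rate = exact_match_rate(contributed, machine)
print(f"{rate:.0%} of the translations match the machine output exactly")
```

Short, simple sentences will often match by coincidence, so this is more telling on longer or unusual sentences.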
I urge people to read the Rules and Guidelines in general. I've submitted an issue to request that a link to the document be added to the footer on each Tatoeba page to make it easier to find.
I checked one of the more complicated sentences (it was in Latin, I believe), because it seemed a bit special. Then I saw that the Google Translate version was exactly the same... and so were Greek, Italian, Galician, Portuguese, Spanish, Turkish, etc.
I asked the user, and they admitted using GT.
> I asked the user, and they admitted using GT.
Then they should stop.
What is the regexp to search for sentences (Turkish to English) containing " ten" (with a leading space)?
xxxxxx ten xxxxxxx ✅
xxxxxx'ten xxxxxxx 👎🏼
xxxxxxten xxxxxxxx 👎🏼
I assume you're asking about an expression accepted by Tatoeba's integrated search function, which is similar but not identical to a regular expression ("regexp").
I don't believe there is a way to write an expression that finds sentences where "ten" is a standalone word but excludes sentences containing a word that ends with "ten" preceded by an apostrophe. The reason is that the tokenization and search split words not only at word boundaries indicated by spaces, but also at punctuation marks (including apostrophe), which are then discarded. One hacky way to get what you want would be to search for "=ten" and then use the browser's search function with "Whole Words" enabled to search through the results.
To get the full functionality you're looking for, I think you'd have to download the sentences you want and then use a tool (such as a text editor's search function with regular expression search enabled) that gives you this level of control.
"=ten" would be great if it works as expected (return sentences with standalone "ten"). Unfortunately, it also returns sentences containing "'ten" (with a leading apostrophe).
I want Sentence #4940054:
Onun güzel bir ten rengi var.
I don't want Sentence #5512986:
2013'ten beri buradayız.
Luckily there are only 3 pages of results, so I can just scroll through and eyeball them. But it would have been nice to be able to get only "ten" and not "'ten".
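For the offline route suggested above (download the sentences, then filter with a real regular expression), a minimal sketch in Python could look like this. It assumes the sentences are already loaded as a list of strings; how you download and read them is up to you. Treating both the straight and the typographic apostrophe as suffix markers is my assumption.

```python
import re

# "ten" not directly preceded or followed by a letter, digit, underscore,
# or apostrophe (straight ' or typographic ’).
STANDALONE_TEN = re.compile(r"(?<![\w'’])ten(?![\w'’])", re.IGNORECASE)

def has_standalone_ten(sentence: str) -> bool:
    """True if the sentence contains "ten" as a word of its own."""
    return STANDALONE_TEN.search(sentence) is not None

sentences = [
    "Onun güzel bir ten rengi var.",  # wanted (#4940054)
    "2013'ten beri buradayız.",       # not wanted (#5512986)
]
wanted = [s for s in sentences if has_standalone_ten(s)]
print(wanted)
```

The lookbehind `(?<![\w'’])` is what rejects "2013'ten", and the same check on `\w` rejects words that merely end in "ten".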
How can I easily view my old sentences, e.g. page 2000?
https://tatoeba.org/de/sentence...ster?page=2000
You can change the number in the URL.
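If you jump between pages often, the same thing can be scripted. This is only a sketch: the `with_page` helper is hypothetical, and the example URL is a placeholder, not the real Tatoeba address (which is truncated above).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def with_page(url: str, page: int) -> str:
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # Keep every other query parameter, drop any existing `page`.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Placeholder URL for illustration only.
url = with_page("https://example.org/list?sort=old&page=2000", 1999)
print(url)
```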