Wall (one thread)

Tips

Before asking a question, make sure to read the FAQ.

We aim to maintain a healthy atmosphere for civilized discussions. Please read our rules against bad behavior.


LeviHighway, November 11, 2025 at 3:12:57 AM UTC

I wish the automatically generated traditional/simplified Chinese were editable, because it is sometimes incorrect. As an advanced contributor, I cannot edit it, and I'm not sure whether corpus maintainers can; the wiki doesn't say.

gillux, November 11, 2025 at 5:17:43 AM UTC

There are plans to make the traditional/simplified Chinese script editable. You can follow the progress here: https://github.com/Tatoeba/tatoeba2/issues/2007

sacredceltic, November 5, 2025 at 6:35:38 PM UTC

It looks like the way default languages work for newly added sentences has changed.
No matter how often I select "auto-detect", every sentence I add in English is immediately identified as French, which is perfectly stupid.

gillux, November 6, 2025 at 10:53:12 AM UTC

Nothing has changed in that respect, except that the model the language detection relies on is updated every week from the Tatoeba corpus (minus sentences tagged @wrong flag). The model has never been perfect, especially on short sentences.

LeviHighway, November 7, 2025 at 1:50:41 PM UTC

Can I learn more about the model? When I add Mandarin sentences, the model always detects them as Cantonese. I know Mandarin and Cantonese are extremely close, so I never use the Detect function at all.

Thanuir, November 9, 2025 at 7:30:06 AM UTC

If you have a larger and a smaller language that are very similar, and you add a sentence in the smaller one, the algorithm may judge it to be closer to the sentences of the larger language.

If the sentence contains features specific to the smaller language (ones the larger language lacks), this happens less often.

EugeneGS, November 9, 2025 at 9:13:00 AM UTC (edited November 9, 2025 at 1:05:39 PM UTC)

Maybe there's also something wrong with the model architecture. I trained a few models myself — one on all Tatoeba data and one only on Mandarin and Cantonese — and both correctly detected about 97% of cases (checked on validation and full datasets).

What's strange is that the Tatoeba model seems to prefer Cantonese, even though it has fewer sentences than Mandarin.

Edit: I have tried another architecture with transformer layers (my first models had LSTM layers). After training on the whole Tatoeba database, it reached 82% accuracy.

frpzzd, November 9, 2025 at 5:47:59 PM UTC (edited November 9, 2025 at 5:48:08 PM UTC)

Is your model training/testing code available online anywhere? If so, I would love to take a look for my own edification, since I've been learning about such topics recently.

EugeneGS, November 9, 2025 at 9:35:22 PM UTC

I've uploaded it on GitHub. The code can be used for pretty much any text classification task.
I honestly didn't expect anyone to be interested, so I'm glad you asked! Some comments in the code might not be super helpful, but if anything's unclear, feel free to reach out via private messages.

https://github.com/kilsense/Tex...2f07/main/LSTM

LeviHighway, November 9, 2025 at 8:56:51 PM UTC

lol I correct myself, it's not *always* Cantonese, but it's pretty frequent. I noticed that most Cantonese sentences on Tatoeba are very long; I guess that affected the model.

Ooneykcall, November 9, 2025 at 9:09:39 PM UTC

I've noticed some weird accounts adding many, usually long, Cantonese sentences, often as translations from other languages including Russian (that's how I noticed it). I suspect their quality is questionable, but unfortunately there are no active native speakers of Cantonese at the moment who could deal with that.

gillux, November 11, 2025 at 5:14:28 AM UTC

Now that you mention it, there could be a bias related to traditional/simplified characters. The model only considers the script the sentence was written in, not the autogenerated alternative script. For Mandarin Chinese, 57% of sentences use simplified characters and 43% use traditional, while Cantonese only uses traditional.

Anyway, the language detector on Tatoeba is based on n-gram statistics, which is very old school compared to technology available nowadays, like transformers. Anybody is welcome to improve or even rewrite it: https://github.com/Tatoeba/Tatodetect
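
For readers unfamiliar with the approach, a character n-gram detector can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Tatodetect's actual code: the function names, the add-one smoothing, and the six training sentences are all invented for the example, and `cmn`/`yue` are simply the ISO 639-3 codes for Mandarin and Cantonese.

```python
import math
from collections import Counter

def char_ngrams(text, n_max=3):
    """Yield every character n-gram of length 1..n_max in text."""
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def train(samples):
    """Count n-grams per language. samples maps a language code to sentences."""
    return {
        lang: Counter(g for s in sentences for g in char_ngrams(s))
        for lang, sentences in samples.items()
    }

def detect(text, model):
    """Return the language whose n-gram counts give text the highest score."""
    # Add-one smoothing so unseen n-grams do not produce log(0).
    vocab_size = len(set().union(*model.values())) + 1
    scores = {}
    for lang, counts in model.items():
        total = sum(counts.values())
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(text)
        )
    return max(scores, key=scores.get)

# Hypothetical toy corpus: simplified-script Mandarin, traditional-script Cantonese.
TOY_SAMPLES = {
    "cmn": ["我不知道他在哪里", "你是谁", "这是什么"],
    "yue": ["我唔知佢喺邊度", "你係邊個", "呢個係乜嘢"],
}
model = train(TOY_SAMPLES)

print(detect("佢係邊個", model))  # yue
print(detect("他是谁", model))    # cmn
```

Because the scores come only from the characters actually seen in training, a detector of this kind is sensitive to which script each language's sentences happen to be written in.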

November 10, 2025 at 3:18:38 PM UTC

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

November 10, 2025 at 9:21:50 AM UTC

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

sharptoothed, November 9, 2025 at 4:57:41 PM UTC

✹✹ Stats & Graphs ✹✹

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/

November 9, 2025 at 9:29:31 AM UTC

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

LeviHighway, November 7, 2025 at 2:31:54 PM UTC

Please enable Traditional-Simplified conversion for Literary Chinese. Currently, people are contributing in either Traditional or Simplified characters, so I think it needs conversion just like Mandarin Chinese.

LeviHighway, November 7, 2025 at 4:16:50 AM UTC

Does anyone know of a website that works like Tatoeba but for vocabulary? I know Glosbe, but it seems they don't ensure quality at all.

frpzzd, November 7, 2025 at 5:16:08 AM UTC

I second this question. I've also seen Glosbe but haven't contributed to it myself because (1) it is not open source, and (2) it does not (as far as I'm aware) allow for bulk data download.

This is probably not what you want because it is mainly between German and other European languages, but I like dict.cc because they allow you to download the dictionary data in its entirety (but it must be requested by email).
https://www.dict.cc

And of course, there is always Wiktionary.

What do you have in mind exactly with "ensuring quality"? Even here on Tatoeba there seems to be quite a bit of debate sometimes when it comes to correcting sentences.

LeviHighway, November 7, 2025 at 6:03:48 AM UTC

Well, Glosbe does not have a community/comment function, and I never managed to contact their staff. It's not organized at all: all contributions are kept, so it's a total mess. You have no control over anything, your contribution might be hidden at the bottom, etc. Tatoeba is much better. I don't like (Chinese) Wiktionary, because it's so complicated; contributing to one entry usually takes a day.

November 7, 2025 at 9:26:17 AM UTC

The content of this message goes against our rules and was therefore hidden. It is displayed only to admins and to the author of the message.

LeviHighway, November 7, 2025 at 4:27:41 AM UTC

The only Chinese corpus maintainer is inactive. How should we deal with the hundreds of Chinese sentences that need to be changed?
https://tatoeba.org/zh-cn/tags/...direction=desc