Wall (7,265 threads)

LeviHighway, 11 November 2025, 03:12:57 UTC

I wish the automatically generated traditional/simplified Chinese could be editable, because it sometimes isn't correct. As an advanced contributor I cannot edit it, and I'm not sure whether corpus maintainers can; it isn't documented on the wiki.

gillux, 11 November 2025, 05:17:43 UTC

There are plans to make the traditional/simplified Chinese script editable. You can follow the progress here: https://github.com/Tatoeba/tatoeba2/issues/2007

sacredceltic, 5 November 2025, 18:35:38 UTC

It looks like the way default languages work for inserted sentences has changed.
Even though I select "automatic detection", every sentence I add in English is immediately identified as French, which is utterly stupid.

gillux, 6 November 2025, 10:53:12 UTC

Nothing has changed in that respect, except that the model the language detection relies on is updated every week from the Tatoeba corpus (apart from sentences tagged @wrong flag). The model has never been perfect, especially on short sentences.

LeviHighway, 7 November 2025, 13:50:41 UTC

Can I learn more about the model? When I add Mandarin sentences, the model always detects them as Cantonese. I know Mandarin and Cantonese are extremely close, so I never use the Detect function at all.

Thanuir, 9 November 2025, 07:30:06 UTC

If you have a larger and a smaller language that are very similar, and you add a sentence to the smaller one, the algorithm may consider it closer to the sentences of the larger language.

If the sentence contains features specific to the smaller language (which the larger one lacks), this happens less often.

EugeneGS, 9 November 2025, 09:13:00 UTC (edited 9 November 2025, 13:05:39 UTC)

Maybe there's also something wrong with the model architecture. I trained a few models myself — one on all Tatoeba data and one only on Mandarin and Cantonese — and both correctly detected about 97% of cases (checked on validation and full datasets).

What's strange is that the Tatoeba model seems to prefer Cantonese, even though it has fewer sentences than Mandarin.

Edit: I have tried another architecture with transformer layers (my first models had LSTM layers). After training on the whole Tatoeba database, it reached 82% accuracy.

frpzzd, 9 November 2025, 17:47:59 UTC (edited 9 November 2025, 17:48:08 UTC)

Is your model training/testing code available online anywhere? If so, I would love to take a look for my own edification, since I've been learning about such topics recently.

EugeneGS, 9 November 2025, 21:35:22 UTC

I've uploaded it to GitHub. The code can be used for pretty much any text classification task.
I honestly didn't expect anyone to be interested, so I'm glad you asked! Some comments in the code might not be super helpful, but if anything is unclear, feel free to reach out via private message.

https://github.com/kilsense/Tex...2f07/main/LSTM

LeviHighway, 9 November 2025, 20:56:51 UTC

Lol, I stand corrected: it's not *always* Cantonese, but it's pretty frequent. I noticed that most Cantonese sentences on Tatoeba are very long, which I guess affected the model.

Ooneykcall, 9 November 2025, 21:09:39 UTC

I've noticed some odd accounts adding many Cantonese sentences, usually long ones, often as translations from other languages including Russian (which is how I noticed). I suspect their quality is questionable, but unfortunately there are no active native speakers of Cantonese at the moment who could deal with that.

gillux, 11 November 2025, 05:14:28 UTC

Now that you mention it, there could be a bias related to traditional/simplified characters. The model only considers the script the sentence was written in, not the autogenerated alternative script. For Mandarin Chinese, 57% of sentences use simplified characters and 43% use traditional, while Cantonese only uses traditional.
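
As a rough illustration of why those percentages could tilt a detector towards Cantonese, the sketch below compares how likely a traditional-script sentence is under each language when only the script is considered. It is not Tatodetect's code: the function names are made up for this example, whether the real detector normalises frequencies per language is an assumption, and the corpus sizes are hypothetical numbers chosen only to show how a size prior would pull the other way.

```python
# Shares of traditional-script sentences, as quoted in the post above.
P_TRADITIONAL = {"cmn": 0.43, "yue": 1.00}


def script_likelihood_ratio(shares, favoured="yue", other="cmn"):
    """How many times more likely a traditional-script sentence is under
    `favoured` than under `other`, looking at script alone (no corpus size)."""
    return shares[favoured] / shares[other]


def posterior(shares, corpus_sizes, lang):
    """P(lang | traditional script) when corpus sizes act as priors."""
    weights = {code: shares[code] * corpus_sizes[code] for code in shares}
    return weights[lang] / sum(weights.values())


ratio = script_likelihood_ratio(P_TRADITIONAL)

# HYPOTHETICAL corpus sizes, not real Tatoeba figures.
sizes = {"cmn": 10_000, "yue": 1_000}
p_yue = posterior(P_TRADITIONAL, sizes, "yue")

print(f"script-only likelihood ratio yue/cmn: {ratio:.2f}")
print(f"P(yue | traditional) with the hypothetical sizes: {p_yue:.2f}")
```

Under these assumptions, a detector that compares per-language relative frequencies without weighting by corpus size would favour Cantonese on script evidence alone, while one that does weight by size would not.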

Anyway, the language detector on Tatoeba is based on n-gram statistics, which is very old-school compared to technology available nowadays, such as transformers. Anybody is welcome to improve or even rewrite it: https://github.com/Tatoeba/Tatodetect
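
For readers unfamiliar with the approach, here is a minimal sketch of language detection from character n-gram statistics. It is not Tatodetect's implementation: the function names (`char_ngrams`, `train`, `detect`), the add-one smoothing, and the toy training sentences are all assumptions made for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n_max=3):
    """Yield every character n-gram of length 1..n_max in `text`."""
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def train(samples, n_max=3):
    """Build per-language n-gram counts from (language, sentence) pairs."""
    counts = {}
    for lang, sentence in samples:
        counts.setdefault(lang, Counter()).update(char_ngrams(sentence, n_max))
    return counts


def detect(counts, sentence, n_max=3):
    """Return the language whose n-gram profile gives the highest log-likelihood."""
    vocab = len({gram for c in counts.values() for gram in c})
    best_lang, best_score = None, -math.inf
    for lang, c in counts.items():
        total = sum(c.values())
        # Add-one smoothing so an unseen n-gram does not zero out the score.
        score = sum(math.log((c[gram] + 1) / (total + vocab))
                    for gram in char_ngrams(sentence, n_max))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data (hypothetical, far too small for real use).
samples = [
    ("eng", "the cat is on the table"),
    ("eng", "this is the house"),
    ("fra", "le chat est sur la table"),
    ("fra", "c'est la maison"),
]
counts = train(samples)
print(detect(counts, "the house"))
print(detect(counts, "la maison"))
```

Because each language's score depends only on how often its training sentences contain the same character sequences, short inputs and closely related languages give such a model little to work with.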

sharptoothed, 9 November 2025, 16:57:41 UTC

✹✹ Stats & Graphs ✹✹

Tatoeba Stats, Graphs & Charts have been updated:
https://tatoeba.j-langtools.com/allstats/