Thread #33106 - Tatoeba

Dear fellow Tatoebans!

As you probably know, the search engine on Tatoeba treats some characters specially. For example, the search results for ="tom", ="Tom", ="ToM" and so on are the same, and searching for ="?" doesn't find anything.

However, it was recently discovered ( https://tatoeba.org/eng/wall/sh...#message_32566 ) that the German ß was not treated correctly. Since such a common character in the sixth most-used language was affected, I suspected that there might be similar problems with characters in other languages and decided to investigate. As it turns out, there are quite a lot!

Some background on the search engine's inner workings: when someone searches for a word like "ToM", a list of mappings T → t, o → o, M → m is used to transform it into "tom" and that is the word that will really be searched for. Any unknown characters are removed. Using this system, it is not possible to treat for example Latin "A", Greek "Α" and Cyrillic "А" as identical without also doing the same for their lower-case equivalents "a", "α" and "а". That would mean Greek search results for words using the Latin alphabet, which may not be desirable.
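To make the mechanism concrete, here is a minimal Python sketch. It is not Tatoeba's actual code: the table `FOLD` and the function `normalize_query` are made up for illustration, and the table only covers ASCII letters and digits.

```python
# Hypothetical folding table: every known character maps to the
# character that is actually searched for.
LOWER = "abcdefghijklmnopqrstuvwxyz"
FOLD = {c: c for c in LOWER + "0123456789"}
FOLD.update({c.upper(): c for c in LOWER})


def normalize_query(query: str, table: dict[str, str] = FOLD) -> str:
    """Map each known character through the table and drop unknown ones."""
    return "".join(table[ch] for ch in query if ch in table)


print(normalize_query("ToM"))  # tom
print(repr(normalize_query("?")))  # '' (nothing left to search for)

# Treating Greek capital alpha (U+0391) like Latin "A" only works if Greek
# small alpha (U+03B1) is folded to "a" as well, because both must end up
# as the same searched-for character.
merged = dict(FOLD)
merged.update({"\u0391": "a", "\u03b1": "a"})
print(normalize_query("\u03b1", merged))  # a
```

With the merged table, a Greek word containing alpha and a Latin word containing "a" become indistinguishable in that position, which is the tradeoff described above.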

So when deciding which characters to make the same, it's necessary to make a tradeoff between finding more results with variant characters and still being able to distinguish distinct words that only differ in which variant they use. Fortunately, it's possible to make a different decision for each language.

Below I have listed characters for which changes may need to be made and the affected languages with sentences containing those characters. I hope everyone reading this can take some time looking at the languages they speak and consider what the best decision in each case is.

Disclaimer: I do not speak for the Tatoeba team and there's no guarantee anything will happen as a result of this post.

# Duplicate Encodings

For historical reasons, there are some identical characters that have multiple computer codes to represent them, but there shouldn't be any need to distinguish them.

ά → ά έ → έ ή → ή ί → ί ό → ό ύ → ύ ώ → ώ Affects: Ancient Greek [grc]

不 → 不粒 → 粒行 → 行 Affects: Cantonese [yue], Literary Chinese [lzh]

# Duplicate Encodings (multiple codepoints)

I'm listing characters involving multiple codepoints separately, because they require larger changes to Tatoeba's search engine. What is a codepoint? Think about it like the keys you press on a keyboard to type a character like "à": the key for the accent and the key for "a". So "à" consists of two codepoints. But there's also "à" with a single codepoint, like a keyboard with a special key to type "à" directly.
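For anyone who wants to see this on their own machine, here is a small sketch using Python's standard `unicodedata` module. It only illustrates the two spellings and Unicode normalization; whether Tatoeba's search engine can apply such a normalization is a separate question.

```python
import unicodedata

decomposed = "a\u0300"  # LATIN SMALL LETTER A + COMBINING GRAVE ACCENT
precomposed = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)  # False

# NFC composes, NFD decomposes; after either one the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Single-codepoint duplicates are unified too: Greek alpha with oxia (U+1F71)
# normalizes to alpha with tonos (U+03AC).
print(unicodedata.normalize("NFC", "\u1f71") == "\u03ac")  # True
```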

à → à á → á â → â ã → ã ä → ä ả → ả å → å ạ → ạ ć → ć ĉ → ĉ ç → ç è → è é → é ê → ê ẹ → ẹ ę → ę ĝ → ĝ ḥ → ḥ ì → ì í → í ỉ → ỉ ị → ị ĵ → ĵ ň → ň ò → ò ó → ó õ → õ ö → ö ỏ → ỏ ọ → ọ ǫ → ǫ ṛ → ṛ ŝ → ŝ ṣ → ṣ ş → ş ṭ → ṭ ù → ù ú → ú ũ → ũ ŭ → ŭ ü → ü ủ → ủ ụ → ụ ý → ý ẓ → ẓ ầ → ầ ấ → ấ ẫ → ẫ ậ → ậ ề → ề ế → ế ễ → ễ ệ → ệ ố → ố ỗ → ỗ ổ → ổ ằ → ằ ắ → ắ ẳ → ẳ ặ → ặ ờ → ờ ớ → ớ ở → ở ợ → ợ ừ → ừ ứ → ứ ữ → ữ ử → ử ự → ự Affects: Berber [ber], Cayuga [cay], Esperanto [epo], Finnish [fin], French [fra], Hungarian [hun], Interlingue [ile], Italian [ita], Kabyle [kab], Lingala [lin], Navajo [nav], Russian [rus], Serbian [srp], Shuswap [shs], Spanish [spa], Swedish [swe], Tatar [tat], Turkish [tur], Turkmen [tuk], Vietnamese [vie], Yoruba [yor]

й → й Affects: Bashkir [bak]

آ → آ أ → أ ؤ → ؤ Affects: Arabic [ara], Persian [pes], Urdu [urd]

ऱ → ऱ क़ → क़ ख़ → ख़ ग़ → ग़ ज़ → ज़ ड़ → ड़ ढ़ → ढ़ फ़ → फ़ Affects: Garhwali [gbm], Hindi [hin], Marathi [mar]

ড় → ড় ঢ় → ঢ় য় → য় Affects: Assamese [asm], Bengali [ben]

ਸ਼ → ਸ਼ ਖ਼ → ਖ਼ ਗ਼ → ਗ਼ ਜ਼ → ਜ਼ ਫ਼ → ਫ਼ Affects: Punjabi (Eastern) [pan]

ோ → ோ Affects: Tamil [tam]

ೀ → ೀ ೊ → ೊ ೋ → ೋ ೇ → ೇ Affects: Kannada [kan]

ോ → ോ Affects: Malayalam [mal]

יִ → יִ ײַ → ײַ שׂ → שׂ אַ → אַ אָ → אָ וּ → וּ כּ → כּ פּ → פּ תּ → תּ בֿ → בֿ כֿ → כֿ פֿ → פֿ Affects: Hebrew [heb], Yiddish [yid]

ָֹ → ָֹ ְּ → ְּ ֳּ → ֳּ ִּ → ִּ ֵּ → ֵּ ֶּ → ֶּ ַּ → ַּ ָּ → ָּ ֹּ → ֹּ ֻּ → ֻּ ְׁ → ְׁ ִׁ → ִׁ ֶׁ → ֶׁ ַׁ → ַׁ ָׁ → ָׁ ֹׁ → ֹׁ ֻׁ → ֻׁ ְׂ → ְׂ ִׂ → ִׂ ֵׂ → ֵׂ ָׂ → ָׂ ֹׂ → ֹׂ َّ → َّ ُّ → ُّ ِّ → ِّ ़् → ़् ့် → ့် Affects: Algerian Arabic [arq], Arabic [ara], Burmese [mya], Hebrew [heb], Hindi [hin], North Levantine Arabic [apc], Persian [pes], Yiddish [yid]

# Near Duplicates

There are some characters which usually look slightly different, but can be used for the same purpose in many situations. The question is whether searching for them on Tatoeba is one of those situations.

ª → a º → o Affects: Danish [dan], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Interlingua [ina], Italian [ita], Japanese [jpn], Lingua Franca Nova [lfn], Portuguese [por], Russian [rus], Spanish [spa], Turkish [tur], Ukrainian [ukr]

² → 2 ³ → 3 ¹ → 1 ⁰ → 0 ⁸ → 8 ⁿ → n Affects: Basque [eus], Choctaw [cho], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Hebrew [heb], Hungarian [hun], Interlingua [ina], Irish [gle], Italian [ita], Japanese [jpn], Mandarin Chinese [cmn], Polish [pol], Portuguese [por], Russian [rus], Shanghainese [wuu], Spanish [spa], Turkish [tur], Ukrainian [ukr]

₁ → 1 ₂ → 2 ₙ → n Affects: Basque [eus], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Hungarian [hun], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Macedonian [mkd], Marathi [mar], Portuguese [por], Russian [rus], Spanish [spa], Turkish [tur], Ukrainian [ukr], Vietnamese [vie]

① → 1 ② → 2 Affects: Japanese [jpn]

𝑎 → a 𝑏 → b 𝑐 → c 𝑒 → e 𝑖 → i 𝑘 → k 𝑚 → m 𝑛 → n 𝑟 → r 𝑥 → x 𝑦 → y 𝘨 → g 𝜀 → ε 𝜋 → π Affects: Esperanto [epo], German [deu], Russian [rus], Spanish [spa]

ℎ → h Affects: German [deu]

ϕ → φ ℵ → א Affects: Ancient Greek [grc], German [deu]

ʰ → h ʷ → w ᵉ → e ᵗ → t ⵯ → ⵡ Affects: Berber [ber], English [eng], French [fra], Kabyle [kab], Khmer [khm], Ngeq [ngt], Waray [war]

ſ → s Affects: Middle French [frm]

ﮐ → ک ﺋ → ئ ﺎ → ا ﺣ → ح ﺹ → ص ﻊ → ع ﻋ → ع ﻞ → ل ﻠ → ل ﻣ → م ﻪ → ه Affects: Ottoman Turkish [ota]

⺟ → 母 ⼀ → 一 ⾯ → 面 ⾷ → 食 Affects: Min Nan Chinese [nan]

# Near Duplicates (multiple codepoints)

ĳ → ij և → եւ ﬁ → fi ﻹ → لإ ﻻ → لا ﻼ → لا Affects: Arabic [ara], Armenian [hye], Dutch [nld], Irish [gle], Ottoman Turkish [ota]

㌔ → キロ ㌘ → グラム Affects: Japanese [jpn]

ำ → ํา Affects: Thai [tha]

ໜ → ຫນ ໝ → ຫມ Affects: Lao [lao]
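Many of the near-duplicate pairs in the two sections above correspond to what Unicode calls compatibility mappings. As a rough illustration (an assumption about how such pairs can be derived, not a description of the search engine), Python's `unicodedata.normalize` with the `NFKC` form reproduces several of them. The helper names `compat_fold` and `codepoints` are made up for this sketch.

```python
import unicodedata


def compat_fold(text: str) -> str:
    """Fold compatibility variants with Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def codepoints(text: str) -> str:
    """Render a string as a list of U+XXXX codepoints."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# A few left-hand sides from the lists above, written as escapes to avoid ambiguity:
# feminine ordinal, superscript two, subscript one, circled one,
# long s, fi ligature, vulgar fraction one half.
samples = ["\u00aa", "\u00b2", "\u2081", "\u2460", "\u017f", "\ufb01", "\u00bd"]
for ch in samples:
    print(codepoints(ch), "->", codepoints(compat_fold(ch)))
```

Note that NFKC applies the same folding to every language, so it cannot express the per-language decisions discussed earlier, and it turns ½ into 1, FRACTION SLASH (U+2044), 2, so the result still depends on how the slash is handled.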

# Case Alternatives

Some characters can look very different when changing between upper case and lower case, but they should probably still be treated the same when searching.

H → h I → ı J → j U → u W → w Á → á Â → â Ä → ä Å → å É → é Ú → ú Ā → ā Č → č Ē → ē Ġ → ġ Ĥ → ĥ Ī → ī İ → i ı → i Ĵ → ĵ Ļ → ļ Ľ → ľ Ł → ł Ņ → ņ Ŝ → ŝ Ū → ū ℂ → c ℃ → c ℕ → n ℝ → r Ꞌ → ꞌ 𝐴 → a 𝐵 → b 𝐾 → k 𝑁 → n 𝑋 → x Affects: Azerbaijani [aze], Bashkir [bak], Berber [ber], Chamorro [cha], Chuvash [chv], Crimean Tatar [crh], Croatian [hrv], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Greek [ell], Hungarian [hun], Ido [ido], Italian [ita], Japanese [jpn], Kashmiri [kas], Kashubian [csb], Latin [lat], Latvian [lvs], Lojban [jbo], Lower Sorbian [dsb], Navajo [nav], Old East Slavic [orv], Ottoman Turkish [ota], Polish [pol], Portuguese [por], Russian [rus], Slovak [slk], Spanish [spa], Talysh [tly], Tatar [tat], Turkish [tur], Turkmen [tuk], Unknown Language, Upper Sorbian [hsb], Zaza [zza]

Ԑ → ԑ Affects: Kabyle [kab]

¨ → ̈ ´ → ́ ˙ → ̇ ˚ → ̊ Affects: Ancient Greek [grc], Berber [ber], Catalan [cat], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Greek [ell], Guarani [grn], Italian [ita], Low German (Low Saxon) [nds], Mandarin Chinese [cmn], Occitan [oci], Old Tupi [tpw], Portuguese [por], Slovak [slk], Spanish [spa], Turkish [tur], Ukrainian [ukr]

𑢩 → 𑣉 𑢮 → 𑣎 𑢯 → 𑣏 Affects: Ho [hoc]

ͅ → ι ΄ → ́ Ά → α Έ → ε Ή → ή Ί → ι Ό → ο Ύ → υ Ώ → ω ΐ → ϊ ά → α έ → ε ί → ι ς → σ ό → ο ύ → υ ώ → ω ἀ → α ἁ → α ἄ → α Ἀ → ἀ Ἄ → α Ἄ → ἄ Ἆ → ἆ ἐ → ε ἔ → ε ἕ → ε Ἐ → ἐ Ἑ → ἑ Ἓ → ε Ἓ → ἓ Ἔ → ε Ἔ → ἔ ἠ → η ἡ → η ἦ → ή Ἡ → ἡ Ἢ → ἢ Ἥ → ἥ Ἦ → ἦ ἰ → ι ἱ → ι ἶ → ι Ἰ → ἰ Ἱ → ἱ ὁ → ο ὅ → ο Ὀ → ὀ Ὁ → ο Ὁ → ὁ Ὃ → ὃ Ὄ → ὄ Ὅ → ὅ ὐ → υ ὔ → υ Ὑ → ὑ Ὕ → ὕ ὠ → ω ὡ → ω Ὡ → ὡ Ὤ → ὤ Ὦ → ὦ ὰ → α ὲ → ε έ → ε ὴ → ή ὶ → ι ί → ι ὸ → ο ὺ → υ ὼ → ω ώ → ω ᾶ → α ᾽ → ̓ ᾿ → ̓ ῆ → ή ῖ → ι ῦ → υ ῶ → ω ῾ → ̔ Affects: Ancient Greek [grc], Greek [ell], Portuguese [por]

Ա → ա Բ → բ Գ → գ Դ → դ Ե → ե Զ → զ Է → է Ը → ը Թ → թ Ժ → ժ Ի → ի Լ → լ Խ → խ Ծ → ծ Կ → կ Հ → հ Ձ → ձ Ղ → ղ Ճ → ճ Մ → մ Յ → յ Ն → ն Շ → շ Ո → ո Չ → չ Պ → պ Ջ → ջ Ս → ս Վ → վ Տ → տ Ց → ց Ւ → ւ Փ → փ Ք → ք Օ → օ Ֆ → ֆ Affects: Armenian [hye]

Ꭰ → ꭰ Ꭱ → ꭱ Ꭴ → ꭴ Ꭶ → ꭶ Ꭷ → ꭷ Ꭸ → ꭸ Ꭹ → ꭹ Ꭺ → ꭺ Ꭼ → ꭼ Ꭽ → ꭽ Ꭿ → ꭿ Ꮂ → ꮂ Ꮃ → ꮃ Ꮅ → ꮅ Ꮆ → ꮆ Ꮈ → ꮈ Ꮎ → ꮎ Ꮑ → ꮑ Ꮒ → ꮒ Ꮓ → ꮓ Ꮕ → ꮕ Ꮖ → ꮖ Ꮗ → ꮗ Ꮙ → ꮙ Ꮛ → ꮛ Ꮜ → ꮜ Ꮝ → ꮝ Ꮟ → ꮟ Ꮡ → ꮡ Ꮢ → ꮢ Ꮣ → ꮣ Ꮤ → ꮤ Ꮥ → ꮥ Ꮧ → ꮧ Ꮨ → ꮨ Ꮩ → ꮩ Ꮪ → ꮪ Ꮭ → ꮭ Ꮰ → ꮰ Ꮱ → ꮱ Ꮲ → ꮲ Ꮳ → ꮳ Ꮵ → ꮵ Ꮷ → ꮷ Ꮸ → ꮸ Ꮹ → ꮹ Ꮺ → ꮺ Ꮻ → ꮻ Ꮼ → ꮼ Ꮿ → ꮿ Ᏸ → ᏸ Ᏹ → ᏹ Ᏺ → ᏺ Ᏼ → ᏼ Affects: Cherokee [chr]

゜ → ゚ Affects: Japanese [jpn]

# Case Alternatives (multiple codepoints)

Č → č Ç → ç É → é Ó → ó Ǫ → ǫ Ṛ → ṛ ß → ss í → i̇́ İ → i̇ Ở → ở ẞ → ss Affects: Afrikaans [afr], Arabic [ara], Basque [eus], Bavarian [bar], Berber [ber], Cayuga [cay], Crimean Tatar [crh], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], Galician [glg], German [deu], Hebrew [heb], Hindi [hin], Hungarian [hun], Ido [ido], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Kölsch [ksh], Latin [lat], Lingala [lin], Lithuanian [lit], Low German (Low Saxon) [nds], Mandarin Chinese [cmn], Ottoman Turkish [ota], Polish [pol], Portuguese [por], Russian [rus], Slovenian [slv], Spanish [spa], Swabian [swg], Talossan [tzl], Talysh [tly], Tatar [tat], Toki Pona [toki], Turkish [tur], Venetian [vec], Vietnamese [vie], Zaza [zza]

ᾐ → ἠι ᾔ → ἤι ᾕ → ἥι ᾖ → ἦι ᾗ → ἧι ᾧ → ὧι ᾳ → αι ᾷ → αι ᾷ → ᾶι ῂ → ὴι ῃ → ηι ῄ → ήι ῇ → ῆι ῞ → ̔́ ῳ → ωι ῴ → ώι ῷ → ῶι Affects: Ancient Greek [grc], Greek [ell]

ﷺ → صلىاللهعليهوسلم Affects: Turkish [tur]

№ → no ™ → tm Affects: Belarusian [bel], Bulgarian [bul], English [eng], French [fra], Kazakh [kaz], Meadow Mari [mhr], Russian [rus], Spanish [spa], Tatar [tat]

¼ → 14 ½ → 12 ⅓ → 13 Affects: Danish [dan], English [eng], German [deu]

Mή → mη Ẹ̀ → ẹ̀ Άι → αϊ Άσ → ας Έί → εϊ Έι → εϊ Βή → βη Ζή → ζη Λή → λη Μή → μη Μῆ → μη Νή → νη Πή → πη Ρή → ρη Σή → ση Τὴ → τη Χή → χη Ψή → ψη άι → αϊ άσ → ας έι → εϊ έσ → ες έυ → εϋ ήµ → ημ ήι → ηϊ ίσ → ις όι → οϊ ύι → υϊ ώι → ωϊ ώσ → ως ᾷς → ᾶις ῃς → ηις ῇς → ῆις Affects: Ancient Greek [grc], Greek [ell], Yoruba [yor]
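Some of the multi-codepoint case mappings above can be reproduced with Python's built-in string methods, which may help when checking a pair by hand. This is only a sketch of Unicode's default case rules; `turkic_lower` is a hypothetical helper, since Python's `str.lower()` is not language-aware.

```python
def show(ch: str) -> None:
    """Print the lower-cased and case-folded forms of a character."""
    print(repr(ch), "lower:", repr(ch.lower()), "casefold:", repr(ch.casefold()))


# sharp s, capital sharp s, capital I with dot above, final sigma
for ch in ["\u00df", "\u1e9e", "\u0130", "\u03c2"]:
    show(ch)


def turkic_lower(text: str) -> str:
    """Hypothetical helper: lower-case using Turkish/Azerbaijani dotted and dotless i."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


print("I".lower())  # i (default rule, wrong for Turkish)
print(turkic_lower("ISPARTA \u0130STANBUL"))  # dotless i in the first word, dotted i in the second
```

Here `casefold()` turns both ß and ẞ into "ss" and final sigma into regular sigma, while `lower()` turns İ into "i" followed by a combining dot above (U+0307), which is the two-codepoint result listed above.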

# Other Mappings Currently in Use

There's the option to go even further in unifying characters. The substitutions below are made when you search in one of the languages listed.

J → i U → v W → v j → i u → v w → v á → a é → e í → i ó → o Ā → a ā → a Ē → e ē → e ĕ → e Ī → i ī → i ĭ → i ō → o Ū → v ū → v Affects: Latin [lat]

Ά → ά · → έ Έ → ή Ή → ί Ό → ό Ύ → ώ Affects: Greek [ell]

H → ' h → ' Affects: Lojban [jbo]

ם → מ ף → פ Affects: Hebrew [heb], Yiddish [yid]

ץ → צ Affects: English [eng], Hebrew [heb], Yiddish [yid]

ן → נ Affects: English [eng], Hebrew [heb], Ladino [lad], Old Aramaic [oar], Yiddish [yid]

ļ → Ľ Affects: Latvian [lvs], Lithuanian [lit], Livonian [liv], Unknown Language

Â → a â → a î → ı û → u Affects: Turkish [tur]

ņ → Ň Affects: English [eng], Esperanto [epo], French [fra], Italian [ita], Latvian [lvs], Lithuanian [lit], Livonian [liv], Portuguese [por], Unknown Language

È → è Affects: Yoruba [yor]

ň → ŉ Affects: Czech [ces], Romani [rom], Slovak [slk], Turkmen [tuk]

ł → Ń Affects: Bavarian [bar], Belarusian [bel], Berber [ber], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], German [deu], Hungarian [hun], Indonesian [ind], Italian [ita], Kashubian [csb], Lower Sorbian [dsb], Mandarin Chinese [cmn], Navajo [nav], Polish [pol], Portuguese [por], Slovak [slk], Spanish [spa], Upper Sorbian [hsb]

I → i Affects: Azerbaijani [aze]

İ → ı Affects: Azerbaijani [aze], Crimean Tatar [crh], Dutch [nld], English [eng], Ido [ido], Ottoman Turkish [ota], Talysh [tly], Tatar [tat], Venetian [vec], Zaza [zza]

ך → כ Affects: Hebrew [heb], Old Aramaic [oar], Yiddish [yid]

ń → Ņ Affects: Belarusian [bel], Berber [ber], English [eng], Esperanto [epo], German [deu], Hungarian [hun], Lower Sorbian [dsb], Polish [pol], Slovak [slk], Spanish [spa], Upper Sorbian [hsb], Wolof [wol], Yoruba [yor]

Ơ → ơ Affects: Vietnamese [vie]

ľ → Ŀ Affects: Czech [ces], Romani [rom], Slovak [slk], Veps [vep]

ĺ → Ļ Affects: Danish [dan], Hungarian [hun], Slovak [slk], Spanish [spa]

# Punctuation and Symbols

Although most punctuation and other symbols are ignored right now, the characters below are still searchable. In the case of the dollar sign $ this is likely intentional. Maybe other currency symbols should be searchable, too.

՛ ՜ ՝ ՞ ՟ ։ Affects: Armenian [hye]

〈〉【】〔〕〜 Affects: Japanese [jpn]

『 Affects: Cantonese [yue], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn]

「」 Affects: Ainu [ain], Cantonese [yue], Japanese [jpn], Korean [kor], Literary Chinese [lzh], Mandarin Chinese [cmn], Russian [rus], Shanghainese [wuu]

、 Affects: Ainu [ain], Cantonese [yue], Italian [ita], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn], Shanghainese [wuu], Spanish [spa]

၍ Affects: Burmese [mya]

· Affects: Greek [ell]

・ Affects: English [eng], Japanese [jpn]

$ Affects: Belarusian [bel], Bengali [ben], Berber [ber], Catalan [cat], CycL [cycl], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Estonian [est], Finnish [fin], French [fra], Georgian [kat], German [deu], Greek [ell], Hebrew [heb], Hindi [hin], Ilocano [ilo], Indonesian [ind], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Lingua Franca Nova [lfn], Maltese [mlt], Marathi [mar], Polish [pol], Portuguese [por], Romanian [ron], Russian [rus], Spanish [spa], Tagalog [tgl], Turkish [tur], Turkmen [tuk], Ukrainian [ukr]

_ Affects: Arabic [ara], Basque [eus], Belarusian [bel], Berber [ber], Bulgarian [bul], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], Georgian [kat], German [deu], Hungarian [hun], Italian [ita], Japanese [jpn], Kabyle [kab], Macedonian [mkd], Mandarin Chinese [cmn], Polish [pol], Portuguese [por], Russian [rus], Serbian [srp], Spanish [spa], Swedish [swe], Tatar [tat], Turkish [tur], Uyghur [uig]

《》 Affects: Cantonese [yue], Literary Chinese [lzh], Mandarin Chinese [cmn], Shanghainese [wuu]

' Affects: Lojban [jbo]

。 Affects: Ainu [ain], Bulgarian [bul], Cantonese [yue], Chavacano [cbk], Chinese (Jin) [cjy], Gan Chinese [gan], Hakka Chinese [hak], Irish [gle], Italian [ita], Japanese [jpn], Korean [kor], Literary Chinese [lzh], Lojban [jbo], Mandarin Chinese [cmn], Min Nan Chinese [nan], Shanghainese [wuu], Sumerian [sux], Xiang Chinese [hsn]

』 Affects: Ancient Greek [grc], Cantonese [yue], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn]

(IDEOGRAPHIC SPACE) Affects: Ainu [ain], English [eng], German [deu], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn], Turkish [tur]

# Other Unsearchable Characters

The characters below currently cannot be found by searching.

ؠ ً ٌ ٍ ْ ٕ ٖ ٗ ٘ ٚ ٛ ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ٰ ۜ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ Affects: Algerian Arabic [arq], Arabic [ara], Egyptian Arabic [arz], Gulf Arabic [afb], Iraqi Arabic [acm], Kashmiri [kas], Ottoman Turkish [ota], Persian [pes], Urdu [urd]

ë ï ð ñ ċ đ ė ģ œ š ǝ ɑ ɓ ɔ ɖ ɗ ə ɛ ɡ ɣ ɤ ɨ ɪ ɲ ɾ ʁ ʊ ʋ ʌ ʒ ʔ ᵹ Affects: Afrihili [afh], Algerian Arabic [arq], Arabic [ara], Azerbaijani [aze], Bambara [bam], Berber [ber], Catalan [cat], Cayuga [cay], Choctaw [cho], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Ewe [ewe], French [fra], Ga [gaa], Galician [glg], German [deu], Hausa [hau], Hebrew [heb], Hungarian [hun], Italian [ita], Kabyle [kab], Kalmyk [xal], Kazakh [kaz], Khmer [khm], Latin [lat], Lingala [lin], Marathi [mar], Ngeq [ngt], Old English [ang], Orizaba Nahuatl [nlv], Pulaar [fuc], Russian [rus], Spanish [spa], Tachawit [shy], Tahaggart Tamahaq [thv], Talysh [tly], Tarifit [rif], Tatar [tat], Turkish [tur]

ἂ ἃ ἅ ἣ ἳ ἴ ἵ ἷ ὓ ὖ ὗ ὢ ὣ ὥ ᾱ ῑ ῡ ῥ Affects: Ancient Greek [grc]

҃ ꙗ Affects: Old East Slavic [orv]

𒀀 𒀉 𒀊 𒀕 𒀖 𒀜 𒀝 𒀠 𒀪 𒀭 𒀲 𒀳 𒀴 𒀸 𒀾 𒁀 𒁄 𒁇 𒁉 𒁍 𒁕 𒁮 𒁯 𒁲 𒁳 𒁶 𒁹 𒁺 𒁻 𒁾 𒂊 𒂍 𒂗 𒂠 𒂦 𒂵 𒂷 𒂼 𒃮 𒃲 𒃶 𒃸 𒃻 𒄀 𒄄 𒄑 𒄘 𒄠 𒄢 𒄦 𒄨 𒄩 𒄭 𒄯 𒄰 𒄴 𒄷 𒄾 𒄿 𒅁 𒅅 𒅆 𒅇 𒅍 𒅎 𒅔 𒅗 𒅘 𒅥 𒅴 𒆕 𒆗 𒆜 𒆟 𒆠 𒆪 𒆬 𒆳 𒆷 𒇇 𒇉 𒇯 𒇳 𒇴 𒇷 𒇻 𒇽 𒈜 𒈝 𒈠 𒈣 𒈤 𒈧 𒈨 𒈪 𒈫 𒈬 𒈭 𒈾 𒉆 𒉈 𒉌 𒉘 𒉡 𒉪 𒉺 𒉽 𒉿 𒊏 𒊑 𒊒 𒊕 𒊩 𒊬 𒊭 𒊮 𒊷 𒋀 𒋃 𒋗 𒋛 𒋢 𒋤 𒋧 𒋫 𒋺 𒋻 𒋼 𒋾 𒌀 𒌅 𒌆 𒌇 𒌈 𒌉 𒌋 𒌌 𒌍 𒌒 𒌓 𒌝 𒌤 𒌦 𒌨 𒌶 𒌷 𒍂 𒍇 𒍜 𒍝 𒍠 𒍢 𒍣 𒍪 𒍼 𒐈 𒐊 𒐋 𒐼 𒑂 𒑄 𒑆 𒑏 Affects: Sumerian [sux], Unknown Language

֑ ֔ ֖ ֗ ֘ ֙ ֝ ֡ ֣ ֤ ֥ ֨ ֪ ֱ ֲ ֽ Affects: Hebrew [heb]

ៗ ៝ Affects: Central Mnong [cmo], Khmer [khm]

ຽ ໆ Affects: Lao [lao]

ʻ ʼ ʿ ˀ ˈ ˌ ː Affects: Ancient Greek [grc], Belarusian [bel], Breton [bre], Cayuga [cay], English [eng], Esperanto [epo], French [fra], German [deu], Hawaiian [haw], Hebrew [heb], Italian [ita], Kabyle [kab], Navajo [nav], Ngeq [ngt], Niuean [niu], Russian [rus], Samoan [smo], Spanish [spa], Tahitian [tah], Tongan [ton], Ukrainian [ukr], Uzbek [uzb]

ൺ ൻ ർ ൽ ൾ Affects: Malayalam [mal]

ᠠ ᠨ ᠩ ᠪ ᠮ ᠰ ᡝ ᡠ ᡤ ᡥ ᡩ ᡳ ᡵ Affects: Manchu [mnc]

𑣁 𑣂 𑣅 𑣈 𑣋 𑣌 𑣓 𑣖 𑣗 𑣘 𑣙 𑣜 Affects: Ho [hoc]

𐰀 𐰃 𐰆 𐰇 𐰉 𐰋 𐰍 𐰓 𐰕 𐰖 𐰘 𐰚 𐰞 𐰢 𐰣 𐰲 𐰸 𐰺 𐰼 𐰾 𐱃 𐱅 Affects: Old Turkish [otk]

𐌰 𐌱 𐌲 𐌳 𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍂 𐍃 𐍄 𐍅 𐍆 𐍈 𐍉 Affects: Gothic [got]

ꦁ ꦂ ꦃ ꦏ ꦒ ꦔ ꦕ ꦗ ꦚ ꦠ ꦡ ꦢ ꦣ ꦤ ꦥ ꦧ ꦩ ꦪ ꦫ ꦭ ꦮ ꦰ ꦱ ꦲ ꦴ ꦶ ꦸ ꦺ ꦼ ꧀ Affects: Javanese [jav]

ꀁ ꀃ ꀐ ꀕ ꁧ ꂘ ꂯ ꃀ ꆍ ꆏ ꆹ ꇩ ꇬ ꇿ ꈍ ꉡ ꉬ ꊿ ꋋ ꋙ ꋠ ꌕ ꍏ ꏃ ꐥ ꑋ ꑍ ꑬ Affects: Unknown Language

ㇰㇱㇷㇻㇼㇽㇾㇿ Affects: Ainu [ain]

︎ Affects: Japanese [jpn]

# Ignored Intentionally

Some of these characters are ignored intentionally. The difference from simply being unknown is that a word with ignored Hebrew vowel marks, e.g. "שִׁבְטְךָ֥", is treated as a single word, while a word with unknown Arabic vowel marks, e.g. "گیوٗر", is split into parts.
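A toy tokenizer makes the distinction visible. This is a sketch under simplifying assumptions, not the real search engine: the sets `KNOWN` and `IGNORED` and the function `tokenize` are invented for illustration, and only ASCII letters count as known.

```python
KNOWN = set("abcdefghijklmnopqrstuvwxyz")
IGNORED = {"\u00ad"}  # SOFT HYPHEN


def tokenize(text: str) -> list[str]:
    """Split text into words: ignored characters vanish, unknown ones separate words."""
    words: list[str] = []
    current: list[str] = []
    for ch in text.lower():
        if ch in IGNORED:
            continue  # dropped without ending the current word
        if ch in KNOWN:
            current.append(ch)
        elif current:
            # any other character is unknown and ends the current word
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(tokenize("co\u00adoperate"))  # ['cooperate']
print(tokenize("co\u2010operate"))  # ['co', 'operate'] (U+2010 HYPHEN is unknown here)
```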

(SOFT HYPHEN) ́ Affects: Latin [lat], Russian [rus]

(SOFT HYPHEN) ְ ֱ ֲ ֳ ִ ֵ ֶ ַ ָ ֹ ֺ ֻ ּ ֽ ־ ֿ ׀ ׁ ׂ ׃ ׄ ׅ ׇ Affects: All Languages [all]

AlanF_US October 4, 2019 at 3:50:52 PM UTC

Yorwba, it's wonderful that you took the time to look into this. Three requests:

(1) Please transfer this to an issue ticket in GitHub.
(2) When you do transfer it to an issue ticket, if you could come up with some way to list the (numeric) code points in addition to the characters, that would be great.
(3) Also, when you transfer it to an issue ticket, if you have some time, could you please do some analysis on how this would affect the sentences that are actually in our corpus? In many cases, I'm guessing that our users are actually not using any of the character variants you list.

Many thanks!

Yorwba October 4, 2019 at 4:23:22 PM UTC

(1) I was thinking about multiple issues for each script/group of characters where a problem is identified, but I guess one mega-issue with checkboxes to keep track of incremental progress would also work. FWIW I also plan to work on PRs for the more obvious cases as soon as my VM finishes provisioning, which should be any day now...

(2) Obviously I have all this data in a format that makes it easier to identify the specific codepoints involved. I didn't include that info here because it's probably not that relevant to the community at large.

(3) Each of those characters is used at least once in at least one of the languages listed on the same line. I can count the number of sentences in each case as well, if that's necessary for prioritization.

Yorwba October 5, 2019 at 12:04:42 AM UTC

GitHub Issue is here: https://github.com/Tatoeba/tatoeba2/issues/1970

Thanuir October 4, 2019 at 5:52:46 PM UTC

The different encodings should be treated as the same. I write Finnish with a non-native keyboard, so I combine ¨ and a to get ä, and it would be strange if this was not treated the same as the usual ä.

> ª → a º → o

To me, the first ones appear as superscripts and the second ones as ordinary letters. If so, I am not sure what the best way to treat these would be. Some examples would be helpful.

Numbers in exponents have a different meaning than other numbers in mathematics, but this is not terribly relevant for Tatoeba. The same goes for other superscripts and subscripts.

> ¼ → 14 ½ → 12 ⅓ → 13

Would not ¼ → 1/4 ½ → 1/2 ⅓ → 1/3 be better? If slash is recognized, that is.

The euro symbol should have the same status as the dollar symbol and other (internationally recognized) currency symbols.

Yorwba October 4, 2019 at 6:24:48 PM UTC

Maybe your keyboard is actually smart enough to directly combine the individual codepoints. However, some people are certainly entering them separately.

> ª → a º → o

Those are used in abbreviations, e.g. in Portuguese #1014908 . As for Finnish, it appears *someone* confused superscript o and 0 when writing #7992705 .

> Some examples would be helpful.

I would have liked to include some, but the post is bloated enough as it is.

> Would not ¼ → 1/4 ½ → 1/2 ⅓ → 1/3 be better? If slash is recognized, that is.

Yes. The method I used to come up with these pairings is not perfect, and I was hoping to get this kind of suggestion. Slash is not recognized, I think, but searching for "1/3" would still find "1/3" in a sentence. (As well as "1.3")

Thanuir October 4, 2019 at 6:27:49 PM UTC, edited October 4, 2019 at 6:31:21 PM UTC

(The power in that one sentence is copy-pasted from the English one, where I tried asking what the notation means and was told that it is a zero. I did not know it was the letter o. I'll fix that.

Or, rather, I would fix it if I could figure out how to write superscripts or how to copy-paste them without them turning into normal-sized numbers.)

Yorwba October 5, 2019 at 12:08:25 AM UTC

Hm, evidently you were able to copy-paste superscript characters before. Does the superscript zero from my post (² → 2 ³ → 3 ¹ → 1 ⁰ → 0 ⁸ → 8 ⁿ → n) work?

Thanuir October 5, 2019 at 8:19:20 AM UTC

That seems to work, yes. I edited the sentences and suggested the same change for some of the linked ones. Thanks.
