
Wall (5,553 threads)

MisterTrouser
7 days ago
Feature suggestion:
Mark sentences as "not directly linked".

Reason / Why does it help?
Short: Working together on finding directly linkable sentences gets enabled.

If someone wants to find indirectly linked sentences that could be directly linked, they can use this search:
https://tatoeba.org/eng/sentenc...io=&sort=words

They read through all 65 pages of sentences and conclude that none of the indirectly linked sentences can be directly linked.

Now a second user wants to do the same thing: directly link indirectly linked sentences. This second user will also have to read through all 65 pages of sentences, only to find out that there is nothing to do.
TRANG
3 days ago
CK opened a GitHub issue for this:
https://github.com/Tatoeba/tatoeba2/issues/1980
MisterTrouser
7 days ago
Bug report for linking sentences:
Reproduce:
(1) The following link shows Japanese sentences with indirect German links: https://tatoeba.org/eng/sentenc...io=&sort=words
Open it.
(2) Link an indirectly linked sentence.
(3) <sentence gets linked>
(4)
Expected: <Apart from the fact that the sentence is now linked, nothing should have changed.>
Actual: <All other languages for that sentence are also shown.>
TRANG
3 days ago
This has been reported already in GitHub:
https://github.com/Tatoeba/tatoeba2/issues/1783
AlanF_US
2019-08-23 02:42 - 2019-08-23 02:49
Search is finding sentences that do not contain the specified word. This was reported 24 days ago (see link 1 below), and @gillux made a change that he thought might have fixed the problem, but apparently it didn't. For instance, a general search for "Tom" in English (see link 2 below) brings up these sentences:

Damn.
#1078143

Beat it.
#37902

Damn you!
#1135061

We talked.
#2107672

How unfortunate!
#2111810

These sentences are owned by a variety of people, and the logs don't reveal anything that suggests that they ever contained the word "Tom". Nor do all the sentences contain a tag, or audio. However, they are all short. If I set the sort order to random or to longest sentences first, I don't see any false hits.

Could it be that the search engine has seen so many sentences with "Tom" that it now hallucinates them even in sentences that don't contain the word?

I reported this problem on GitHub as well:

https://github.com/Tatoeba/tatoeba2/issues/1944

Link 1: https://tatoeba.org/eng/wall/sh...#message_32265

Link 2: https://tatoeba.org/eng/sentenc...io=&sort=words

Thanuir
2019-08-23 10:43
I had some similar issues within a day.
deniko
2019-08-23 14:44
Yeah, seems like the same problem resurfaced.

"Where is the butter?"

https://i.imgur.com/846O0b5.png
TRANG
2019-08-23 21:07
I just searched again for "darn" as reported by brauchinet in the previous thread:
https://tatoeba.org/eng/sentenc...rom=eng&to=und

There are 12 results and 2 are incorrect:

#1202147 - I'm about to die.
#690834 - There are no comments for now.

Do you remember if those were the incorrect sentences when you also checked for this search?
(cf. https://tatoeba.org/eng/wall/sh...message_32269)
brauchinet
2019-08-24 06:25 - 2019-08-24 08:12
I'm sure that these were not the incorrect results I found the first time.

I noticed that when I use words from wrong search results I get wrong results again.
For example, deniko's "what the fuck" -> "Where is the butter?"

Take "butter":
https://tatoeba.org/eng/sentenc...rom=eng&to=und
Many errors and even "Get the fuck out!"

Edit:
I don’t know if this is of any help:
The “darn” search finds 12 results. I did a search using the downloadable data.
The ones not found by the search engine (that is: wrongly displayed) are:
#643316 Darn!
#1359478 Well I'll be darned!
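For anyone who wants to repeat this kind of check, here is a minimal sketch. It assumes the downloadable export is a tab-separated file with one sentence per line in the order id, language code, text. The file name `sentences.csv` and the helper name `find_word` are assumptions for illustration, not something stated in this thread. Matching on the start of a word is only a crude stand-in for whatever stemming the real search engine does.

```python
import re


def find_word(lines, word, lang="eng"):
    """Return (id, text) pairs whose text has a word starting with `word`.

    `lines` is any iterable of tab-separated rows: id, language code, text.
    """
    pattern = re.compile(r"\b" + re.escape(word), re.IGNORECASE)
    hits = []
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) != 3:
            continue  # skip malformed rows
        sentence_id, sentence_lang, text = parts
        if sentence_lang == lang and pattern.search(text):
            hits.append((int(sentence_id), text))
    return hits


# Tiny in-memory sample built from sentences quoted in this thread.
sample = [
    "643316\teng\tDarn!",
    "1359478\teng\tWell I'll be darned!",
    "1202147\teng\tI'm about to die.",
    "690834\teng\tThere are no comments for now.",
]
matches = find_word(sample, "darn")

# Against the real export (file name assumed):
# with open("sentences.csv", encoding="utf-8") as f:
#     matches = find_word(f, "darn")
```

Comparing the ids returned this way with the ids shown by the site search gives both the false hits and the missing sentences.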
TRANG
2019-08-23 21:09
To everyone: please report here every strange search result that you find. Posting the URL of the search is enough.

I really have no idea what is causing this right now. We need to find clues on how to reproduce it systematically.
cojiluc
2019-08-24 07:37
The search below produces a wrong result, since the searched term is nonexistent in the result.

https://tatoeba.org/eng/sentenc...ery=histrionic
brauchinet
2019-08-24 07:54
This search should have found:
#8110003 He became histrionic upon hearing the news.
AlanF_US
2019-08-24 19:19
Trang, I think that your complete reindexing ( https://github.com/Tatoeba/tato...ment-524564520 ) solved the problem. When I search for such words as "Tom", "darn", "histrionic", "away", or "butter", I only see good results now. Thank you!
TRANG
2019-08-24 19:28
Just a warning: this may solve the issue only for the short term. We have not identified the cause yet.

At least we know that the issue does not happen in the indexation of the main indexes. But something might be going wrong in the delta indexes, or during the merge of the delta indexes into the main indexes...

We will have to see if it happens again.
brauchinet
2019-08-26 11:45 - 2019-08-26 12:00
Sorry to say it has happened again:

https://tatoeba.org/eng/sentenc...&=Any+language

Most of (all of?) the wrong results are recently added sentences.

Search: The audience gasped
https://tatoeba.org/eng/sentenc...&=Any+language
A driver was blocking the intersection.

Search:
https://tatoeba.org/eng/sentenc...&=Any+language
This city suffers from gridlock.

and so on.
TRANG
2019-08-26 14:55
Thanks for reporting.

I updated the GitHub issue:
https://github.com/Tatoeba/tatoeba2/issues/1944

Still haven't identified the cause...
brauchinet
2019-08-26 18:09 - 2019-08-26 19:10
One more observation:

take a wrong search result (for example: totally lost):
https://tatoeba.org/eng/sentenc...rom=eng&to=und

#8130253 Tom read a book with his son.
go back 5 English sentences (5 times "previous" with language=eng)
and you get the correct result:
#8130246 Tom was totally lost.

I tried quite a few, it always worked.

(Well, it only works with the most recent sentences - it doesn't with older ones such as:
https://tatoeba.org/eng/sentenc...rom=eng&to=und
Kiev is the capital of Ukraine.)

AlanF_US
2019-08-27 15:20
Just curious: How did you figure that out?
brauchinet
2019-08-27 17:42
Starting from one incorrect result sentence (eg. Tom was totally lost) I got a chain of sentences:
#8130253 Tom read a book with his son
#8130259 He would always break his promise
#8130357 Tom was confused by what had happened
#8130381 Tom thought the attraction was mutual

Obviously the sentence numbers are increasing, but, to my disappointment, not by equal steps. This brought me to the idea that maybe only English sentences count.
soliloquist
2019-08-26 20:23
Searching for the word 'origami' yields #8130386 (Tom reviewed his notes.)

https://tatoeba.org/eng/sentenc...sort=relevance
AlanF_US
2019-08-26 20:33
That instance follows the "brauchinet rule": pressing the "previous" button five times with the language set to "eng" gets you back to a sentence that contains the search word (in this case, #8130378 ).
soliloquist
2019-08-26 20:40
I see. Thanks.
TRANG
2019-08-27 19:05
This search problem seems to have resolved itself. I do not see "Tom reviewed his notes" in the results...
soliloquist
2019-08-27 19:18
I added translations for that sentence after reporting the issue on the Wall. When I rechecked the search link after several minutes, the irrelevant sentence was gone. Could they be connected? Maybe translating a misindexed sentence triggered something.
brauchinet
7 days ago
This is because "?" is treated as a one-letter wildcard,
as in: https://tatoeba.org/eng/sentenc...rom=por&to=und
AlanF_US
6 days ago
Exactly. On this wiki page:

https://en.wiki.tatoeba.org/art...w/text-search#

you will find the following:

"Leave punctuation out of your search string. Most punctuation will be ignored, but a final exclamation mark (!) or question mark (?) will actually interfere with the search."
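A minimal sketch of that advice, assuming a cleanup step applied to the search string before it is submitted. The function name `clean_query` is hypothetical, and this is not a description of how the site itself processes queries.

```python
def clean_query(query: str) -> str:
    """Strip trailing '!' and '?' so the search engine does not interpret them."""
    return query.strip().rstrip("!?").rstrip()


cleaned = clean_query("Where is the butter?")
```

With this, "Where is the butter?" is searched as "Where is the butter", so the final "?" can no longer act as a one-letter wildcard.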
Thanuir
7 days ago
*Mathematics tag unification, experiences from doing it by hand*

TLDR: Tag maintenance is, at the moment, a grueling and thankless job. I recommend not doing it without further tools, except as the opportunity presents itself while going about your other business.

TLDR: Please add the "mathematics" tag to relevant sentences. Please do not add and please do remove the "maths" tag.

Background: I find translations of mathematics sentences helpful. At the moment, there are many untagged sentences, as well as the following tags (and probably others in other languages):

mathematics, 1676 sentences: https://tatoeba.org/deu/tags/sh..._with_tag/1938
maths, 432: https://tatoeba.org/deu/tags/sh...ces_with_tag/2
mathématique, 128: https://tatoeba.org/deu/tags/sh..._with_tag/5118
Mathematik, 70: https://tatoeba.org/deu/tags/sh..._with_tag/9199
matematik, 3: https://tatoeba.org/deu/tags/sh...with_tag/10262

I went through all the sentences with "maths" tag and added the "mathematics" tag. Probably I missed a few and mistyped a few. I did not remove the "maths" tag, as I do not have the power to do that.

Suggestion: Please use/add the "mathematics" tag on relevant sentences. The "maths" tag has exactly the same meaning, has fewer sentences (especially now), and is less regionally neutral; it seems to be used in the UK, whereas in North America "math" seems to be preferred. Having two different tags makes it trickier to find relevant sentences.
I also recommend adding the "mathematics" tag alongside any non-English tag, since the Tatoeba tag guidelines say that English tags should be used.

My workflow was to have a list of sentences with "maths" and not "mathematics", copy a sentence number, switch to browser, select the address bar, highlight the sentence number, replace with the new one, enter. Click on the text box for writing tags (possibly after moving the mouse, because the box is in different place based on how many and how long tags the sentence already has) and write "mathematics". Enter. While the page is loading, fetch the next sentence number. Sometimes translate the sentence or open linked sentences to also tag those.

This is very inefficient for a human to do. It requires using the mouse (or careful tabbing, I guess) in addition to the keyboard, and typing "mathematics" by hand each time, since the sentence numbers occupy the clipboard.

Some possibilities for messing with tags on a large number of sentences. I think any of these should require a conversation and consensus on the wall, pretty much.

1. Manual ad-hoc solution by admins. Not sustainable or a good idea.

2. Separate tag management user interface, available to users of high privilege, with possibility of declaring tags as synonyms, deleting tags, replacing a tag with another, etc. Would require a lot of work to implement, I guess (but do not really know).

3. A bot that can do the above. A user would operate it. Since tags move slowly, the bot could be run on a fixed, slow schedule, or just be operated "by hand" to do a given operation on tags. It should leave a comment so that misbehaviour would be easier to detect. I have no idea how demanding this would be to implement.

4. Integrating tags with the search engine more fully would make tag management a bit more convenient in terms of workflow, and also increase the value of tags. This would mean the ability to restrict a search to a tag or a set of tags, exclude tags, etc. No idea about the difficulty of implementation.

Suggestions 2 and 3 could be combined, I guess: a user interface could create a list of actions on tags, which the bot would then carry out at its own pace.
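The first step of the workflow above (finding sentences tagged "maths" but not "mathematics") and the "list of actions" idea can be sketched roughly as follows. Everything here is hypothetical: the in-memory `tags_by_sentence` dictionary stands in for whatever data a real tool would read, the sentence numbers are made up, and `TagAction` is an illustration only, not an existing Tatoeba API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagAction:
    """One queued operation: add `tag` to sentence `sentence_id`."""
    sentence_id: int
    tag: str


def plan_unification(tags_by_sentence, old_tag, new_tag):
    """List the actions needed so every sentence with `old_tag` also has `new_tag`."""
    return [
        TagAction(sentence_id, new_tag)
        for sentence_id, tags in sorted(tags_by_sentence.items())
        if old_tag in tags and new_tag not in tags
    ]


def apply_actions(tags_by_sentence, actions):
    """Apply queued actions to a copy of the data and return the copy."""
    result = {sid: set(tags) for sid, tags in tags_by_sentence.items()}
    for action in actions:
        result.setdefault(action.sentence_id, set()).add(action.tag)
    return result


# Made-up sentence numbers, for illustration only.
tags_by_sentence = {
    1: {"maths"},
    2: {"maths", "mathematics"},
    3: {"mathematics"},
}
actions = plan_unification(tags_by_sentence, "maths", "mathematics")
updated = apply_actions(tags_by_sentence, actions)
```

Keeping the planning step separate from the applying step means the queued list could be reviewed on the Wall before anything is changed.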
orion17
17 days ago
Hello everyone, I'm back!
It's been years since I became a member. I still wonder whether Tatoeba is now available on Android, because many people here use a smartphone more often than a computer these days. For me, the site's interface still isn't very user-friendly in a web browser, not to mention that the slow internet speed where I live is frustrating. I hope I can find a new way to keep contributing Indonesian and Sundanese sentences effortlessly.

Thanks.
Ricardo14
17 days ago
Hi there and welcome back!

Well, people here have worked on making Tatoeba run smoothly on smartphones. I do feel the difference.

Perhaps you want to give *Opera Mini* a try. It's the "quickest" one I've found.

As for apps, there's an *unofficial* one. Trang made that clear.

Hope I've helped :)
orion17
9 days ago
Thanks for replying :)
You're right. Opera Mini is so quick. However, some parts of the interface look different and the audio doesn't work there.
Ricardo14
9 days ago
Yes, that's a problem but I've got used to this. I mean, whenever I want to use the "basics" of Tatoeba, I use Opera Mini.

Firefox may work properly as well (though I haven't tested it that much yet!). :)
Guybrush88
8 days ago
My suggestion is to report it on the bug tracker with a few details (e.g. the OS version, the browser version, what happens): https://github.com/Tatoeba/tatoeba2/issues

That way the devs can fix the issue sooner or later.
AlanF_US
8 days ago - 7 days ago
In late August, Trang announced that we had received a grant for a responsive UI project. As she explained:

"'Responsive UI' means that the content of the website will adapt to the size of the screen. The end goal is to make Tatoeba easier to use from a mobile device."

Here's the thread:

https://tatoeba.org/eng/wall/sh...#message_32489
CK
8 days ago - 7 days ago
** Just for Fun **

http://bit.ly/rndengaudio

1. This will jump to a randomly-selected English sentence that had no linked translations on 2019-10-05.

2. When clicking the "Next Random Page" button, Chrome and Opera, at least on a Mac, will start playing the audio right away, often before the page has finished loading. Firefox didn't autoplay the audio for me.


Here are some variations.

http://bit.ly/rndengaudio2 (33,468 sentences with audio)
This does the same as above, but all the sentences have 16 or more direct links.

http://bit.ly/rndengaudio3 (27,735 sentences with audio)
This one has sentences with 11 to 15 direct links.

http://bit.ly/rndengaudio4 (9,692 sentences with audio)
Sentences over #8,000,000

http://bit.ly/rndengaudio5 (43,967 sentences with audio)
Sentences from #7,000,000 to #7,999,999


The one at the top of the page has 22,320 sentences.

The last two, which are limited to higher sentence numbers, are also limited to newer audio, so they may sound better.
CK
9 days ago
** List 907 now contains 728,618 sentences **

Perhaps you would like to browse it to see the latest additions, and perhaps add translations into your own language.

http://tatoeba.org/eng/sentence...s/show/907/und
Proofread Good English Sentences That CK Uses on His Projects

It's also available for use in the advanced search.

CK
8 days ago
I've also uploaded the audio for most of last week's new additions to List 907.
http://tatoeba.org/eng/sentence...s/show/907/und

The new sentences without audio are ones that would sound better with a female voice, or were ones that I recorded but threw away because the audio had problems.
Yorwba
10 days ago - 7 days ago
Dear fellow Tatoebans!

As you probably know, the search engine on Tatoeba treats some characters specially. For example, the search results for ="tom", ="Tom", ="ToM" and so on are the same, and searching for ="?" doesn't find anything.

However, recently it was discovered ( https://tatoeba.org/eng/wall/sh...#message_32566 ) that the German ß was not treated correctly. Since such a common character in the sixth-most used language was affected, I suspected that there might be similar problems with characters in other languages and decided to investigate. As it turns out, there are quite a lot!

Some background on the search engine's inner workings: when someone searches for a word like "ToM", a list of mappings T → t, o → o, M → m is used to transform it into "tom" and that is the word that will really be searched for. Any unknown characters are removed. Using this system, it is not possible to treat for example Latin "A", Greek "Α" and Cyrillic "А" as identical without also doing the same for their lower-case equivalents "a", "α" and "а". That would mean Greek search results for words using the Latin alphabet, which may not be desirable.

So when deciding which characters to make the same, it's necessary to make a tradeoff between finding more results with variant characters and still being able to distinguish distinct words that only differ in which variant they use. Fortunately, it's possible to make a different decision for each language.
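The mapping step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tables `DEFAULT_MAP` and `LANGUAGE_MAPS` and the function `normalize` are hypothetical and drastically shortened, and the real search engine configuration is not shown in this post.

```python
import string

# Hypothetical default table: ASCII letters map to their lower-case form.
DEFAULT_MAP = {c: c.lower() for c in string.ascii_letters}

# Hypothetical per-language additions: Greek capital alpha (U+0391) is only
# mapped to Greek small alpha (U+03B1) when searching in Greek.
LANGUAGE_MAPS = {
    "ell": {"\u0391": "\u03b1", "\u03b1": "\u03b1"},
}


def normalize(word: str, lang: str = "eng") -> str:
    """Map each character through the table; drop characters with no mapping."""
    table = {**DEFAULT_MAP, **LANGUAGE_MAPS.get(lang, {})}
    return "".join(table[c] for c in word if c in table)


plain = normalize("ToM")
with_unknown = normalize("To?M")
greek_in_greek = normalize("\u0391", "ell")
greek_in_english = normalize("\u0391", "eng")
```

Because the lookup is per character, mapping Greek "Α" to Latin "a" in the shared table would only be consistent if Greek "α" were mapped to "a" as well, which is exactly the tradeoff described above.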

Below I have listed characters for which changes may need to be made and the affected languages with sentences containing those characters. I hope everyone reading this can take some time looking at the languages they speak and consider what the best decision in each case is.

Disclaimer: I do not speak for the Tatoeba team and there's no guarantee anything will happen as a result of this post.

# Duplicate Encodings

For historical reasons, there are some identical characters that have multiple computer codes to represent them, but there shouldn't be any need to distinguish them.

ά → ά έ → έ ή → ή ί → ί ό → ό ύ → ύ ώ → ώ Affects: Ancient Greek [grc]

不 → 不 粒 → 粒 行 → 行 Affects: Cantonese [yue], Literary Chinese [lzh]
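Duplicates of this kind can be checked with Python's standard `unicodedata` module. The sketch below assumes that the pairs listed above are canonical duplicates such as U+1F71 / U+03AC and U+F967 / U+4E0D. The exact codepoints are an assumption, since the two sides of each pair look identical on screen.

```python
import unicodedata

# U+1F71 (GREEK SMALL LETTER ALPHA WITH OXIA) and U+03AC (GREEK SMALL LETTER
# ALPHA WITH TONOS) are canonically equivalent.
oxia = "\u1f71"
tonos = "\u03ac"

# U+F967 is a CJK compatibility ideograph that is canonically equivalent
# to the unified ideograph U+4E0D.
compat = "\uf967"
unified = "\u4e0d"

# NFC normalization replaces each duplicate with its preferred form.
same_greek = unicodedata.normalize("NFC", oxia) == tonos
same_cjk = unicodedata.normalize("NFC", compat) == unified
```

Both comparisons are true after normalization, although the raw strings differ.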



# Duplicate Encodings (multiple codepoints)

I'm listing characters involving multiple codepoints separately, because they require larger changes to Tatoeba's search engine. What is a codepoint? Think about it like the keys you press on a keyboard to type a character like "à": the key for the accent and the key for "a". So "à" consists of two codepoints. But there's also "à" with a single codepoint, like a keyboard with a special key to type "à" directly.
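The two ways of writing "à" can be made visible with Python's standard `unicodedata` module. This is just an illustration of the codepoint idea; it says nothing about how Tatoeba's search engine is configured.

```python
import unicodedata

decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT: two codepoints
precomposed = "\u00e0"   # LATIN SMALL LETTER A WITH GRAVE: one codepoint

lengths = (len(decomposed), len(precomposed))

# NFC composes the pair into the single codepoint; NFD splits it again.
composed = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
```

The two strings look the same but compare as different until one of them is normalized, which is why a search for one form can miss sentences typed in the other.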

à → à á → á â → â ã → ã ä → ä ả → ả å → å ạ → ạ ć → ć ĉ → ĉ ç → ç è → è é → é ê → ê ẹ → ẹ ę → ę ĝ → ĝ ḥ → ḥ ì → ì í → í ỉ → ỉ ị → ị ĵ → ĵ ň → ň ò → ò ó → ó õ → õ ö → ö ỏ → ỏ ọ → ọ ǫ → ǫ ṛ → ṛ ŝ → ŝ ṣ → ṣ ş → ş ṭ → ṭ ù → ù ú → ú ũ → ũ ŭ → ŭ ü → ü ủ → ủ ụ → ụ ý → ý ẓ → ẓ ầ → ầ ấ → ấ ẫ → ẫ ậ → ậ ề → ề ế → ế ễ → ễ ệ → ệ ố → ố ỗ → ỗ ổ → ổ ằ → ằ ắ → ắ ẳ → ẳ ặ → ặ ờ → ờ ớ → ớ ở → ở ợ → ợ ừ → ừ ứ → ứ ữ → ữ ử → ử ự → ự Affects: Berber [ber], Cayuga [cay], Esperanto [epo], Finnish [fin], French [fra], Hungarian [hun], Interlingue [ile], Italian [ita], Kabyle [kab], Lingala [lin], Navajo [nav], Russian [rus], Serbian [srp], Shuswap [shs], Spanish [spa], Swedish [swe], Tatar [tat], Turkish [tur], Turkmen [tuk], Vietnamese [vie], Yoruba [yor]

й → й Affects: Bashkir [bak]

آ → آ أ → أ ؤ → ؤ Affects: Arabic [ara], Persian [pes], Urdu [urd]

ऱ → ऱ क़ → क़ ख़ → ख़ ग़ → ग़ ज़ → ज़ ड़ → ड़ ढ़ → ढ़ फ़ → फ़ Affects: Garhwali [gbm], Hindi [hin], Marathi [mar]

ড় → ড় ঢ় → ঢ় য় → য় Affects: Assamese [asm], Bengali [ben]

ਸ਼ → ਸ਼ ਖ਼ → ਖ਼ ਗ਼ → ਗ਼ ਜ਼ → ਜ਼ ਫ਼ → ਫ਼ Affects: Punjabi (Eastern) [pan]

ோ → ோ Affects: Tamil [tam]

ೀ → ೀ ೊ → ೊ ೋ → ೋ ೇ → ೇ Affects: Kannada [kan]

ോ → ോ Affects: Malayalam [mal]

יִ → יִ ײַ → ײַ שׂ → שׂ אַ → אַ אָ → אָ וּ → וּ כּ → כּ פּ → פּ תּ → תּ בֿ → בֿ כֿ → כֿ פֿ → פֿ Affects: Hebrew [heb], Yiddish [yid]

ָֹ → ָֹ ְּ → ְּ ֳּ → ֳּ ִּ → ִּ ֵּ → ֵּ ֶּ → ֶּ ַּ → ַּ ָּ → ָּ ֹּ → ֹּ ֻּ → ֻּ ְׁ → ְׁ ִׁ → ִׁ ֶׁ → ֶׁ ַׁ → ַׁ ָׁ → ָׁ ֹׁ → ֹׁ ֻׁ → ֻׁ ְׂ → ְׂ ִׂ → ִׂ ֵׂ → ֵׂ ָׂ → ָׂ ֹׂ → ֹׂ َّ → َّ ُّ → ُّ ِّ → ِّ ़् → ़् ့် → ့် Affects: Algerian Arabic [arq], Arabic [ara], Burmese [mya], Hebrew [heb], Hindi [hin], North Levantine Arabic [apc], Persian [pes], Yiddish [yid]



# Near Duplicates

There are some characters which usually look slightly different, but can be used for the same purpose in many situations. The question is whether searching for them on Tatoeba is one of those situations.

ª → a º → o Affects: Danish [dan], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Interlingua [ina], Italian [ita], Japanese [jpn], Lingua Franca Nova [lfn], Portuguese [por], Russian [rus], Spanish [spa], Turkish [tur], Ukrainian [ukr]

² → 2 ³ → 3 ¹ → 1 ⁰ → 0 ⁸ → 8 ⁿ → n Affects: Basque [eus], Choctaw [cho], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Hebrew [heb], Hungarian [hun], Interlingua [ina], Irish [gle], Italian [ita], Japanese [jpn], Mandarin Chinese [cmn], Polish [pol], Portuguese [por], Russian [rus], Shanghainese [wuu], Spanish [spa], Turkish [tur], Ukrainian [ukr]

₁ → 1 ₂ → 2 ₙ → n Affects: Basque [eus], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Hungarian [hun], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Macedonian [mkd], Marathi [mar], Portuguese [por], Russian [rus], Spanish [spa], Turkish [tur], Ukrainian [ukr], Vietnamese [vie]

① → 1 ② → 2 Affects: Japanese [jpn]

𝑎 → a 𝑏 → b 𝑐 → c 𝑒 → e 𝑖 → i 𝑘 → k 𝑚 → m 𝑛 → n 𝑟 → r 𝑥 → x 𝑦 → y 𝘨 → g 𝜀 → ε 𝜋 → π Affects: Esperanto [epo], German [deu], Russian [rus], Spanish [spa]

ℎ → h Affects: German [deu]

ϕ → φ ℵ → א Affects: Ancient Greek [grc], German [deu]

ʰ → h ʷ → w ᵉ → e ᵗ → t ⵯ → ⵡ Affects: Berber [ber], English [eng], French [fra], Kabyle [kab], Khmer [khm], Ngeq [ngt], Waray [war]

ſ → s Affects: Middle French [frm]

ﮐ → ک ﺋ → ئ ﺎ → ا ﺣ → ح ﺹ → ص ﻊ → ع ﻋ → ع ﻞ → ل ﻠ → ل ﻣ → م ﻪ → ه Affects: Ottoman Turkish [ota]

⺟ → 母 ⼀ → 一 ⾯ → 面 ⾷ → 食 Affects: Min Nan Chinese [nan]
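Several of the near duplicates above correspond to Unicode compatibility mappings, which Python's standard `unicodedata` module exposes through NFKC normalization. The sketch below is only a way to explore the idea; the helper name `fold_compat` is made up, and NFKC applies every compatibility mapping at once, whereas the proposal here is to decide per language.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Replace compatibility characters with their plain equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


examples = {
    name: fold_compat(char)
    for name, char in {
        "superscript two": "\u00b2",
        "feminine ordinal": "\u00aa",
        "long s": "\u017f",
        "circled one": "\u2460",
        "planck constant": "\u210e",
    }.items()
}
```

In this sketch the superscript two becomes "2", the feminine ordinal indicator becomes "a", the long s becomes "s", the circled one becomes "1" and the Planck constant sign becomes "h".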



# Near Duplicates (multiple codepoints)


ij → ij և → եւ fi → fi ﻹ → لإ ﻻ → لا ﻼ → لا Affects: Arabic [ara], Armenian [hye], Dutch [nld], Irish [gle], Ottoman Turkish [ota]

㌔ → キロ ㌘ → グラム Affects: Japanese [jpn]

ำ → ํา Affects: Thai [tha]

ໜ → ຫນ ໝ → ຫມ Affects: Lao [lao]



# Case Alternatives

Some characters can look very different when changing between upper case and lower case, but they should probably still be treated the same when searching.

H → h I → ı J → j U → u W → w Á → á Â → â Ä → ä Å → å É → é Ú → ú Ā → ā Č → č Ē → ē Ġ → ġ Ĥ → ĥ Ī → ī İ → i ı → i Ĵ → ĵ Ļ → ļ Ľ → ľ Ł → ł Ņ → ņ Ŝ → ŝ Ū → ū ℂ → c ℃ → c ℕ → n ℝ → r Ꞌ → ꞌ 𝐴 → a 𝐵 → b 𝐾 → k 𝑁 → n 𝑋 → x Affects: Azerbaijani [aze], Bashkir [bak], Berber [ber], Chamorro [cha], Chuvash [chv], Crimean Tatar [crh], Croatian [hrv], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Greek [ell], Hungarian [hun], Ido [ido], Italian [ita], Japanese [jpn], Kashmiri [kas], Kashubian [csb], Latin [lat], Latvian [lvs], Lojban [jbo], Lower Sorbian [dsb], Navajo [nav], Old East Slavic [orv], Ottoman Turkish [ota], Polish [pol], Portuguese [por], Russian [rus], Slovak [slk], Spanish [spa], Talysh [tly], Tatar [tat], Turkish [tur], Turkmen [tuk], Unknown Language, Upper Sorbian [hsb], Zaza [zza]

Ԑ → ԑ Affects: Kabyle [kab]

¨ → ̈ ´ → ́ ˙ → ̇ ˚ → ̊ Affects: Ancient Greek [grc], Berber [ber], Catalan [cat], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], German [deu], Greek [ell], Guarani [grn], Italian [ita], Low German (Low Saxon) [nds], Mandarin Chinese [cmn], Occitan [oci], Old Tupi [tpw], Portuguese [por], Slovak [slk], Spanish [spa], Turkish [tur], Ukrainian [ukr]

𑢩 → 𑣉 𑢮 → 𑣎 𑢯 → 𑣏 Affects: Ho [hoc]

ͅ → ι ΄ → ́ Ά → α Έ → ε Ή → ή Ί → ι Ό → ο Ύ → υ Ώ → ω ΐ → ϊ ά → α έ → ε ί → ι ς → σ ό → ο ύ → υ ώ → ω ἀ → α ἁ → α ἄ → α Ἀ → ἀ Ἄ → α Ἄ → ἄ Ἆ → ἆ ἐ → ε ἔ → ε ἕ → ε Ἐ → ἐ Ἑ → ἑ Ἓ → ε Ἓ → ἓ Ἔ → ε Ἔ → ἔ ἠ → η ἡ → η ἦ → ή Ἡ → ἡ Ἢ → ἢ Ἥ → ἥ Ἦ → ἦ ἰ → ι ἱ → ι ἶ → ι Ἰ → ἰ Ἱ → ἱ ὁ → ο ὅ → ο Ὀ → ὀ Ὁ → ο Ὁ → ὁ Ὃ → ὃ Ὄ → ὄ Ὅ → ὅ ὐ → υ ὔ → υ Ὑ → ὑ Ὕ → ὕ ὠ → ω ὡ → ω Ὡ → ὡ Ὤ → ὤ Ὦ → ὦ ὰ → α ὲ → ε έ → ε ὴ → ή ὶ → ι ί → ι ὸ → ο ὺ → υ ὼ → ω ώ → ω ᾶ → α ᾽ → ̓ ᾿ → ̓ ῆ → ή ῖ → ι ῦ → υ ῶ → ω ῾ → ̔ Affects: Ancient Greek [grc], Greek [ell], Portuguese [por]

Ա → ա Բ → բ Գ → գ Դ → դ Ե → ե Զ → զ Է → է Ը → ը Թ → թ Ժ → ժ Ի → ի Լ → լ Խ → խ Ծ → ծ Կ → կ Հ → հ Ձ → ձ Ղ → ղ Ճ → ճ Մ → մ Յ → յ Ն → ն Շ → շ Ո → ո Չ → չ Պ → պ Ջ → ջ Ս → ս Վ → վ Տ → տ Ց → ց Ւ → ւ Փ → փ Ք → ք Օ → օ Ֆ → ֆ Affects: Armenian [hye]

Ꭰ → ꭰ Ꭱ → ꭱ Ꭴ → ꭴ Ꭶ → ꭶ Ꭷ → ꭷ Ꭸ → ꭸ Ꭹ → ꭹ Ꭺ → ꭺ Ꭼ → ꭼ Ꭽ → ꭽ Ꭿ → ꭿ Ꮂ → ꮂ Ꮃ → ꮃ Ꮅ → ꮅ Ꮆ → ꮆ Ꮈ → ꮈ Ꮎ → ꮎ Ꮑ → ꮑ Ꮒ → ꮒ Ꮓ → ꮓ Ꮕ → ꮕ Ꮖ → ꮖ Ꮗ → ꮗ Ꮙ → ꮙ Ꮛ → ꮛ Ꮜ → ꮜ Ꮝ → ꮝ Ꮟ → ꮟ Ꮡ → ꮡ Ꮢ → ꮢ Ꮣ → ꮣ Ꮤ → ꮤ Ꮥ → ꮥ Ꮧ → ꮧ Ꮨ → ꮨ Ꮩ → ꮩ Ꮪ → ꮪ Ꮭ → ꮭ Ꮰ → ꮰ Ꮱ → ꮱ Ꮲ → ꮲ Ꮳ → ꮳ Ꮵ → ꮵ Ꮷ → ꮷ Ꮸ → ꮸ Ꮹ → ꮹ Ꮺ → ꮺ Ꮻ → ꮻ Ꮼ → ꮼ Ꮿ → ꮿ Ᏸ → ᏸ Ᏹ → ᏹ Ᏺ → ᏺ Ᏼ → ᏼ Affects: Cherokee [chr]

゜ → ゚ Affects: Japanese [jpn]
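The dotted and dotless i pairs show why case mapping may need to depend on the language. A minimal sketch, assuming a hypothetical helper `lower_for` and assuming (for illustration only) that Turkish and Azerbaijani are the languages that get the special rule:

```python
# Languages assumed, for this sketch, to use dotted/dotless i casing.
DOTLESS_I_LANGS = {"tur", "aze"}


def lower_for(text: str, lang: str) -> str:
    """Lower-case `text`, pairing I with dotless i and dotted capital I with i
    for the languages listed in DOTLESS_I_LANGS."""
    if lang in DOTLESS_I_LANGS:
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


turkish_capital_i = lower_for("I", "tur")        # dotless i, U+0131
turkish_dotted_i = lower_for("\u0130", "tur")    # plain "i"
english_capital_i = lower_for("I", "eng")        # plain "i"
```

Without the language-specific step, Python's built-in `str.lower()` turns "I" into "i" everywhere and turns "İ" (U+0130) into "i" followed by a combining dot above.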



# Case Alternatives (multiple codepoints)


Č → č Ç → ç É → é Ó → ó Ǫ → ǫ Ṛ → ṛ ß → ss í → i̇́ İ → i̇ Ở → ở ẞ → ss Affects: Afrikaans [afr], Arabic [ara], Basque [eus], Bavarian [bar], Berber [ber], Cayuga [cay], Crimean Tatar [crh], Czech [ces], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], Galician [glg], German [deu], Hebrew [heb], Hindi [hin], Hungarian [hun], Ido [ido], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Kölsch [ksh], Latin [lat], Lingala [lin], Lithuanian [lit], Low German (Low Saxon) [nds], Mandarin Chinese [cmn], Ottoman Turkish [ota], Polish [pol], Portuguese [por], Russian [rus], Slovenian [slv], Spanish [spa], Swabian [swg], Talossan [tzl], Talysh [tly], Tatar [tat], Toki Pona [toki], Turkish [tur], Venetian [vec], Vietnamese [vie], Zaza [zza]

ᾐ → ἠι ᾔ → ἤι ᾕ → ἥι ᾖ → ἦι ᾗ → ἧι ᾧ → ὧι ᾳ → αι ᾷ → αι ᾷ → ᾶι ῂ → ὴι ῃ → ηι ῄ → ήι ῇ → ῆι ῞ → ̔́ ῳ → ωι ῴ → ώι ῷ → ῶι Affects: Ancient Greek [grc], Greek [ell]

ﷺ → صلىاللهعليهوسلم Affects: Turkish [tur]

№ → no ™ → tm Affects: Belarusian [bel], Bulgarian [bul], English [eng], French [fra], Kazakh [kaz], Meadow Mari [mhr], Russian [rus], Spanish [spa], Tatar [tat]

¼ → 14 ½ → 12 ⅓ → 13 Affects: Danish [dan], English [eng], German [deu]

Mή → mη Ẹ̀ → ẹ̀ Άι → αϊ Άσ → ας Έί → εϊ Έι → εϊ Βή → βη Ζή → ζη Λή → λη Μή → μη Μῆ → μη Νή → νη Πή → πη Ρή → ρη Σή → ση Τὴ → τη Χή → χη Ψή → ψη άι → αϊ άσ → ας έι → εϊ έσ → ες έυ → εϋ ήµ → ημ ήι → ηϊ ίσ → ις όι → οϊ ύι → υϊ ώι → ωϊ ώσ → ως ᾷς → ᾶις ῃς → ηις ῇς → ῆις Affects: Ancient Greek [grc], Greek [ell], Yoruba [yor]
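The "ß → ss" case is an example of a mapping where one character becomes two. Python's built-in `str.casefold()` implements Unicode full case folding and shows the effect. It is used here only as an illustration, not as a description of what the search engine does.

```python
# U+00DF is "ß" and U+1E9E is the capital "ẞ"; both fold to "ss".
folded_small = "Stra\u00dfe".casefold()
folded_capital = "STRA\u1e9eE".casefold()
folded_plain = "STRASSE".casefold()

# str.lower() leaves "ß" alone, so it is not enough for this purpose.
lowered_only = "Stra\u00dfe".lower()
```

All three folded strings come out as "strasse", so a search using this kind of folding would treat the spellings as the same word.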



# Other Mappings Currently in Use

There's the option to go even further in unifying characters. The substitutions below are made when you search in one of the languages listed.

J → i U → v W → v j → i u → v w → v á → a é → e í → i ó → o Ā → a ā → a Ē → e ē → e ĕ → e Ī → i ī → i ĭ → i ō → o Ū → v ū → v Affects: Latin [lat]

Ά → ά · → έ Έ → ή Ή → ί Ό → ό Ύ → ώ Affects: Greek [ell]

H → ' h → ' Affects: Lojban [jbo]

ם → מ ף → פ Affects: Hebrew [heb], Yiddish [yid]

ץ → צ Affects: English [eng], Hebrew [heb], Yiddish [yid]

ן → נ Affects: English [eng], Hebrew [heb], Ladino [lad], Old Aramaic [oar], Yiddish [yid]

ļ → Ľ Affects: Latvian [lvs], Lithuanian [lit], Livonian [liv], Unknown Language

 → a â → a î → ı û → u Affects: Turkish [tur]

ņ → Ň Affects: English [eng], Esperanto [epo], French [fra], Italian [ita], Latvian [lvs], Lithuanian [lit], Livonian [liv], Portuguese [por], Unknown Language

È → è Affects: Yoruba [yor]

ň → ʼn Affects: Czech [ces], Romani [rom], Slovak [slk], Turkmen [tuk]

ł → Ń Affects: Bavarian [bar], Belarusian [bel], Berber [ber], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], German [deu], Hungarian [hun], Indonesian [ind], Italian [ita], Kashubian [csb], Lower Sorbian [dsb], Mandarin Chinese [cmn], Navajo [nav], Polish [pol], Portuguese [por], Slovak [slk], Spanish [spa], Upper Sorbian [hsb]

I → i Affects: Azerbaijani [aze]

İ → ı Affects: Azerbaijani [aze], Crimean Tatar [crh], Dutch [nld], English [eng], Ido [ido], Ottoman Turkish [ota], Talysh [tly], Tatar [tat], Venetian [vec], Zaza [zza]

ך → כ Affects: Hebrew [heb], Old Aramaic [oar], Yiddish [yid]

ń → Ņ Affects: Belarusian [bel], Berber [ber], English [eng], Esperanto [epo], German [deu], Hungarian [hun], Lower Sorbian [dsb], Polish [pol], Slovak [slk], Spanish [spa], Upper Sorbian [hsb], Wolof [wol], Yoruba [yor]

Ơ → ơ Affects: Vietnamese [vie]

ľ → Ŀ Affects: Czech [ces], Romani [rom], Slovak [slk], Veps [vep]

ĺ → Ļ Affects: Danish [dan], Hungarian [hun], Slovak [slk], Spanish [spa]



# Punctuation and Symbols

Although most punctuation and other symbols are ignored right now, the characters below are still searchable. In the case of the dollar sign $ this is likely intentional. Maybe other currency symbols should be searchable, too.

՛ ՜ ՝ ՞ ՟ ։ Affects: Armenian [hye]

〈 〉 【 】 〔 〕 〜 Affects: Japanese [jpn]

『 Affects: Cantonese [yue], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn]

「 」 Affects: Ainu [ain], Cantonese [yue], Japanese [jpn], Korean [kor], Literary Chinese [lzh], Mandarin Chinese [cmn], Russian [rus], Shanghainese [wuu]

、 Affects: Ainu [ain], Cantonese [yue], Italian [ita], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn], Shanghainese [wuu], Spanish [spa]

၍ Affects: Burmese [mya]

· Affects: Greek [ell]

・ Affects: English [eng], Japanese [jpn]

$ Affects: Belarusian [bel], Bengali [ben], Berber [ber], Catalan [cat], CycL [cycl], Danish [dan], Dutch [nld], English [eng], Esperanto [epo], Estonian [est], Finnish [fin], French [fra], Georgian [kat], German [deu], Greek [ell], Hebrew [heb], Hindi [hin], Ilocano [ilo], Indonesian [ind], Interlingua [ina], Italian [ita], Japanese [jpn], Kabyle [kab], Lingua Franca Nova [lfn], Maltese [mlt], Marathi [mar], Polish [pol], Portuguese [por], Romanian [ron], Russian [rus], Spanish [spa], Tagalog [tgl], Turkish [tur], Turkmen [tuk], Ukrainian [ukr]

_ Affects: Arabic [ara], Basque [eus], Belarusian [bel], Berber [ber], Bulgarian [bul], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Finnish [fin], French [fra], Georgian [kat], German [deu], Hungarian [hun], Italian [ita], Japanese [jpn], Kabyle [kab], Macedonian [mkd], Mandarin Chinese [cmn], Polish [pol], Portuguese [por], Russian [rus], Serbian [srp], Spanish [spa], Swedish [swe], Tatar [tat], Turkish [tur], Uyghur [uig]

《 》 Affects: Cantonese [yue], Literary Chinese [lzh], Mandarin Chinese [cmn], Shanghainese [wuu]

' Affects: Lojban [jbo]

。 Affects: Ainu [ain], Bulgarian [bul], Cantonese [yue], Chavacano [cbk], Chinese (Jin) [cjy], Gan Chinese [gan], Hakka Chinese [hak], Irish [gle], Italian [ita], Japanese [jpn], Korean [kor], Literary Chinese [lzh], Lojban [jbo], Mandarin Chinese [cmn], Min Nan Chinese [nan], Shanghainese [wuu], Sumerian [sux], Xiang Chinese [hsn]

』 Affects: Ancient Greek [grc], Cantonese [yue], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn]

(IDEOGRAPHIC SPACE) Affects: Ainu [ain], English [eng], German [deu], Japanese [jpn], Literary Chinese [lzh], Mandarin Chinese [cmn], Turkish [tur]



# Other Unsearchable Characters

The characters below currently cannot be found by searching.

ؠ ً ٌ ٍ ْ ٕ ٖ ٗ ٘ ٚ ٛ ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ٰ ۜ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ Affects: Algerian Arabic [arq], Arabic [ara], Egyptian Arabic [arz], Gulf Arabic [afb], Iraqi Arabic [acm], Kashmiri [kas], Ottoman Turkish [ota], Persian [pes], Urdu [urd]

ë ï ð ñ ċ đ ė ģ œ š ǝ ɑ ɓ ɔ ɖ ɗ ə ɛ ɡ ɣ ɤ ɨ ɪ ɲ ɾ ʁ ʊ ʋ ʌ ʒ ʔ ᵹ Affects: Afrihili [afh], Algerian Arabic [arq], Arabic [ara], Azerbaijani [aze], Bambara [bam], Berber [ber], Catalan [cat], Cayuga [cay], Choctaw [cho], Czech [ces], Dutch [nld], English [eng], Esperanto [epo], Ewe [ewe], French [fra], Ga [gaa], Galician [glg], German [deu], Hausa [hau], Hebrew [heb], Hungarian [hun], Italian [ita], Kabyle [kab], Kalmyk [xal], Kazakh [kaz], Khmer [khm], Latin [lat], Lingala [lin], Marathi [mar], Ngeq [ngt], Old English [ang], Orizaba Nahuatl [nlv], Pulaar [fuc], Russian [rus], Spanish [spa], Tachawit [shy], Tahaggart Tamahaq [thv], Talysh [tly], Tarifit [rif], Tatar [tat], Turkish [tur]

ἂ ἃ ἅ ἣ ἳ ἴ ἵ ἷ ὓ ὖ ὗ ὢ ὣ ὥ ᾱ ῑ ῡ ῥ Affects: Ancient Greek [grc]

҃ ꙗ Affects: Old East Slavic [orv]

𒀀 𒀉 𒀊 𒀕 𒀖 𒀜 𒀝 𒀠 𒀪 𒀭 𒀲 𒀳 𒀴 𒀸 𒀾 𒁀 𒁄 𒁇 𒁉 𒁍 𒁕 𒁮 𒁯 𒁲 𒁳 𒁶 𒁹 𒁺 𒁻 𒁾 𒂊 𒂍 𒂗 𒂠 𒂦 𒂵 𒂷 𒂼 𒃮 𒃲 𒃶 𒃸 𒃻 𒄀 𒄄 𒄑 𒄘 𒄠 𒄢 𒄦 𒄨 𒄩 𒄭 𒄯 𒄰 𒄴 𒄷 𒄾 𒄿 𒅁 𒅅 𒅆 𒅇 𒅍 𒅎 𒅔 𒅗 𒅘 𒅥 𒅴 𒆕 𒆗 𒆜 𒆟 𒆠 𒆪 𒆬 𒆳 𒆷 𒇇 𒇉 𒇯 𒇳 𒇴 𒇷 𒇻 𒇽 𒈜 𒈝 𒈠 𒈣 𒈤 𒈧 𒈨 𒈪 𒈫 𒈬 𒈭 𒈾 𒉆 𒉈 𒉌 𒉘 𒉡 𒉪 𒉺 𒉽 𒉿 𒊏 𒊑 𒊒 𒊕 𒊩 𒊬 𒊭 𒊮 𒊷 𒋀 𒋃 𒋗 𒋛 𒋢 𒋤 𒋧 𒋫 𒋺 𒋻 𒋼 𒋾 𒌀 𒌅 𒌆 𒌇 𒌈 𒌉 𒌋 𒌌 𒌍 𒌒 𒌓 𒌝 𒌤 𒌦 𒌨 𒌶 𒌷 𒍂 𒍇 𒍜 𒍝 𒍠 𒍢 𒍣 𒍪 𒍼 𒐈 𒐊 𒐋 𒐼 𒑂 𒑄 𒑆 𒑏 Affects: Sumerian [sux], Unknown Language

֑ ֔ ֖ ֗ ֘ ֙ ֝ ֡ ֣ ֤ ֥ ֨ ֪ ֱ ֲ ֽ Affects: Hebrew [heb]

ៗ ៝ Affects: Central Mnong [cmo], Khmer [khm]

ຽ ໆ Affects: Lao [lao]

ʻ ʼ ʿ ˀ ˈ ˌ ː Affects: Ancient Greek [grc], Belarusian [bel], Breton [bre], Cayuga [cay], English [eng], Esperanto [epo], French [fra], German [deu], Hawaiian [haw], Hebrew [heb], Italian [ita], Kabyle [kab], Navajo [nav], Ngeq [ngt], Niuean [niu], Russian [rus], Samoan [smo], Spanish [spa], Tahitian [tah], Tongan [ton], Ukrainian [ukr], Uzbek [uzb]

ൺ ൻ ർ ൽ ൾ Affects: Malayalam [mal]

ᠠ ᠨ ᠩ ᠪ ᠮ ᠰ ᡝ ᡠ ᡤ ᡥ ᡩ ᡳ ᡵ Affects: Manchu [mnc]

𑣁 𑣂 𑣅 𑣈 𑣋 𑣌 𑣓 𑣖 𑣗 𑣘 𑣙 𑣜 Affects: Ho [hoc]

𐰀 𐰃 𐰆 𐰇 𐰉 𐰋 𐰍 𐰓 𐰕 𐰖 𐰘 𐰚 𐰞 𐰢 𐰣 𐰲 𐰸 𐰺 𐰼 𐰾 𐱃 𐱅 Affects: Old Turkish [otk]

𐌰 𐌱 𐌲 𐌳 𐌴 𐌵 𐌶 𐌷 𐌸 𐌹 𐌺 𐌻 𐌼 𐌽 𐌾 𐌿 𐍀 𐍂 𐍃 𐍄 𐍅 𐍆 𐍈 𐍉 Affects: Gothic [got]

ꦁ ꦂ ꦃ ꦏ ꦒ ꦔ ꦕ ꦗ ꦚ ꦠ ꦡ ꦢ ꦣ ꦤ ꦥ ꦧ ꦩ ꦪ ꦫ ꦭ ꦮ ꦰ ꦱ ꦲ ꦴ ꦶ ꦸ ꦺ ꦼ ꧀ Affects: Javanese [jav]

ꀁ ꀃ ꀐ ꀕ ꁧ ꂘ ꂯ ꃀ ꆍ ꆏ ꆹ ꇩ ꇬ ꇿ ꈍ ꉡ ꉬ ꊿ ꋋ ꋙ ꋠ ꌕ ꍏ ꏃ ꐥ ꑋ ꑍ ꑬ Affects: Unknown Language

ㇰ ㇱ ㇷ ㇻ ㇼ ㇽ ㇾ ㇿ Affects: Ainu [ain]

︎ Affects: Japanese [jpn]



# Ignored Intentionally

Some of these characters are ignored intentionally. The difference from simply being unknown is that a word containing ignored characters, such as the Hebrew vowel marks in "שִׁבְטְךָ֥", is treated as a single word, while a word containing unknown characters, such as the Arabic vowel mark in "گیوٗر", is split into parts.
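To make the ignored/unknown distinction concrete, here is a minimal sketch of a tokenizer that drops ignored characters and treats every unknown character as a word separator. It is only an illustration: the function name `tokenize`, the character sets, and the sample words are assumptions made for this example, not the actual search configuration.

```python
def tokenize(text, known, ignored):
    """Split text into words: ignored chars vanish, unknown chars separate."""
    words, current = [], []
    for ch in text:
        if ch in ignored:
            continue                      # dropped; the current word continues
        if ch in known:
            current.append(ch)
        elif current:                     # unknown char ends the current word
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# Hypothetical character classes, chosen only for this illustration.
HEBREW_LETTERS = {chr(c) for c in range(0x05D0, 0x05EB)}
HEBREW_POINTS = {chr(c) for c in range(0x05B0, 0x05C8)}     # treated as ignored
ARABIC_LETTERS = {"\u06AF", "\u06CC", "\u0648", "\u0631"}   # only the letters used below

known = HEBREW_LETTERS | ARABIC_LETTERS

# A Hebrew word with vowel points, and Arabic-script letters with U+0657 in the middle.
hebrew = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
arabic = "\u06AF\u06CC\u0648\u0657\u0631"

hebrew_tokens = tokenize(hebrew, known, HEBREW_POINTS)   # one word, points removed
arabic_tokens = tokenize(arabic, known, HEBREW_POINTS)   # split at the unknown mark
```

In this sketch the Hebrew word comes out as a single token without its points, while the Arabic-script word is cut in two at the unknown mark, so a search for the whole word would not match it.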

(SOFT HYPHEN) ́ Affects: Latin [lat], Russian [rus]

(SOFT HYPHEN) ְ ֱ ֲ ֳ ִ ֵ ֶ ַ ָ ֹ ֺ ֻ ּ ֽ ־ ֿ ׀ ׁ ׂ ׃ ׄ ׅ ׇ Affects: All Languages [all]
AlanF_US
10 days ago
Yorwba, it's wonderful that you took the time to look into this. Three requests:

(1) Please transfer this to an issue ticket in GitHub.
(2) When you do transfer it to an issue ticket, if you could come up with some way to list the (numeric) code points in addition to the characters, that would be great.
(3) Also, when you transfer it to an issue ticket, if you have some time, could you please do some analysis on how this would affect the sentences that are actually in our corpus? In many cases, I'm guessing that our users are actually not using any of the character variants you list.

Many thanks!
Yorwba
10 days ago
(1) I was thinking about separate issues, one for each script/group of characters where a problem is identified, but I guess one mega-issue with checkboxes to keep track of incremental progress would also work. FWIW I also plan to work on PRs for the more obvious cases as soon as my VM finishes provisioning, which should be any day now...

(2) Obviously I have all this data in a format that makes it easier to identify the specific codepoints involved. I didn't include that info here because it's probably not that relevant to the community at large.

(3) Each of those characters is used at least once in at least one of the languages listed on the same line. I can count the number of sentences in each case as well, if that's necessary for prioritization.
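For points (2) and (3), a rough sketch of how code points and per-character sentence counts could be produced with Python's standard `unicodedata` module. The helper names `describe` and `sentences_per_character` and the sample sentences are made up for illustration; this is not the script used to generate the lists above.

```python
import unicodedata
from collections import Counter

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {name}"

def sentences_per_character(sentences, characters):
    """Count how many sentences contain each character at least once."""
    wanted = set(characters)
    counts = Counter()
    for sentence in sentences:
        for ch in set(sentence) & wanted:
            counts[ch] += 1
    return counts

# Made-up sample sentences, not taken from the corpus.
sample = ["x\u00B2 + y\u00B2", "n\u00B2", "plain text"]
counts = sentences_per_character(sample, "\u00B2\u00B3")
report = [(describe(ch), counts[ch]) for ch in "\u00B2\u00B3"]
```

A sentence that uses a character several times is counted once, which is probably the more useful number for prioritization.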
Yorwba
9 days ago
Thanuir
10 days ago
The different encodings should be treated as the same. I write Finnish with a non-native keyboard, so I combine ¨ and a to get ä, and it would be strange if this was not treated the same as the usual ä.


> ª → a º → o

To me, the first character in each pair appears as a superscript and the second as a normal letter. If so, I am not sure what the best way to treat these would be. Some examples would be helpful.

Numbers in powers have a different meaning than other numbers in mathematics, but this is not terribly relevant for Tatoeba. The same goes for other superscripts and subscripts.


> ¼ → 14 ½ → 12 ⅓ → 13

Would not ¼ → 1/4 ½ → 1/2 ⅓ → 1/3 be better? If slash is recognized, that is.


The euro symbol should have the same status as the dollar symbol and other (internationally recognized) currency symbols.
Yorwba
10 days ago
Maybe your keyboard is actually smart enough to directly combine the individual codepoints. However, some people are certainly entering them separately.
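As an illustration of what "entering them separately" means, the sketch below compares "a" followed by a combining diaeresis with the precomposed "ä". It uses Python's standard `unicodedata.normalize`; whether the search backend applies any such normalization is not stated here, so treat this purely as a demonstration of the two encodings.

```python
import unicodedata

decomposed = "a\u0308"    # LATIN SMALL LETTER A + COMBINING DIAERESIS
precomposed = "\u00E4"    # LATIN SMALL LETTER A WITH DIAERESIS

# Compared code point by code point, the two strings differ.
same_raw = decomposed == precomposed

# After NFC normalization, both become the single precomposed code point.
same_nfc = (unicodedata.normalize("NFC", decomposed)
            == unicodedata.normalize("NFC", precomposed))
```

Here `same_raw` is `False` and `same_nfc` is `True`, which is why the two forms look identical on screen but can behave differently in a search that does not normalize.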

> ª → a º → o

Those are used in abbreviations, e.g. in Portuguese #1014908 . As for Finnish, it appears *someone* confused superscript o and 0 when writing #7992705 .

> Some examples would be helpful.

I would have liked to include some, but the post is bloated enough as it is.

> Would not ¼ → 1/4 ½ → 1/2 ⅓ → 1/3 be better? If slash is recognized, that is.

Yes. The method I used to come up with these pairings is not perfect, and I was hoping to get this kind of suggestion. Slash is not recognized, I think, but searching for "1/3" would still find "1/3" in a sentence. (As well as "1.3")
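For comparison, here is what Unicode compatibility normalization (NFKC) gives for some of these characters, using Python's standard `unicodedata` module. This is not necessarily the method behind the pairings quoted above; the helper names `compat` and `code_points` are just for this example.

```python
import unicodedata

def compat(text):
    """Apply Unicode compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)

def code_points(text):
    """List the code points of a string as 'U+XXXX'."""
    return [f"U+{ord(ch):04X}" for ch in text]

quarter = compat("\u00BC")        # VULGAR FRACTION ONE QUARTER
quarter_points = code_points(quarter)

ordinals = (compat("\u00AA"), compat("\u00BA"))     # feminine / masculine ordinal indicators
superscripts = compat("\u00B2\u00B3\u00B9\u2070\u207F")
```

Note that the fraction decomposes to "1", U+2044 FRACTION SLASH, "4". That is a different character from the ASCII slash U+002F, so a mapping to "1/4" would need one extra replacement step.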
Thanuir
10 days ago - 10 days ago
(The power in that one sentence is copy-pasted from the English one, where I tried asking what the notation means and was told that it is a zero. I did not know it was the letter o. I'll fix that.

Or, rather, I would fix it if I could figure out how to write superscripts or how to copy-paste them without them turning into normal sized numbers.)
Yorwba
9 days ago
Hm, evidently you were able to copy-paste superscript characters before. Does the superscript zero from my post (² → 2 ³ → 3 ¹ → 1 ⁰ → 0 ⁸ → 8 ⁿ → n) work?
Thanuir
9 days ago
That seems to work, yes. I edited the sentences and suggested the same change for some of the linked ones. Thanks.
mramosch
10 days ago
Sentence #8226052

R.I.P.
marafon
10 days ago
Seael
10 days ago
CK
10 days ago - 10 days ago
Today's Sentence Stats Compared with about 2 Years Ago.

https://imgur.com/ByIq54y

These are screenshots of the top of this page.
https://tatoeba.org/eng/stats/s...es_by_language