Here's a list of ~2156 duplicates in the corpus (4362 sentences in total). They differ only in spacing or in roughly equivalent punctuation, which means Horus can't merge them.
If you own any of these sentences, please consider changing the spacing or the punctuation if it does not change the meaning (or correctness) of the sentence.
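To illustrate what "differ only in spacing or equivalent punctuation" can mean in practice, here is a minimal sketch of a loose comparison key. The character mapping and the `loose_key` helper are my own assumptions for illustration; this is not how Horus works, and it is meant only to flag candidates for a human to review, not to rewrite sentences automatically.

```python
import re

# Hypothetical table of "roughly equivalent" characters (an assumption,
# not the rules used by Horus or by the linked list).
EQUIVALENTS = str.maketrans({
    "\u00a0": " ",   # no-break space
    "\u202f": " ",   # narrow no-break space
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def loose_key(sentence: str) -> str:
    """Return a key that ignores spacing and quote-style differences."""
    text = sentence.translate(EQUIVALENTS)
    # Drop whitespace before ? ! ; : (French-style spacing).
    text = re.sub(r"\s+([?!;:])", r"\1", text)
    # Collapse remaining runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

variants = ["Où es-tu ?", "Où es-tu\u00a0?", "Où es-tu?"]
print(len({loose_key(v) for v in variants}))  # 1: all three share one key
```

Two sentences with the same key but different text are candidates for this kind of list. Whether they should actually be changed is still a judgment call for the owner.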
This is what the file looks like: #id1 #id2 lang sentence1 lang sentence2
https://github.com/Tatoeba/tato...0/_rep_v02.txt
related: https://github.com/Tatoeba/tatoeba2/issues/642
And here's the number of duplicates by language:
fra (1419)
rus (252)
epo (61)
tur (54)
bre (54)
ber (46)
deu (39)
ukr (26)
eng (25)
kab (25)
hun (17)
ara (17)
ita (16)
fin (16)
spa (7)
heb (6)
pol (6)
ell (6)
cor (6)
por (4)
lat (4)
bul (4)
tat (4)
eus (4)
thv (4)
mar (3)
srp (3)
pes (3)
oci (3)
jpn (2)
swe (2)
hin (2)
vie (2)
uig (2)
khm (2)
nld (1)
dan (1)
ina (1)
tlh (1)
afr (1)
slk (1)
urd (1)
ceb (1)
swg (1)
cym (1)
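A per-language count like the one above could be produced along these lines. This is only a sketch: I am assuming the six fields (#id1 #id2 lang sentence1 lang sentence2) are tab-separated, which the post does not state, and the sample lines and ids are invented for the example.

```python
from collections import Counter

def count_by_language(lines):
    """Count entries per language, assuming six tab-separated fields."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip lines that do not match the assumed layout
        _id1, _id2, lang1, _s1, _lang2, _s2 = fields
        counts[lang1] += 1
    return counts

# Invented sample data in the assumed layout.
sample = [
    "#1\t#2\tfra\tOù es-tu ?\tfra\tOù es-tu?",
    "#3\t#4\tfra\tQui est là ?\tfra\tQui est là?",
    "#5\t#6\teng\tWho is there ?\teng\tWho is there?",
]

for lang, n in count_by_language(sample).most_common():
    print(f"{lang} ({n})")
```

For the real file, the same function could be fed an open file object instead of the `sample` list.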
If a language has several ways of writing quotation marks or punctuation, I see no reason not to include all of these conventions in Tatoeba as well. On the other hand, usages that go against the norms of the language could be replaced with more correct ones.
A few Finnish sentences had quotation marks that violated the orthography; I corrected them or asked for them to be corrected. Some of the others, however, stem from the fact that the language has three different ways of marking quotations or turns of speech. They are all correct.
...
If the language has several different and correct ways of writing something - quotes in Finnish, space or no space before certain punctuation in French - then all of these should be fine in Tatoeba. On the other hand, incorrect use of punctuation should be corrected.
If the symbols look almost the same and are used in the same way, then it might not do too much damage to standardize them.
On the other hand, if the symbols (or the use of punctuation) look markedly different, or if the different uses follow national, socio-economic or other boundaries, then standardization is not a good idea.
If there are various different appearances, then knowing all of them is helpful.
If there is a regional or other difference between the conventions, then choosing one over the others will elevate one cultural practice over another. This is a highly political act and not something that should be done by Tatoeba.
For reference, here are the ways of quoting/expressing speech in Finnish, taken from https://www.kielikello.fi/-/lainausmerkit- :
”Tule mukaan!” hän pyysi.
»Tule mukaan!» hän pyysi.
– Tule mukaan! hän pyysi.
All three ways are visually distinct and the third way works differently than the others. The first and the third are frequently used. The second is met in literature, but is rare elsewhere, as far as I know.
This is a common question, but on closer thought it is actually not such a big issue.
7+ million sentences, 4000+ duplicates
That is not even 0.1% of the corpus.
Most of them are due to the French spaces. That discussion happened in the past. Numbers were even calculated and posted on GitHub. Even if each of them had 50 direct translations (and they don't), that would still represent only around two to three percent of the corpus.
In my opinion, it is not worth putting more effort than human screening to address this point. And as Thanuir mentioned, some of them cannot be de-duplicated for cultural reasons (among others).
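The rough figures in this post can be checked with a few lines of arithmetic. The inputs are the round numbers quoted above (7 million as a lower bound for the corpus, 4000 duplicates, and the deliberately generous 50 translations each); the variable names are mine.

```python
corpus_size = 7_000_000    # "7+ million sentences", taken as a lower bound
duplicates = 4_000         # "4000+ duplicates"
translations_each = 50     # generous upper-bound assumption from the post

# Share of the corpus made up of the duplicates themselves.
share = duplicates / corpus_size

# Share if every duplicate also had 50 direct translations.
worst_case = duplicates * translations_each / corpus_size

print(f"duplicates alone: {share:.3%}")
print(f"with 50 translations each: {worst_case:.1%}")
```

With these round inputs the duplicates alone come to about 0.06% of the corpus, and the exaggerated worst case stays under 3%.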