Thread #32112 - Tatoeba

Here's a list of ~2156 duplicates in the corpus (4362 sentences in total). They differ only in spacing or in roughly equivalent punctuation, which means Horus can't merge them.

If you own any of these sentences, please consider changing the spacing or the punctuation if it does not change the meaning (or correctness) of the sentence.

This is what the file looks like: #id1 #id2 lang sentence1 lang sentence2

https://github.com/Tatoeba/tato...0/_rep_v02.txt

related: https://github.com/Tatoeba/tatoeba2/issues/642
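
To illustrate what "differ only in spacing or roughly equivalent punctuation" can mean in practice, here is a minimal Python sketch. The mapping table and the function names `comparison_key` and `near_duplicates` are assumptions made for illustration; the thread does not say how the list was actually produced or which marks were treated as equivalent.

```python
import re
import unicodedata

# Assumption: an illustrative set of "roughly equivalent" characters.
# The real list behind the linked file is not given in this thread.
EQUIVALENT = {
    "\u00a0": " ",    # no-break space
    "\u202f": " ",    # narrow no-break space
    "\u2019": "'",    # right single quotation mark
    "\u2026": "...",  # horizontal ellipsis
}
_TABLE = str.maketrans(EQUIVALENT)


def comparison_key(sentence: str) -> str:
    """Return a key that ignores spacing and the mapped punctuation variants."""
    text = unicodedata.normalize("NFC", sentence).translate(_TABLE)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r"\s+", " ", text).strip()
    # Ignore the optional space before "high" punctuation (common in French).
    text = re.sub(r" ([?!:;\u00bb])", r"\1", text)
    # Ignore the optional space after an opening guillemet.
    text = re.sub(r"\u00ab ", "\u00ab", text)
    return text


def near_duplicates(a: str, b: str) -> bool:
    """True if the sentences differ as raw text but share a comparison key."""
    return a != b and comparison_key(a) == comparison_key(b)


print(near_duplicates("Quoi ?", "Quoi?"))
print(near_duplicates("Quoi?", "Quoi!"))
```

Under these assumptions, `Quoi ?` and `Quoi?` share a key, while `Quoi?` and `Quoi!` do not. Whether two marks really are interchangeable depends on the language, so a table like this would need review per language.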

And here's the number of duplicates by language

fra (1419)
rus (252)
epo (61)
tur (54)
bre (54)
ber (46)
deu (39)
ukr (26)
eng (25)
kab (25)
hun (17)
ara (17)
ita (16)
fin (16)
spa (7)
heb (6)
pol (6)
ell (6)
cor (6)
por (4)
lat (4)
bul (4)
tat (4)
eus (4)
thv (4)
mar (3)
srp (3)
pes (3)
oci (3)
jpn (2)
swe (2)
hin (2)
vie (2)
uig (2)
khm (2)
nld (1)
dan (1)
ina (1)
tlh (1)
afr (1)
slk (1)
urd (1)
ceb (1)
swg (1)
cym (1)
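
As a quick sanity check, the per-language figures above can be tallied. The sketch below simply re-types them into a dictionary (the variable names are arbitrary) and confirms that they add up to the ~2156 quoted at the top, with French accounting for roughly two thirds.

```python
# Per-language duplicate counts, copied from the list above.
counts = {
    "fra": 1419, "rus": 252, "epo": 61, "tur": 54, "bre": 54, "ber": 46,
    "deu": 39, "ukr": 26, "eng": 25, "kab": 25, "hun": 17, "ara": 17,
    "ita": 16, "fin": 16, "spa": 7, "heb": 6, "pol": 6, "ell": 6, "cor": 6,
    "por": 4, "lat": 4, "bul": 4, "tat": 4, "eus": 4, "thv": 4,
    "mar": 3, "srp": 3, "pes": 3, "oci": 3,
    "jpn": 2, "swe": 2, "hin": 2, "vie": 2, "uig": 2, "khm": 2,
    "nld": 1, "dan": 1, "ina": 1, "tlh": 1, "afr": 1, "slk": 1,
    "urd": 1, "ceb": 1, "swg": 1, "cym": 1,
}

total = sum(counts.values())
french_share = 100 * counts["fra"] / total

print(f"languages: {len(counts)}")
print(f"total duplicates: {total}")
print(f"French share: {french_share:.1f}%")
```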

Thanuir June 27, 2019 at 8:42:27 AM UTC

If a language has several ways of writing quotation marks or punctuation, I see no reason not to include all of those conventions in Tatoeba as well. On the other hand, usages that go against the norms of the language could be replaced with more correct ones.

A few Finnish sentences had quotation marks that violate the orthography; I corrected them or asked for them to be corrected. Others, however, exist because the language has three different ways of marking quotations or turns of speech. All of them are correct.

...

If the language has several different and correct ways of writing something - quotes in Finnish, space or no space before certain punctuation in French - then all of these should be fine in Tatoeba. On the other hand, incorrect use of punctuation should be corrected.

CK June 27, 2019 at 8:50:51 AM UTC, edited November 6, 2019 at 6:11:30 AM UTC

[not needed anymore- removed by CK]

Thanuir June 27, 2019 at 9:13:21 AM UTC

If the symbols look almost the same and are used in the same way, then it might not do too much damage to standardize them.

On the other hand, if the symbols (or the use of punctuation) look markedly different, or if the different uses follow national, socio-economic or other boundaries, then standardization is not a good idea.

If a punctuation mark has several different appearances, then knowing all of them is helpful.

If there is a regional or other difference between the conventions, then choosing one over the others will elevate one cultural practice over another. This is a highly political act and not something that should be done by Tatoeba.

For reference, here are the ways of quoting/expressing speech in Finnish, taken from https://www.kielikello.fi/-/lainausmerkit- :
”Tule mukaan!” hän pyysi.
»Tule mukaan!» hän pyysi.
– Tule mukaan! hän pyysi.

All three ways are visually distinct, and the third works differently from the others. The first and the third are frequently used. The second is found in literature, but is rare elsewhere, as far as I know.
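
Because the three conventions open with different characters, they are easy to tell apart mechanically, which is one way to flag them for human review instead of merging them. A minimal Python sketch, assuming the sentence starts with the quoted speech as in the examples above; the function name `finnish_quote_style` and the labels it returns are made up for illustration.

```python
def finnish_quote_style(sentence: str) -> str:
    """Classify a sentence by the character that opens the quoted speech."""
    text = sentence.lstrip()
    if text.startswith("”"):
        return "curly quotes"   # same curly mark opens and closes
    if text.startswith("»"):
        return "guillemets"     # same angle mark opens and closes
    if text.startswith("–"):
        return "dash"           # line of dialogue, no closing mark
    return "other"


examples = [
    "”Tule mukaan!” hän pyysi.",
    "»Tule mukaan!» hän pyysi.",
    "– Tule mukaan! hän pyysi.",
]
for example in examples:
    print(finnish_quote_style(example), "|", example)
```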

Aiji June 27, 2019 at 1:49:02 PM UTC

This question comes up often, but on closer thought it is actually not such a big issue.
7+ million sentences, 4000+ duplicates
That is not even 0.1% of the corpus.

Most of them are due to the French spaces. That discussion happened in the past. Numbers were even calculated and posted on GitHub. Even if each of them had 50 direct translations (and they don't), that would still represent only around two percent of the corpus.

In my opinion, it is not worth putting more effort into this than human screening. And as Thanuir mentioned, some of them cannot be de-duplicated for cultural reasons (among others).
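
The orders of magnitude above can be reproduced with a few lines of Python. This is only a rough sketch: the corpus size is taken as a flat 7,000,000 (a lower bound, since the post says "7+ million"), and it is an assumption whether the 50-translation scenario should be applied to the 2156 listed duplicates or to all 4362 sentences, so both are shown.

```python
CORPUS_SIZE = 7_000_000    # assumption: lower bound from "7+ million sentences"
LISTED_DUPLICATES = 2156   # figure from the first post
LISTED_SENTENCES = 4362    # figure from the first post


def percent_of_corpus(n: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Return n as a percentage of the corpus size."""
    return 100 * n / corpus_size


# Share of the corpus taken up by the listed sentences themselves.
print(f"listed sentences: {percent_of_corpus(LISTED_SENTENCES):.3f}% of corpus")

# Hypothetical upper bound: every item has 50 direct translations.
for label, base in (("duplicates", LISTED_DUPLICATES),
                    ("sentences", LISTED_SENTENCES)):
    share = percent_of_corpus(base * 50)
    print(f"50 translations per {label[:-1]}: {share:.1f}% of corpus")
```

With these inputs the listed sentences stay well under 0.1% of the corpus, and the 50-translation scenario lands at roughly 1.5% to 3%, depending on which count is used.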
