MacGyver, June 27, 2019 at 7:15:33 AM UTC (edited June 27, 2019 at 9:23:20 AM UTC)

Here's a list of about 2156 duplicates in the corpus (4362 sentences in total). They differ only in spacing or in some roughly equivalent punctuation, which means Horus can't merge them.
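To illustrate what "differ only in spacing or equivalent punctuation" can mean in practice, here is a minimal sketch of a comparison key. It is not how Horus works and not how the list was produced; the function name `near_duplicate_key` and the table of look-alike characters are assumptions made for the example.

```python
import re
import unicodedata

# Illustrative mapping only: the post does not say which characters
# count as "equivalent", so this table is an assumption.
EQUIVALENT_CHARS = str.maketrans({
    "\u00a0": " ",   # no-break space
    "\u202f": " ",   # narrow no-break space
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def near_duplicate_key(sentence: str) -> str:
    """Return a key that is equal for sentences differing only in spacing
    or in the look-alike characters listed above."""
    text = unicodedata.normalize("NFC", sentence).translate(EQUIVALENT_CHARS)
    text = re.sub(r"\s+([?!:;])", r"\1", text)  # drop space before ? ! : ;
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

# Two spellings of the same French sentence compare equal under this key.
print(near_duplicate_key("Tu viens ?") == near_duplicate_key("Tu viens?"))
```

A key like this is only useful for finding candidates to review by hand. It should not be used to rewrite sentences, since some of the variants are legitimate conventions.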

If you own any of these sentences, please consider changing the spacing or the punctuation if it does not change the meaning (or correctness) of the sentence.

This is what the file looks like: #id1 #id2 lang sentence1 lang sentence2

https://github.com/Tatoeba/tato...0/_rep_v02.txt

related: https://github.com/Tatoeba/tatoeba2/issues/642

And here's the number of duplicates by language:

fra (1419)
rus (252)
epo (61)
tur (54)
bre (54)
ber (46)
deu (39)
ukr (26)
eng (25)
kab (25)
hun (17)
ara (17)
ita (16)
fin (16)
spa (7)
heb (6)
pol (6)
ell (6)
cor (6)
por (4)
lat (4)
bul (4)
tat (4)
eus (4)
thv (4)
mar (3)
srp (3)
pes (3)
oci (3)
jpn (2)
swe (2)
hin (2)
vie (2)
uig (2)
khm (2)
nld (1)
dan (1)
ina (1)
tlh (1)
afr (1)
slk (1)
urd (1)
ceb (1)
swg (1)
cym (1)
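For anyone who wants to reproduce a per-language tally like the one above, here is a rough sketch. It assumes the file is tab-separated with exactly the six fields described in the post (`#id1 #id2 lang sentence1 lang sentence2`); the delimiter is a guess, and the sample rows below are invented for the example rather than taken from the real file.

```python
from collections import Counter

def count_by_language(lines):
    """Count rows per language, using the first language column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip lines that do not match the assumed layout
        _id1, _id2, lang1, _sentence1, _lang2, _sentence2 = fields
        counts[lang1] += 1
    return counts

def format_counts(counts):
    """Format as 'lang (n)', most frequent language first."""
    return [f"{lang} ({n})" for lang, n in counts.most_common()]

# Invented sample rows, only to show the assumed layout.
sample = [
    "#1\t#2\tfra\tTu viens ?\tfra\tTu viens?\n",
    "#3\t#4\tfra\tOui !\tfra\tOui!\n",
    "#5\t#6\teng\tHello , world.\teng\tHello, world.\n",
    "this line does not match the layout\n",
]

for row in format_counts(count_by_language(sample)):
    print(row)
```

With the real file, `lines` would be the open file object instead of `sample`.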

Thanuir, June 27, 2019 at 8:42:27 AM UTC

If a language has several ways of writing quotation marks or punctuation, I see no reason not to include all of those conventions in Tatoeba as well. On the other hand, usages that go against the norms of the language could be replaced with more correct ones.

A few of the Finnish sentences had quotation marks that violate the spelling rules; I corrected them or asked for them to be corrected. Others, however, are there because the language has three different ways of marking quotations or turns of speech. All of them are correct.

...

If the language has several different and correct ways of writing something - quotes in Finnish, space or no space before certain punctuation in French - then all of these should be fine in Tatoeba. On the other hand, incorrect use of punctuation should be corrected.

CK, June 27, 2019 at 8:50:51 AM UTC (edited November 6, 2019 at 6:11:30 AM UTC)

[not needed anymore- removed by CK]

Thanuir, June 27, 2019 at 9:13:21 AM UTC

If the symbols look almost the same and are used in the same way, then it might not do too much damage to standardize them.

On the other hand, if the symbols (or the use of punctuation) look markedly different, or if the different uses follow national, socio-economic or other boundaries, then standardization is not a good idea.

If there are various different appearances, then knowing all of them is helpful.

If there is a regional or other difference between the conventions, then choosing one over the others will elevate one cultural practice over another. This is a highly political act and not something that should be done by Tatoeba.

For reference, here are the ways of quoting or expressing speech in Finnish, taken from https://www.kielikello.fi/-/lainausmerkit- :
”Tule mukaan!” hän pyysi.
»Tule mukaan!» hän pyysi.
– Tule mukaan! hän pyysi.

All three ways are visually distinct and the third way works differently than the others. The first and the third are frequently used. The second is met in literature, but is rare elsewhere, as far as I know.

Aiji, June 27, 2019 at 1:49:02 PM UTC

It is a common question, but on reflection it is actually not such a big issue.
7+ million sentences, 4000+ duplicates
That is not even 0.1% of the corpus.
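As a quick check of that figure, here is the arithmetic using the 4362 sentences from the original post and a corpus size rounded down to 7 million (the real corpus is described only as "7+ million", so the true share is somewhat smaller):

```python
# Figures from the thread; the corpus size is a rounded-down assumption.
corpus_size = 7_000_000
duplicate_sentences = 4362

share = duplicate_sentences / corpus_size
print(f"{share:.3%}")  # share of the corpus, as a percentage
```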

Most of them are due to French spacing. That discussion has come up before, and numbers were even calculated and posted on GitHub. Even if each of them had 50 direct translations (and they don't), that would still represent only around two percent of the corpus.

In my opinion, it is not worth putting in more effort than human screening to address this point. And as Thanuir mentioned, some of them cannot be de-duplicated for cultural reasons (among others).