Thread #20656 - Tatoeba

I believe some people take great pains to use the proper spacing character, for example a non-breaking space around % in German, so forcing a plain space might be the wrong approach. However, maybe the merging script could report sentences that differ only by a single letter? Then an admin can decide what to do with them.
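
To make the suggestion concrete, here is a minimal sketch of such a report. It is not Tatoeba's merging script; the function names are hypothetical, and "differ by a single letter" is assumed to mean exactly one substituted, inserted, or deleted character.

```python
from itertools import combinations


def differs_by_one_char(a: str, b: str) -> bool:
    """True if a and b are exactly one substitution, insertion, or deletion apart."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must be identical.
        return a[i + 1:] == b[i + 1:]
    # Insertion/deletion: skip one character of the longer string.
    return a[i:] == b[i + 1:]


def report_near_duplicates(sentences):
    """Hypothetical report: all pairs an admin might want to review."""
    return [(a, b) for a, b in combinations(sentences, 2)
            if differs_by_one_char(a, b)]
```

A pair such as "10 %" written once with a plain space and once with U+00A0 would be reported rather than merged, leaving the decision to a person. Comparing all pairs is quadratic, so a real corpus would need some pre-grouping first.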

cvbge November 10, 2014 at 12:05:13 AM UTC

This is an interesting point. I didn't even know about the % sign rules in other languages.

Differing by one letter is too narrow a solution. What about sentences differing by two letters?

Maybe normalization rules should be defined strictly per language and not employed until confirmed to be correct.
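
One way to picture "strictly per language and not employed until confirmed" is a rule table with a confirmation flag. This is only a sketch: the `Rule` class, the rule names, the ISO 639-3 style language codes, and which rules are marked confirmed are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    apply: Callable[[str], str]
    confirmed: bool = False  # rules stay inactive until someone confirms them


# Hypothetical rule table; the confirmed flags are illustrative only.
RULES = {
    "deu": [Rule("nbsp-before-percent",
                 lambda s: s.replace("\u00a0%", " %"))],
    "fra": [Rule("nbsp-before-double-punctuation",
                 lambda s: re.sub("\u00a0([:;?!])", r" \1", s),
                 confirmed=True)],
}


def normalize(sentence: str, lang: str) -> str:
    """Apply only the confirmed rules defined for this language."""
    for rule in RULES.get(lang, []):
        if rule.confirmed:
            sentence = rule.apply(sentence)
    return sentence
```

A language with no entry, or with only unconfirmed rules, passes through unchanged, which is the safe default being argued for here.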

neron November 10, 2014 at 1:45:23 AM UTC

It is also always possible to normalize only for the sake of comparing sentences, without carrying that normalization over into the final version of the sentence. For example, always use a plain space in comparisons, but if a sentence uses a non-breaking space, leave it (as a principle, let the more complex variant be the one that survives deduplication, and use the simpler one only as the normalized form for comparison). But it can get messy; that is, it could bring more problems. It is a hard job.
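
As a rough sketch of that idea (not an existing Tatoeba procedure): sentences are grouped by a normalized comparison key, but the stored text is never rewritten, and the variant with the most special spaces survives. The set of space characters folded here and the function names are assumptions.

```python
import re
from collections import defaultdict

# Assumption: an illustrative, incomplete set of space variants
# (no-break space, narrow no-break space, thin space).
SPECIAL_SPACES = re.compile("[\u00a0\u202f\u2009]")


def comparison_key(sentence: str) -> str:
    """Normalized form used only for comparing, never stored."""
    return SPECIAL_SPACES.sub(" ", sentence)


def pick_survivor(group):
    """Keep the 'more complex' variant: the one with the most special spaces.
    Ties go to the sentence seen first."""
    return max(group, key=lambda s: len(SPECIAL_SPACES.findall(s)))


def deduplicate(sentences):
    groups = defaultdict(list)
    for s in sentences:
        groups[comparison_key(s)].append(s)
    return [pick_survivor(g) for g in groups.values()]
```

For example, `deduplicate(["Das sind 10 %.", "Das sind 10\u00a0%."])` keeps only the variant with the non-breaking space.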

gillux November 10, 2014 at 4:17:33 PM UTC

That would be the best approach in my humble opinion. What kind of problems are you thinking about?

neron November 18, 2014 at 4:33:23 PM UTC

As usual, finding a balance is the key. What if we get several duplicates, but each one has a slightly different issue? Which one is a good representative of that group of probable duplicates? Another thing: what if one starts from the wrong idea that, as a normalization prototype, we could remove all punctuation and then compare things (we can't), causing a huge number of falsely identified duplicates... So the question is what we can treat as a duplicate for certain, and what we can't. If only the "space" were the only problem here; still, I guess it would be a good start (and, I guess, the various sorts of hyphens, multiple consecutive signs like !!!, etc.)...
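
The contrast can be sketched like this. Both functions are hypothetical illustrations, and the characters folded in the conservative version are an assumed starting set, not an agreed rule list.

```python
import re


def conservative_key(s: str) -> str:
    """Fold only variants that (presumably) never change the meaning."""
    s = re.sub("[\u00a0\u202f]", " ", s)   # non-breaking spaces -> plain space
    s = re.sub("[\u2010\u2011]", "-", s)   # Unicode hyphen variants -> "-"
    s = re.sub(r"([!?])\1+", r"\1", s)     # "!!!" -> "!", "???" -> "?"
    return s


def aggressive_key(s: str) -> str:
    """The 'wrong idea': strip all punctuation before comparing."""
    return re.sub(r"[^\w\s]", "", s)
```

With the aggressive key, "You're coming." and "You're coming?" collapse to the same string and would be falsely flagged as duplicates; the conservative key keeps them apart while still matching "Wow!!!" with "Wow!".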

sacredceltic November 15, 2014 at 5:43:51 PM UTC

The difference between the two spaces is that one is a normal space and the other is a non-breaking space.
I should be using non-breaking spaces before two-part punctuation marks (:;?!).
But as I explained earlier in a conversation on the wall with Impersonator, non-breaking spaces render as ugly squares on Tatoeba, depending on your operating system and browser.
When Tatoeba comes up with a solution that makes these non-breaking spaces look as they should, I'll use them. Meanwhile, I don't want my sentences to look like some cabalistic mess.

sacredceltic November 15, 2014 at 5:51:58 PM UTC

And to be complete: you can also see some of my sentences with non-breaking spaces, but I wasn't the one who introduced them. At some point in the history of Tatoeba, they were automatically substituted for my standard spaces by some batch procedure.

This substitution should be automatic at sentence insertion, if and only if non-breaking spaces are displayed correctly regardless of browser and operating system. I don't think that is the case now, but I haven't checked for a while. Maybe it has been improved; I haven't been informed whether a change took place.
I remember seeing these ugly squares on iOS, but I can't see them anymore on the examples you supplied.

sacredceltic November 23, 2014 at 1:03:22 PM UTC

Now that it is possible to drag-and-drop translations across pages and sentences, without the use of a bookmarklet, I find it much easier to avoid creating duplicates.
That will obviously show in the future.

For those who are not yet accustomed to the procedure, here is how I proceed:

I use a first webpage to search for sentences in language A that I want to translate.
I then use a second webpage, in a different browser tab, to search for sentences in language B that possibly match those on the first page.
If I find some, I drag each one from the second page to the first, across browser tabs, onto the new chain-link icon button above the sentence to translate, and simply drop it on this button.
It's very comfortable this way and makes for a precious gain in productivity.

It's not yet possible to drag from the main sentences if you own them (but you can always reverse the order by clicking on a direct translation) or from their blue buttons, but I was told this would come in a future release.

It will be bliss.
