Is there an open issue for simply detecting that a sentence is a duplicate and showing a warning in that case?
It should be fairly easy to normalize sentences in some respects. For example, all white-space characters could be normalized to plain ASCII space. Maybe some languages would need specialized rules, but still. It should in fact happen before a sentence is stored in the database.
I believe some people take great pains to use the proper spacing character, for example a non-breaking space around % in German, so forcing a plain space might be the wrong approach. However, maybe the merging script could report sentences that differ only by a single letter, so that an admin can decide what to do with them?
This is an interesting point. I didn't even know about the % sign rules in other languages.
Differing by one letter is too narrow a solution. What about sentences differing by two letters?
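One way to avoid hard-coding "one letter" is to report pairs whose edit distance is below a configurable threshold. The sketch below is only an illustration, not how the Tatoeba merging script works: the function names and the default threshold of 2 are assumptions, and the pairwise loop is quadratic, so it would only suit small batches.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def near_duplicates(sentences, max_distance=2):
    """Yield (a, b, distance) for pairs an admin might want to review.

    max_distance is a hypothetical tuning knob, not an existing setting.
    """
    for i, a in enumerate(sentences):
        for b in sentences[i + 1:]:
            # The length difference is a lower bound on the distance.
            if abs(len(a) - len(b)) > max_distance:
                continue
            distance = edit_distance(a, b)
            if 0 < distance <= max_distance:
                yield a, b, distance
```

The pairs are only reported, not merged, so a human still decides whether "tea" versus "tee" is a typo or a different sentence.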
Maybe normalization rules should be defined strictly per language and not employed until confirmed to be correct.
There is also always the possibility of normalizing only for the sake of comparing sentences, without carrying that normalization over into the final version of the sentence. For example, always use a plain space in comparisons, but if a sentence uses a non-breaking space, leave it (as a principle: keep the more complex variant as the one that survives deduplication, and use the simpler one as the normalized variant). But it can get messy, that is, it could bring more problems. It is a hard job.
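A minimal sketch of that idea, assuming the only difference we fold away is the kind of whitespace: sentences are grouped by a comparison key, the stored text is never modified, and the variant with the most non-plain spaces is proposed as the survivor. All names here are hypothetical, and "most special spaces" is just one possible reading of "more complex".

```python
from collections import defaultdict


def comparison_key(sentence: str) -> str:
    """Key used only for comparing; the stored sentence is left untouched.

    str.split() with no argument splits on any Unicode whitespace,
    including the non-breaking space U+00A0.
    """
    return " ".join(sentence.split())


def group_duplicates(sentences):
    """Return lists of sentences that share the same comparison key."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[comparison_key(sentence)].append(sentence)
    return [group for group in groups.values() if len(group) > 1]


def pick_survivor(group):
    """Prefer the variant with the most non-plain whitespace characters."""
    def special_spaces(sentence):
        return sum(1 for ch in sentence if ch.isspace() and ch != " ")
    return max(group, key=special_spaces)
```

With this, "Das sind 50 %." typed with a plain space and the same sentence typed with a non-breaking space fall into one group, and the non-breaking variant is the one kept.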
That would be the best approach in my humble opinion. What kind of problems are you thinking about?
As usual, finding a balance is the key. What if we get several duplicates, but each one has a slightly different issue? Which one is a good representative of that group of probable duplicates? Another thing: what if one starts with the wrong idea that, as a normalization prototype, we could remove all the punctuation and then compare things (we can't), causing a huge number of falsely identified duplicates... So: what can we use for certain to identify a duplicate, and what can't we? If only spaces were the only problem here. Still, I guess they would be a good start (and, I guess, the various sorts of hyphens, multiple consecutive signs like !!!, etc.)...
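To make the "good start" concrete, here is a hedged sketch of a slightly looser comparison key that folds whitespace, a handful of hyphen and dash variants, and runs of repeated ! or ?, while leaving all other punctuation alone. The list of dash characters and the choice to treat "!!!" like "!" are assumptions that would need confirming per language.

```python
import re

# Assumed set of dash-like characters to fold into ASCII '-':
# hyphen, non-breaking hyphen, figure dash, en dash, em dash, minus sign.
HYPHEN_VARIANTS = "\u2010\u2011\u2012\u2013\u2014\u2212"
_HYPHEN_TABLE = str.maketrans({ch: "-" for ch in HYPHEN_VARIANTS})

# A run of the same mark ("!!!" or "???") collapses to a single one.
_REPEATED_MARKS = re.compile(r"([!?])\1+")


def loose_key(sentence: str) -> str:
    """Comparison-only key; never stored as the sentence text."""
    key = " ".join(sentence.split())       # any whitespace -> one plain space
    key = key.translate(_HYPHEN_TABLE)     # dash variants -> '-'
    key = _REPEATED_MARKS.sub(r"\1", key)  # '!!!' -> '!'
    return key
```

Punctuation is deliberately kept: "Let's eat, grandma." and "Let's eat grandma." still produce different keys, which avoids the false duplicates described above.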
The difference between the 2 spaces is that one is a normal space and the other is a non-breaking space.
I should be using non-breaking spaces before double punctuation marks (: ; ? !).
But as I explained earlier in a conversation on the wall with Impersonator, non-breaking spaces render as ugly squares on Tatoeba, depending on your operating system and browser.
When Tatoeba comes up with a solution for these non-breaking spaces to look as they should, I'll use them. Meanwhile, I don't want my sentences to look like some cabalistic mess.
And to be complete, you can also see some of my sentences with non-breaking spaces, but I wasn't the one who introduced them. At some point in the history of Tatoeba, they were automatically substituted for my standard spaces by some batch procedure.
This substitution should be automatic, at sentence insertion, if and only if non-breaking spaces are displayed correctly regardless of browser and operating system. I don't think that is the case now, but I haven't checked for a while. Maybe it was improved. I haven't been informed whether a change took place.
I remember seeing these ugly squares on iOS, but I can't see them anymore on the examples you supplied.
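For illustration only, such a substitution at insertion time could look like the sketch below. It assumes French-style spacing, uses U+00A0 rather than the narrow no-break space, and only converts a space that the contributor already typed before one of : ; ? !. The function name is hypothetical, and the display check across browsers is outside what this code can decide.

```python
import re

NBSP = "\u00a0"

# One or more plain, non-breaking or narrow no-break spaces before : ; ? !
_SPACE_BEFORE_MARK = re.compile(r"[ \u00a0\u202f]+([:;?!])")


def french_spacing(sentence: str) -> str:
    """Replace the space typed before : ; ? ! with one non-breaking space."""
    return _SPACE_BEFORE_MARK.sub(NBSP + r"\1", sentence)
```

Because it requires an existing space, text such as "http://example.org" or an English "Hello!" is left unchanged.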
Now that it is possible to drag-and-drop translations across pages and sentences, without the use of a bookmarklet, I find it much easier to avoid creating duplicates.
That will obviously show in the future.
For those who are not yet accustomed to the procedure, here is how I proceed:
I use a first webpage to search for sentences in language A that I want to translate.
I then use a second webpage, in a different browser tab, to search for sentences in language B that possibly match those in the first page.
If I find some, I drag each one from the second page to the first, across the browser tabs, onto the new chain-link icon button above the sentence to translate, and I simply drop it on this button.
It's very comfortable this way and saves precious time.
It's not yet possible to drag from the main sentences if you own them (but you can always reverse the order by clicking on a direct translation) or from their blue buttons, but I was told this would come in a future release.
It will be bliss.