sysko's messages on the Wall (1,397 in total)

sysko, 18 December 2010 at 23:35:44 UTC

OK, I've added romanization for Cantonese (I finally found that crappy bug, a stupid mistake of mine in the code), but now I see that there don't actually seem to be many "words" beyond single-character transcriptions.
Anyway, this afternoon I searched the web, and it seems the world is missing an "open source" list of Cantonese words (there is a free one, CantoDict, but free as in beer, not free as in speech, and since the leader of that project doesn't plan to release the data ...)
so maybe we can start making such a list :)

sysko, 18 December 2010 at 17:28:10 UTC

OK, glad to hear that :)
By the way, I've finished coding it and have integrated it too, but I'm facing a bug (the software starts to take all the CPU resources) that occurs only when I try to put it on the real website. Weird. I'll try to see why tomorrow. (In fact the problem doesn't come from the Cantonese support itself, because I've adapted the whole Sinoparser code to be more flexible in case we add support for other Chinese languages, such as Shanghainese.)

sysko, 18 December 2010 at 15:28:53 UTC

If it's the first case, then it can be solved by adding more data (which means I can keep using the same algorithm to generate the Cantonese romanization).
Otherwise it will need a new layer in my software: a grammar analyser on top of the current lexical parser.

sysko, 18 December 2010 at 15:24:01 UTC

OK, I've finished it; now I just need to integrate it. It may produce somewhat less accurate results, and it may often cut sentences character by character, because the data file I use for Cantonese has "only" 45,000 entries (which is low once you discount the entries for every single sinogram; there aren't that many "words"), while the one for Mandarin has more than 200,000. But I will try to find other "open source" dictionaries to get better word segmentation, with the pronunciation generated from the current data file, and then, with the help of Tatoeba, we will be able to refine this step by step.

@nickyeow: for the problem you're talking about, is it like 的 in Mandarin, which can be "de" or "di" but can be guessed automatically 95% of the time, because on its own it's "de" and the entry "的确" => "dique" handles the rest (i.e. if you can segment the sentence correctly, you can guess the right romanization 99% of the time)?

Or is it much more like Japanese, where even if you can segment the sentence into "words", the pronunciation differs depending on the meaning of the words (i.e. you need to understand the sentence to guess which one to choose)?
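
To make the idea concrete, here is a minimal sketch of dictionary-based longest-match segmentation with a romanization lookup, the kind of approach the post describes. The dictionary entries and romanizations are invented for illustration; this is not the actual Sinoparser code or data.

```python
# Toy longest-match segmenter: multi-character dictionary entries win over
# single characters, which is how "的确" => "dique" overrides "的" => "de".
# Entries below are illustrative only.
ROMANIZATION = {
    "的确": "dique",
    "的": "de",
    "确": "que",
    "是": "shi",
}

def segment_and_romanize(sentence):
    """Greedy longest-match segmentation, falling back character by character."""
    max_len = max(len(key) for key in ROMANIZATION)
    result = []
    i = 0
    while i < len(sentence):
        # Try the longest possible dictionary entry first.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + length]
            if chunk in ROMANIZATION:
                result.append((chunk, ROMANIZATION[chunk]))
                i += length
                break
        else:
            # Unknown character: keep it as-is, with no romanization.
            result.append((sentence[i], "?"))
            i += 1
    return result

print(segment_and_romanize("的确是"))  # [('的确', 'dique'), ('是', 'shi')]
print(segment_and_romanize("是的"))    # [('是', 'shi'), ('的', 'de')]
```

With a larger dictionary the segmentation picks the right multi-character reading most of the time, which is the "95%/99% of the time" case described above; the Japanese-like case, where the reading depends on meaning, cannot be solved this way.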

sysko, 18 December 2010 at 09:20:07 UTC

What a coincidence: I was talking with Demetrius the other day and he showed me a list of Cantonese words with romanization, so I'm adapting "sinoparser" (let's call the software that, unless someone has an idea for a better name) to support Cantonese with Jyutping :)

sysko, 16 December 2010 at 20:10:42 UTC

http://en.wiktionary.org/wiki/W...requency_lists
There you can find frequency lists for a bunch of languages.

sysko, 16 December 2010 at 08:44:18 UTC

To make navigation easier, could you add an anchor for each of these 3 categories on your page? (Otherwise we have to scroll again and again :) )
By the way, great idea (as ever :p).
I will try to do the same for Chinese and French.

sysko, 16 December 2010 at 07:21:46 UTC

Yep, I know about this, but unfortunately the server is already used to the maximum of its capacity (the current load average is 1.36, 1.60, 1.84), so we can't enable this option without degrading the overall performance. We will certainly activate it as soon as we have the new version.

sysko, 15 December 2010 at 20:30:43 UTC

For those interested, I will try to clean up the code, add some docs, and release it as free software.

sysko, 15 December 2010 at 20:29:54 UTC

The pinyin, script detection and script conversion for sentences are now handled by homebrew software; it should fix all the problems of strange conversions (trash characters etc.).
That said, since it's "homebrew", if you find an inaccurate transcription/segmentation or a bug, please report it here :)
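A rough idea of how the script-detection part can work, as a guess at the approach rather than the actual homebrew code: count characters that exist only in one script. The character sets below are tiny, hand-picked samples for illustration.

```python
# Toy script detection for Chinese sentences. Illustrative sample sets only;
# not real coverage data and not the Tatoeba/Sinoparser implementation.
SIMPLIFIED_ONLY = set("们这说发贝")
TRADITIONAL_ONLY = set("們這說發貝")

def detect_script(sentence):
    simplified = sum(1 for ch in sentence if ch in SIMPLIFIED_ONLY)
    traditional = sum(1 for ch in sentence if ch in TRADITIONAL_ONLY)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "undetermined"  # only characters shared by both scripts were seen

print(detect_script("他们说"))  # simplified
print(detect_script("他們說"))  # traditional
```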

sysko, 12 December 2010 at 04:40:43 UTC

To be honest, before you propose another solution:
we've been thinking about this for a year, and there is no simple solution to the problem with the current architecture. Since we are only a few developers, I prefer to spend my free time on the new version rather than trying to find and develop yet another workaround, which would only delay the new version that will solve these problems in a smart way.

sysko, 12 December 2010 at 04:38:22 UTC

This system would be hell to maintain:

1 - computers are fast at dealing with numbers, but become slow when it comes to dealing with characters
2 - it would be easy to do if we were dealing with a tree, but unfortunately we're dealing with a graph, so your proposal brings the following problems:
* we would need to update it when we delete a sentence
* the same when we merge two graphs by adding a link
Moreover, it still wouldn't solve the real problem, which is traversing the graph: you would still need to traverse it to discover that there is already an "epo2", and so on.
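
For context, a minimal sketch of what "traversing the graph" means here: sentences are nodes, translation links are edges, and finding every direct and indirect translation means walking the whole connected component. The data below is invented; it is not Tatoeba's actual schema or code.

```python
from collections import deque

# Toy translation graph: sentence id -> ids of directly linked translations.
LINKS = {
    "fra1": {"eng1"},
    "eng1": {"fra1", "epo1"},
    "epo1": {"eng1", "epo2"},
    "epo2": {"epo1"},
}

def all_translations(sentence_id):
    """Breadth-first traversal: every translation, whatever the depth."""
    seen = {sentence_id}
    queue = deque([sentence_id])
    while queue:
        current = queue.popleft()
        for neighbour in LINKS.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(sentence_id)
    return seen

# From "fra1", the indirect translation "epo2" is only reachable by walking
# the graph, which is exactly the expensive part for a plain SQL database.
print(all_translations("fra1"))  # {'eng1', 'epo1', 'epo2'}
```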

sysko, 10 December 2010 at 20:08:46 UTC

So, as a side effect, it would degrade performance not only for those who want the feature, but for everyone else too.

sysko, 10 December 2010 at 20:07:40 UTC

In fact, what I've shown was made with the new version.
It's hell to code with a normal database, and unfortunately we only have one server, so if the server takes 10 seconds to generate my page, then during those 10 seconds people who don't care about this will also have to wait 10 seconds.

sysko, 10 December 2010 at 14:17:31 UTC

So yep, it's possible, but the script was easier to do, and it was done as a temporary solution while we finish the new version.

sysko, 10 December 2010 at 14:16:15 UTC

Unfortunately, as discussed before, the reason we can't show the whole translation graph is that normal database systems are really bad at this kind of operation. So with the current system, the best we can do with all possible optimizations is a chain two degrees deep.
In theory it would be possible, but it would be slow as hell.

That's why we've started to build our own database server for our specific needs, to make this possible.
So in the future it will be possible:
http://static.tatoeba.org/425123.html (it's a screenshot of the version I have on my computer; don't pay attention to how ugly it is). As you can see there, we show every translation, whatever the degree of depth.
And anyway, our database will be able to detect duplicates on the fly. :)
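
A minimal sketch of what "detect duplicates on the fly" could look like: normalize the incoming sentence and check a lookup table before inserting. The normalization rules and the storage class here are assumptions made for the sketch, not the behaviour of the actual engine.

```python
import unicodedata

class SentenceStore:
    """Toy store that refuses to insert a sentence that is already present."""

    def __init__(self):
        self._by_key = {}   # normalized text -> sentence id
        self._next_id = 1

    @staticmethod
    def _normalize(text):
        # Assumed normalization for the sketch: Unicode NFC + squashed spaces.
        text = unicodedata.normalize("NFC", text)
        return " ".join(text.split())

    def add(self, text):
        key = self._normalize(text)
        if key in self._by_key:
            return self._by_key[key], False   # existing id, nothing inserted
        sentence_id = self._next_id
        self._next_id += 1
        self._by_key[key] = sentence_id
        return sentence_id, True

store = SentenceStore()
print(store.add("Bonan tagon."))    # (1, True)
print(store.add("Bonan  tagon."))   # (1, False): caught as a duplicate
```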

sysko, 10 December 2010 at 06:52:36 UTC

For duplicates, I'm going to test another solution that handles them in a semi-automatic way:

I will replicate the Tatoeba database on my personal computer.
I will run on it a slower but safer script (one I couldn't run here without slowing down tatoeba.org for several hours) that will output all the modifications that need to be made to the database.
I will run that output script on tatoeba.org.
This way it should be OK.
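
The "output the modifications instead of applying them" step might look roughly like this. The table and column names are invented for the sketch, and the real deduplication script is certainly more involved.

```python
from collections import defaultdict

def build_merge_script(sentences):
    """sentences: iterable of (id, lang, text).

    Groups exact duplicates and prints the statements to run later on the
    live server, instead of touching any database directly.
    """
    groups = defaultdict(list)
    for sentence_id, lang, text in sentences:
        groups[(lang, text.strip())].append(sentence_id)

    for ids in groups.values():
        if len(ids) < 2:
            continue
        keep, *drop = sorted(ids)   # keep the lowest (oldest) id
        for dup in drop:
            # Re-point links to the kept sentence, then remove the duplicate.
            print(f"UPDATE links SET sentence_id = {keep} WHERE sentence_id = {dup};")
            print(f"UPDATE links SET translation_id = {keep} WHERE translation_id = {dup};")
            print(f"DELETE FROM sentences WHERE id = {dup};")

build_merge_script([
    (1, "epo", "Saluton!"),
    (2, "epo", "Saluton!"),
    (3, "fra", "Salut !"),
])
```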

sysko, 7 December 2010 at 06:41:34 UTC

I think it's just because you don't have the right font. On computers (I don't know exactly about Mac, but on Linux/Windows/etc. this is the case), character rendering behaves as follows:

1 try to display the character with the font specified by the software
2 if that font is not present, or the character can't be rendered by it, a set of rules picks some fallback fonts
3 if no font can render the character, display a box

So even if the CSS used a font that has no Malayalam characters, your OS would have used another one that does.
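
A toy model of that three-step fallback rule, just to make the logic explicit. The font names and their coverage are made up; real font selection is done by the OS text stack, not by code like this.

```python
# Invented fonts and script coverage, purely for illustration.
FONT_COVERAGE = {
    "SiteFont": {"latin"},
    "FallbackSans": {"latin", "cyrillic"},
    "MalayalamFont": {"malayalam"},
}
FALLBACK_ORDER = ["FallbackSans", "MalayalamFont"]

def pick_font(requested_font, script):
    # 1. try the font requested by the software (e.g. via the CSS)
    if script in FONT_COVERAGE.get(requested_font, set()):
        return requested_font
    # 2. otherwise walk the fallback fonts
    for font in FALLBACK_ORDER:
        if script in FONT_COVERAGE[font]:
            return font
    # 3. nothing can render it: the renderer shows a box
    return None

print(pick_font("SiteFont", "malayalam"))  # MalayalamFont
print(pick_font("SiteFont", "han"))        # None -> displayed as a box
```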

sysko, 6 December 2010 at 19:18:07 UTC

It's Unicode encoding; maybe you don't have the right fonts for Malayalam?

sysko, 6 December 2010 at 18:34:41 UTC

What about now?