Wall messages from sysko (1,397 in total)

sysko, December 18, 2010 at 23:35:44 UTC

ok I've added romanization for Cantonese (finally found that crappy bug, a stupid mistake of mine in the code), but now I see that there don't actually seem to be that many "words" beyond single-character transcriptions.
Anyway, this afternoon I checked the web, and it seems the world is missing an "open source" list of Cantonese words. There is one free list, CantoDict, but free as in beer, not free as in speech, and as the leader of that project doesn't plan to release the data ...
so maybe we can start making such a list :)

sysko, December 18, 2010 at 17:28:10 UTC

ok glad to hear that :)
btw I've finished coding it and also integrated it, but I'm facing a bug (the software starts to take all the CPU resources) which occurs only when I try to put it on the real website Oo, weird. I will try to see why tomorrow (in fact the problem doesn't come from the Cantonese itself, because I've adapted the entire code of Sinoparser to be more flexible in case we add support for other Chinese languages, such as Shanghainese).

sysko, December 18, 2010 at 15:28:53 UTC

if it's the first case, then it can be solved by adding more data (which means I can continue to use the same algorithm for generating the romanization of Cantonese);
otherwise it will need a new layer in my software: adding a grammar analyser on top of the current lexical parser.

sysko, December 18, 2010 at 15:24:01 UTC

ok I've finished it, now I just need to integrate it. It will maybe produce less accurate results, and will maybe often cut sentences character by character, as the data file I use for Cantonese has "only" 45,000 entries (which is low once you count the entries for every single sinogram; there aren't that many "words"), while the one for Mandarin has more than 200,000. But I will try to find other "open source" dictionaries to get better word segmentation, with the pronunciation generated from the current data file, and then with the help of Tatoeba we will be able to refine it step by step.

@nickyeow for the problem you're talking about, is it like 的 in Mandarin, which can be "de" or "di" but which we can guess 95% of the time automatically, because alone it's "de", and by having the entry "的确" => "dique" we can handle the rest (i.e. if you can segment the sentence correctly, then you can guess 99% of the time which romanization it is)?

or is it much more like Japanese, where even if you can segment the sentence into "words", the pronunciation differs depending on the meaning of the word (i.e. you need to understand the sentence to guess which one to choose)?
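The 的/的确 case above can be sketched as greedy longest-match segmentation over a romanization dictionary. This is a minimal illustration, not Sinoparser's actual code; the dictionary entries and readings below are just the examples from the message:

```python
# Greedy longest-match segmentation with a romanization lookup.
# A real data file (~45,000 entries for Cantonese, ~200,000 for
# Mandarin) would drive this; these three entries are illustrative.
DICT = {
    "的": "de",       # default reading when the character stands alone
    "的确": "dique",  # multi-character entry overrides the default
    "真": "zhen",
}

def romanize(sentence):
    """Segment left to right, always preferring the longest entry."""
    out, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):  # longest match first
            word = sentence[i:j]
            if word in DICT:
                out.append(DICT[word])
                i = j
                break
        else:
            out.append(sentence[i])  # unknown character: pass through
            i += 1
    return out

print(romanize("真的"))  # 的 stands alone, so it reads "de"
print(romanize("的确"))  # matched as one word, so it reads "dique"
```

With no match for 的确 in the dictionary, the same code would fall back to character-by-character output, which is exactly the degradation described for the smaller Cantonese data file.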

sysko, December 18, 2010 at 09:20:07 UTC

what a coincidence, I was speaking with Demetrius the other day and he showed me a list of Cantonese words with romanization, so I'm adapting "Sinoparser" (let's call the software that, unless someone has an idea for a better name) to support Cantonese with Jyutping :)

sysko, December 16, 2010 at 20:10:42 UTC

http://en.wiktionary.org/wiki/W...requency_lists
there you can find frequency lists for a bunch of languages.

sysko, December 16, 2010 at 08:44:18 UTC

to ease navigation, can you add anchors for these 3 categories on your page? (otherwise we need to scroll again and again :) )
by the way, great idea (as ever :p)
I will try to do the same for Chinese and French.

sysko, December 16, 2010 at 07:21:46 UTC

Yep, I know about this, but unfortunately the server is already used to the maximum of its capacity (the current load average is 1.36 1.60 1.84), so we can't enable this option without degrading the overall performance; but for sure we will activate it as soon as we have the new version.

sysko, December 15, 2010 at 20:30:43 UTC

for those interested: I will try to clean up the code, add some docs, and release it as free software.

sysko, December 15, 2010 at 20:29:54 UTC

The pinyin and the script detection / script conversion for sentences are now done by homebrew software, which should fix all the problems of strange conversions (trash characters etc.).
That said, as it's "homebrew", if you find an inaccurate transcription/segmentation or a bug, please report it here :)
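Script detection of the kind mentioned above can be sketched by counting characters that exist in only one of the two scripts. This is a toy illustration, not the homebrew tool itself; the two character sets below are tiny invented samples, where a real implementation would use full Simplified/Traditional tables:

```python
# Toy script detection: classify a sentence as Simplified or Traditional
# Chinese by counting characters unique to each script. The sets here
# hold five sample pairs only; real tables cover thousands of characters.
SIMPLIFIED_ONLY = set("们这发对后")
TRADITIONAL_ONLY = set("們這發對後")

def detect_script(sentence):
    simp = sum(c in SIMPLIFIED_ONLY for c in sentence)
    trad = sum(c in TRADITIONAL_ONLY for c in sentence)
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    return "ambiguous"  # only characters shared by both scripts were seen

print(detect_script("我们"))  # 们 is Simplified-only
print(detect_script("我們"))  # 們 is Traditional-only
```

The "ambiguous" case is real: many sentences contain only characters shared by both scripts, which is one reason such conversions can go wrong.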

sysko, December 12, 2010 at 04:40:43 UTC

to be honest, before you propose another solution:
we've been thinking about this for a year, and there's no simple solution to the problem with the current architecture. As we're few developers, I prefer to focus my free time on the new version rather than trying to find and develop yet another workaround, which would only delay the new version that will solve these problems in a smart way.

sysko, December 12, 2010 at 04:38:22 UTC

this system would be hell to maintain

1 - computers are fast at dealing with numbers, but become slow when it comes to dealing with characters
2 - it would be easy to do if this were all about trees, but unfortunately we're dealing with graphs, so your proposal brings the following problems:
* we would need to update it when we delete a sentence
* the same when we merge two graphs by adding a link
and moreover it still wouldn't solve the real problem, which is traversing the graph: you would still need to traverse it to discover that there is already an "epo2", and so on
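The merge problem described above can be shown with a toy model: if each sentence carries a precomputed group label, then adding one link between two translation groups forces a rewrite of every label in one of them. The sentence ids and group names here are invented for illustration:

```python
# Why precomputed group labels are costly on a graph of translations:
# linking two groups means relabeling every sentence of one group.
# (Deleting a sentence forces a similar sweep.) All ids are invented.
group_of = {"eng1": "G1", "epo1": "G1",   # one translation group
            "fra1": "G2", "epo2": "G2"}   # another translation group

def add_link(a, b):
    """Merge b's group into a's; return how many labels were rewritten."""
    ga, gb = group_of[a], group_of[b]
    if ga == gb:
        return 0
    rewritten = 0
    for sid, g in group_of.items():   # full scan: the maintenance cost
        if g == gb:
            group_of[sid] = ga
            rewritten += 1
    return rewritten

print(add_link("eng1", "fra1"))  # merging the two groups rewrites 2 labels
```

The cost grows with the size of the merged group, and as the message notes, the labels still don't answer the traversal question by themselves.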

sysko, December 10, 2010 at 20:08:46 UTC

so as a collateral effect it would affect the performance not only of those who want the feature.

sysko, December 10, 2010 at 20:07:40 UTC

in fact what I've shown was made with the new version.
It's hell to code with a normal database, and unfortunately we only have one server, so if the server takes 10 seconds to generate my page, during those 10 seconds people who don't care about the feature will also have to wait 10 seconds.

sysko, December 10, 2010 at 14:17:31 UTC

so yep, it's possible, but the script was easier to do, and was done as a temporary solution while we finish the new version.

sysko, December 10, 2010 at 14:16:15 UTC

unfortunately, as discussed before, the reason we can't show the whole translation graph is that normal database systems are really bad at this kind of operation. So the best we can do with the current system, even with every possible optimization, is a chain of depth 2.
In theory it would be possible, but it would be slow as hell.

That's the reason we've started to build our own database server for our specific needs, to permit this.
So in the future it will be possible:
http://static.tatoeba.org/425123.html (it's a screenshot of the version I have on my computer; don't pay attention to how ugly it is). As you can see there, we view every translation, whatever the depth.
And anyway our database will be able to detect duplicates on the fly. :)
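The "depth 2" limit versus the full translation graph can be sketched as a breadth-first traversal with and without a depth cap. The link data below is invented; in a classic relational schema each extra degree of depth costs another self-join, which is what the depth cap stands in for:

```python
from collections import deque

# Translation links as an adjacency list (sentence ids are invented).
LINKS = {
    "eng1": ["fra1"],
    "fra1": ["eng1", "deu1"],
    "deu1": ["fra1", "epo1"],
    "epo1": ["deu1"],
}

def translations(start, max_depth=None):
    """All sentences reachable from `start`, optionally depth-capped."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth == max_depth:
            continue  # stop expanding past the cap
        for nxt in LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}

print(sorted(translations("eng1", max_depth=2)))  # direct and indirect only
print(sorted(translations("eng1")))               # the whole graph
```

With the cap, "epo1" (three links away from "eng1") is invisible; without it, the traversal reaches every translation, which is what the screenshot demonstrates.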

sysko, December 10, 2010 at 06:52:36 UTC

For duplicates, I'm going to test another solution to handle this in a semi-automatic way:

I will replicate the Tatoeba database on my personal computer.
I will run on it a slower but safer script (one that I couldn't run here without slowing down tatoeba.org for some hours) that will output all the modifications that need to be made to the database.
I will then run that output script on tatoeba.org.
This way it should be OK.
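The offline step of the pipeline above, finding the duplicates and emitting the modifications to replay on the live site, can be sketched like this. The rows and the (keep, drop) output format are invented for illustration, not the actual script:

```python
from collections import defaultdict

# Offline duplicate finder: scan a copy of the sentence table, group
# exact duplicates per language, and emit the merges to replay on the
# live server. The sample rows are invented: (id, language, text).
rows = [
    (1, "eng", "Hello."),
    (2, "fra", "Bonjour."),
    (3, "eng", "Hello."),   # duplicate of sentence 1
]

def plan_merges(rows):
    """Return (keep_id, drop_id) pairs, keeping the lowest id per group."""
    groups = defaultdict(list)
    for sid, lang, text in rows:
        groups[(lang, text)].append(sid)
    plan = []
    for ids in groups.values():
        keep, *drops = sorted(ids)
        plan.extend((keep, drop) for drop in drops)
    return plan

print(plan_merges(rows))  # sentence 3 merges into sentence 1
```

Separating the slow scan (run on the replica) from the fast replay of its output (run on tatoeba.org) is exactly what keeps the live server responsive.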

sysko, December 7, 2010 at 06:41:34 UTC

I think it's just because you don't have the right font, because on computers (I don't know exactly on Mac, but on Linux/Windows/etc. this is the case) the behaviour for character rendering is the following:

1 try to display the character with the font specified by the software
2 if the font is not present, or the character can't be rendered by this font, then a set of rules selects some fallback fonts
3 if no font can render the character, then display a box

so even if the CSS specified a font with no Malayalam characters, your OS would have used another one that has them.
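The three-step fallback above can be sketched as a small lookup. This is an illustration of the rule, not any OS's real implementation: the font names and coverage sets are invented, and `can_render` stands in for the platform's actual glyph lookup:

```python
# Toy model of font fallback: try the requested font, then a fallback
# list, then give up (the renderer would draw a box). Font names and
# their character coverage below are invented for illustration.
COVERAGE = {
    "SiteFont": set("abc"),           # no Malayalam glyphs
    "FallbackMalayalam": set("മല"),   # covers the Malayalam sample chars
}

def can_render(font, char):
    return char in COVERAGE.get(font, set())

def pick_font(char, requested, fallbacks=("FallbackMalayalam",)):
    if can_render(requested, char):   # step 1: the font the CSS asked for
        return requested
    for font in fallbacks:            # step 2: walk the fallback rules
        if can_render(font, char):
            return font
    return None                       # step 3: no font found, draw a box

print(pick_font("മ", "SiteFont"))  # falls back to the Malayalam font
print(pick_font("x", "SiteFont"))  # nothing can render it
```

Seeing boxes therefore means step 3 was reached: no installed font covered the character, which matches the diagnosis in the message.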

sysko, December 6, 2010 at 19:18:07 UTC

it's Unicode encoding; maybe you don't have the right fonts for Malayalam?

sysko, December 6, 2010 at 18:34:41 UTC

what about now?