OK, I've added romanization for Cantonese (I finally found that crappy bug, a stupid mistake of mine in the code), but now I see that there don't seem to be that many "words" beyond single-character transcriptions.
Anyway, this afternoon I searched the web, and it seems the world is missing an "open source" list of Cantonese words (there is a free one, CantoDict, but free as in beer, not free as in speech, and as the leader of that project doesn't plan to release the data...).
So maybe we can start making such a list :)

OK, glad to hear that :)
By the way, I've finished coding it and integrating it, but I'm facing a bug (the software starts eating all the CPU resources) which only occurs when I try to deploy it on the real website. Oo, weird. I will try to see why tomorrow. (In fact the problem doesn't come from the Cantonese support itself, because I've adapted the entire Sinoparser codebase to be flexible enough to support other Chinese languages, such as Shanghainese.)

If it's the first case, then it can be solved by adding more data (which means I can continue to use the same algorithm for generating the romanization of Cantonese);
otherwise it will need a new layer in my software: a grammar analyzer on top of the current lexical parser.

OK, I've finished it; now I just need to integrate it. It will probably produce less accurate results, and will often fall back to cutting the text character by character, as the data file I use for Cantonese has "only" 45,000 entries (which is low once you discount the entries for individual sinograms; there aren't that many actual "words"), while the one for Mandarin has more than 200,000. But I will try to find other "open source" dictionaries to get better word segmentation, with the pronunciation generated from the current data file, and then with the help of Tatoeba we will be able to refine it step by step.
@nickyeow: for the problem you're talking about, is it like 的 in Mandarin, which can be "de" or "di" but which we can guess automatically 95% of the time, because alone it's "de" and the entry "的确" => "dique" handles the other case (i.e. if you can segment the sentence correctly, you can guess the right romanization 99% of the time; see the sketch below)?
Or is it more like Japanese, where even if you're able to segment the sentence into "words", the pronunciation differs depending on the meaning of the words (i.e. you need to understand the sentence to know which one to choose)?
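
For the first case, a greedy longest-match segmentation over the romanization dictionary is enough. Here is a minimal sketch of that idea in Python; the tiny LEXICON and the segment() helper are illustrative assumptions, not Sinoparser's actual code:

# Minimal sketch of dictionary-based longest-match segmentation.
# The LEXICON below is a made-up fragment, not a real data file.
LEXICON = {
    "的": "de",       # alone, 的 reads "de"
    "的确": "dique",  # inside 的确 it reads differently
    "确": "que",
}

def segment(sentence, lexicon, max_len=4):
    """Cut the sentence into the longest dictionary matches, left to right."""
    result = []
    i = 0
    while i < len(sentence):
        # try the longest possible match first
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if word in lexicon or length == 1:
                # unknown single characters are kept and cut one by one
                result.append((word, lexicon.get(word, "?")))
                i += length
                break
    return result

print(segment("的确", LEXICON))  # [('的确', 'dique')]: the multi-character entry wins
print(segment("的", LEXICON))   # [('的', 'de')]: alone, the default reading applies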

What a coincidence: I was speaking with Demetrius recently and he showed me a list of Cantonese words with romanization, so I'm adapting "Sinoparser" (let's call the software that, unless someone has an idea for a better name) to support Cantonese with Jyutping :)

http://en.wiktionary.org/wiki/W...requency_lists
There you can find frequency lists for a bunch of languages.

To ease navigation, can you add an anchor for each of these 3 categories on your page? (Otherwise we need to scroll again and again :))
By the way, great idea (as ever :p).
I will try to do the same for Chinese and French.

Yep, I know about this, but unfortunately the server is already used to the maximum of its capacity (the current load average is 1.36 1.60 1.84), so we can't enable this option without degrading overall performance. But for sure we will activate it as soon as we have the new version.

For those interested: I will try to clean up the code, add some docs, and release it as free software.

The pinyin, script detection and script conversion for sentences are now handled by homebrew software; it should fix all the problems of strange conversions (garbage characters, etc.).
As said, since it's "homebrew", if you find an inaccurate transcription/segmentation or a bug, please report it here :)
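
As a rough illustration of the script-detection part (a hypothetical sketch, not the actual software): a sentence can be classified as simplified or traditional by checking its characters against the variant pairs of a conversion table.

# Hypothetical sketch of Chinese script detection; the tiny VARIANTS table
# stands in for a full simplified<->traditional conversion dictionary.
VARIANTS = {"国": "國", "发": "發", "么": "麼"}  # simplified -> traditional
SIMPLIFIED_ONLY = set(VARIANTS)
TRADITIONAL_ONLY = set(VARIANTS.values())

def detect_script(sentence):
    """Return 'simplified', 'traditional', or 'ambiguous' for a sentence."""
    simp = sum(c in SIMPLIFIED_ONLY for c in sentence)
    trad = sum(c in TRADITIONAL_ONLY for c in sentence)
    if simp and not trad:
        return "simplified"
    if trad and not simp:
        return "traditional"
    return "ambiguous"  # shared characters only, or mixed input

print(detect_script("中国"))  # 'simplified'
print(detect_script("中國"))  # 'traditional'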

To be honest, before you propose another solution:
we've been thinking about this for a year, and there's no simple solution to this problem with the current architecture. As we're only a few developers, I prefer to focus my free time on the new version rather than trying to find and develop a workaround, which would only delay the new version that will solve these problems in a smart way.

This system would be hell to maintain:
1 - computers are fast at dealing with numbers, but become slow when it comes to dealing with characters
2 - it would be easy to do if we were dealing with trees, but unfortunately we're dealing with graphs, so your proposition brings the following problems:
* we would need to update it when we delete a sentence
* the same when we merge two graphs by adding a link
And moreover it still wouldn't solve the real problem, which is traversing the graph: you would still need to traverse it to discover that there is already an "epo2", and so on (see the sketch below).
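
To make the traversal point concrete, here is a minimal sketch (made-up data and names, not Tatoeba's code) showing that finding all translations of a sentence means walking the graph link by link:

from collections import deque

# Hypothetical translation graph: sentence id -> ids of directly linked
# translations. The links are symmetric, which makes this a graph, not a tree.
LINKS = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}

def all_translations(start):
    """Breadth-first traversal: collect every directly or indirectly
    linked sentence, however deep the chain is."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in LINKS.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

print(all_translations(1))  # {2, 3, 4, 5}: sentence 5 is 3 links away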

So, as a collateral effect, it would hurt the performance not only for those who want this feature, but for everyone.

In fact, what I've shown was made with the new version.
It's hell to code with a normal database, and unfortunately we only have one server, so if the server takes 10 seconds to generate my page, during those 10 seconds people who don't care about the feature will also have to wait 10 seconds.

So yep, it's possible; the script was easier to do, and was made as a temporary solution while we finish the new version.

Unfortunately, as discussed before, the reason we can't show the whole translation graph is that normal database systems are really bad at this kind of operation. So the best we can do with all possible optimizations, with the current system, is a chain of depth 2 (see the sketch below).
In theory it would be possible, but it would be slow as hell.
That's the reason why we've started to build our own database server for our specific needs, to permit this.
So in the future it will be possible.
http://static.tatoeba.org/425123.html (it's a screenshot of the version I have on my computer; don't pay attention to how ugly it is). As you can see there, we show every translation, whatever the depth.
And anyway, our database will be able to detect duplicates on the fly. :)
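
For context, "depth 2" means direct translations plus translations of translations: in a relational database each extra level costs another self-join, which is why it stops being practical beyond that. A hypothetical sketch of the idea (the links table and ids are made up, not Tatoeba's real schema):

import sqlite3

# Hypothetical links table: one row per directed translation link.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE links (sentence_id INTEGER, translation_id INTEGER)")
db.executemany("INSERT INTO links VALUES (?, ?)",
               [(1, 2), (2, 1), (2, 4), (4, 2), (4, 5), (5, 4)])

# Depth 1 is a plain lookup; depth 2 is one self-join away. Every further
# level needs yet another self-join, so the cost grows with the depth.
rows = db.execute("""
    SELECT DISTINCT l2.translation_id
    FROM links l1
    JOIN links l2 ON l2.sentence_id = l1.translation_id
    WHERE l1.sentence_id = 1 AND l2.translation_id != 1
""").fetchall()
print(rows)  # [(4,)]: sentence 4 is reachable, sentence 5 (depth 3) is not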

For duplicates, I'm going to test another solution, to handle this in a semi-automatic way (there's a sketch of the idea below):
I will replicate the Tatoeba database on my personal computer.
I will run on it a slower but safer script (one that I couldn't run here without slowing down tatoeba.org for several hours) that will output all the modifications that need to be made to the database.
I will run that output script on tatoeba.org.
So this way it should be OK.
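
A minimal sketch of what that offline pass could look like (the schema, table and column names are assumptions, not Tatoeba's real ones; a tiny in-memory table stands in for the replica):

import sqlite3

# Offline pass: find duplicate sentences grouped by (lang, text), keep the
# oldest copy, and emit the SQL that would be replayed on the live server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sentences (id INTEGER, lang TEXT, text TEXT)")
db.executemany("INSERT INTO sentences VALUES (?, ?, ?)",
               [(1, "eng", "Hello."), (2, "eng", "Hello."), (3, "fra", "Salut.")])

statements = []
rows = db.execute("""
    SELECT GROUP_CONCAT(id) FROM sentences
    GROUP BY lang, text HAVING COUNT(*) > 1
""")
for (ids,) in rows:
    keep, *extras = sorted(map(int, ids.split(",")))
    for dup in extras:
        # repoint translation links to the kept copy, then drop the duplicate
        statements.append(f"UPDATE links SET translation_id = {keep} "
                          f"WHERE translation_id = {dup};")
        statements.append(f"DELETE FROM sentences WHERE id = {dup};")

print("\n".join(statements))  # this output is what gets run on tatoeba.org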

I think it's just because you don't have the right font. On computers (I don't know exactly about Mac, but on Linux/Windows/etc. this is the case), character rendering behaves as follows:
1 - try to display the character with the font specified by the software
2 - if that font is not present, or the character can't be rendered by it, a set of rules picks some fallback fonts
3 - if no font can render the character, display a box
So even if the CSS specified a font with no Malayalam characters, your OS would have used another one that has them (see the sketch below).
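
A toy model of those three rules (the font names and coverage sets are made up for illustration):

# Toy model of font fallback; FONTS maps a font name to the set of
# characters it can render.
FONTS = {
    "Arial": set("abc"),                   # no Malayalam coverage
    "AnjaliOldLipi": set("\u0d2e\u0d32"),  # a Malayalam-capable font
}
FALLBACK_ORDER = ["AnjaliOldLipi"]

def pick_font(char, requested):
    # 1 - the font requested by the page/software
    if requested in FONTS and char in FONTS[requested]:
        return requested
    # 2 - system fallback fonts, in order
    for name in FALLBACK_ORDER:
        if char in FONTS[name]:
            return name
    # 3 - nothing can render it: the renderer draws a box
    return None

print(pick_font("\u0d2e", "Arial"))  # 'AnjaliOldLipi', not a box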

It's Unicode encoding; maybe you don't have the right fonts for Malayalam?

What about now?