sysko's messages on the Wall (total 1397)

sysko, December 18, 2010 at 11:35:44 PM UTC

OK, I've added romanization for Cantonese (I finally found that crappy bug, a stupid mistake of mine in the code), but now I see that there don't seem to be that many "words" beyond single-character transcriptions.
Anyway, this afternoon I searched the web, and it seems the world is missing an open-source list of Cantonese words. There is a free one, CantoDict, but free as in beer, not free as in speech, and since the leader of that project doesn't plan to release the data...
So maybe we can start making such a list :)

sysko, December 18, 2010 at 5:28:10 PM UTC

OK, glad to hear that :)
By the way, I've finished coding it and have integrated it, but I'm facing a bug (the software starts eating all the CPU) that occurs only when I deploy it on the real website. Weird. I'll try to see why tomorrow. (The problem doesn't actually come from the Cantonese support itself, because I adapted the entire Sinoparser codebase to be more flexible in case we add support for other Chinese languages, such as Shanghainese.)

sysko, December 18, 2010 at 3:28:53 PM UTC

If it's the first case, then it can be solved by adding more data (which means I can keep using the same algorithm to generate the Cantonese romanization).
Otherwise it will need a new layer in my software: a grammar analyser on top of the current lexical parser.

sysko, December 18, 2010 at 3:24:01 PM UTC

OK, I've finished it; now I just need to integrate it. It may produce less accurate results, and will probably often cut the text character by character, because the data file I use for Cantonese has "only" 45,000 entries (which is low once you count the entries for every single sinogram; there aren't that many "words"), while the Mandarin one has more than 200,000. But I'll try to find other open-source dictionaries to get better word segmentation, with the pronunciation generated from the current data file, and then, with Tatoeba's help, we'll be able to refine it step by step.

@nickyeow: is the problem you're talking about like 的 in Mandarin, which can be "de" or "di" but can be guessed automatically 95% of the time, because alone it's "de", and having the entry "的确" => "dique" handles the other case (i.e. if you can segment the sentence correctly, you can guess the right romanization 99% of the time)?

Or is it much more like Japanese, where even if you can segment the sentence into "words", the pronunciation differs depending on the meaning of the words (i.e. you need to understand the sentence to guess which one to choose)?
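The 的/的确 trick above can be sketched in a few lines. This is a toy illustration, not Sinoparser's actual code: the five-entry lexicon and its numbered-tone pinyin readings are made up, but the greedy longest-match ("maximal munch") loop is the standard way a lexical parser exploits multi-character entries to pick the right reading.

```python
# Illustrative lexicon: multi-character entries let the segmenter
# choose the right reading ("的确" -> di2que4, lone "的" -> de5).
LEXICON = {
    "我": "wo3",
    "的": "de5",
    "的确": "di2que4",
    "确": "que4",
    "是": "shi4",
}
MAX_LEN = max(len(word) for word in LEXICON)

def segment(text):
    """Greedy longest-match segmentation over the lexicon."""
    out, i = [], 0
    while i < len(text):
        # Try the longest possible word first, then shrink.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if word in LEXICON:
                out.append((word, LEXICON[word]))
                i += length
                break
        else:
            out.append((text[i], "?"))  # unknown: cut character by character
            i += 1
    return out
```

With a 45,000-entry file many fragments fall into the unknown branch, which is exactly the "cut character by character" behaviour described above.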

sysko, December 18, 2010 at 9:20:07 AM UTC

What a coincidence: I was speaking with Demetrius recently and he showed me a list of Cantonese words with romanization, so I'm adapting "Sinoparser" (let's call the software that, unless someone has an idea for a better name) to support Cantonese with Jyutping :)

sysko, December 16, 2010 at 8:10:42 PM UTC

http://en.wiktionary.org/wiki/W...requency_lists
There you can find frequency lists for a bunch of languages.

sysko, December 16, 2010 at 8:44:18 AM UTC

To ease navigation, could you add anchors for these three categories on your page? (Otherwise we need to scroll again and again :) )
By the way, great idea (as ever :p).
I'll try to do the same for Chinese and French.

sysko, December 16, 2010 at 7:21:46 AM UTC

Yep, I know about this, but unfortunately the server is already used to the maximum of its capacity (the current load average is 1.36 1.60 1.84), so we can't enable this option without degrading overall performance. We will activate it as soon as we have the new version, though.

sysko, December 15, 2010 at 8:30:43 PM UTC

For those interested: I'll try to clean up the code, add some docs, and release it as free software.

sysko, December 15, 2010 at 8:29:54 PM UTC

The pinyin generation and the script detection/conversion for sentences are now done by homebrew software, which should fix all the strange-conversion problems (trash characters, etc.).
That said, since it's homebrew, if you find an inaccurate transcription/segmentation or a bug, please report it here :)
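The post doesn't say how the script detection works, but a common approach, sketched here as a guess, is to look for characters that exist in only one of the two scripts and let them vote. The two character sets below are tiny made-up samples, not real data files.

```python
# Hypothetical sketch of simplified/traditional script detection:
# characters unique to one script identify the sentence's script.
SIMPLIFIED_ONLY = set("个东书马")
TRADITIONAL_ONLY = set("個東書馬")

def detect_script(sentence):
    simp = sum(char in SIMPLIFIED_ONLY for char in sentence)
    trad = sum(char in TRADITIONAL_ONLY for char in sentence)
    if simp and not trad:
        return "simplified"
    if trad and not simp:
        return "traditional"
    return "ambiguous"  # only shared characters, or a mix of both
```

A sentence built only from characters shared by both scripts stays "ambiguous", which is one place a detector like this can produce the strange conversions mentioned above.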

sysko, December 12, 2010 at 4:40:43 AM UTC

To be honest, before you propose another solution:
we've been thinking about this for a year, and there's no simple fix for this problem with the current architecture. As we're few developers, I prefer to focus my free time on the new version rather than trying to find and develop a stopgap, which would only delay the new version that will solve these problems in a smart way.

sysko, December 12, 2010 at 4:38:22 AM UTC

This system would be hell to maintain:

1 - computers are fast with numbers, but slow down when dealing with characters
2 - it would be easy if we were dealing with trees, but unfortunately we're dealing with graphs, so your proposal brings the following problems:
* we would need to update it when we delete a sentence
* likewise when we merge two graphs by adding a link
Moreover, it still wouldn't solve the underlying problem, which is traversing the graph: you'd still need to traverse it to discover there's already an "epo2", and so on.
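A tiny sketch of the maintenance problem being described: suppose each sentence caches a precomputed translation-group label (the ids, labels, and links below are hypothetical). A single new link invalidates a whole group's worth of cached labels.

```python
# Hypothetical cache: sentence id -> precomputed translation-group label.
group_of = {1: "epo1", 2: "epo1", 3: "epo2", 4: "epo2"}

# Adding one link between sentences 2 and 3 merges the two graphs,
# so every member of "epo2" has to be rewritten:
after_link = {s: ("epo1" if g == "epo2" else g) for s, g in group_of.items()}

# Deleting a sentence is worse: its group may split in two, and finding
# the new components requires traversing the graph anyway.
```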

sysko, December 10, 2010 at 8:08:46 PM UTC

So, as a collateral effect, it would hurt the performance not only of those who want the feature.

sysko, December 10, 2010 at 8:07:40 PM UTC

In fact, what I showed was made with the new version.
It's hell to code with a normal database, and unfortunately we only have one server, so if the server takes 10 seconds to generate my page, then during those 10 seconds people who don't care about the feature will also have to wait.

sysko, December 10, 2010 at 2:17:31 PM UTC

So yep, it's possible; the script was just easier to do, and was meant as a temporary solution while we finish the new version.

sysko, December 10, 2010 at 2:16:15 PM UTC

Unfortunately, as discussed before, the reason we can't show the whole translation graph is that normal database systems are really bad at this kind of operation. So the best we can do with the current system, with all possible optimizations, is a chain two degrees deep.
In theory it would be possible, but it would be slow as hell.

That's the reason why we've started building our own database server for our specific needs, to permit this.
So in the future it will be possible:
http://static.tatoeba.org/425123.html (it's a screenshot of the version I have on my computer; don't pay attention to how ugly it is). As you can see there, we show every translation, whatever the degree of depth.
And our database will also be able to detect duplicates on the fly. :)
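What "every translation, whatever the depth" amounts to is a plain breadth-first search over translation links. The adjacency list below is hypothetical; the point is that a relational database has no cheap equivalent of this loop, which is why the site is limited to two degrees today.

```python
from collections import deque

# Hypothetical translation links: sentence id -> directly linked sentences.
links = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}

def all_translations(start):
    """Every sentence reachable from `start`, at any depth (BFS)."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in links.get(queue.popleft(), []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)  # the sentence is not its own translation
    return seen
```

Sentence 4 is three links away from sentence 1, so a two-degree limit would miss it; the BFS finds it.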

sysko, December 10, 2010 at 6:52:36 AM UTC

For duplicates, I'm going to test another solution, to handle them in a semi-automatic way:

1 - I will replicate the Tatoeba database on my personal computer
2 - I will run a slower but safer script on it (one I couldn't run here without slowing down tatoeba.org for some hours) that will output all the modifications that need to be made to the database
3 - I will run that output script on tatoeba.org

This way it should be OK.

sysko, December 7, 2010 at 6:41:34 AM UTC

I think it's just that you don't have the right font, because on computers (I don't know exactly about Mac, but it's the case on Linux/Windows/etc.) character rendering behaves as follows:

1 - try to display the character with the font specified by the software
2 - if that font is not present, or the character can't be rendered by it, a set of rules picks some fallback fonts
3 - if no font can render the character, display a box

So even if the CSS used a font with no Malayalam characters, your OS would have used another one that has them.
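The three-step rule can be modelled in a few lines. The font names and coverage sets below are made up for illustration; real font matching is more involved, but the fallback order is the same.

```python
class Font:
    def __init__(self, name, chars):
        self.name = name
        self.chars = set(chars)  # characters this font can render

def pick_font(char, requested, fallbacks):
    """Return the name of the first font that covers `char`, else None."""
    for font in [requested, *fallbacks]:  # steps 1 and 2
        if char in font.chars:
            return font.name
    return None                           # step 3: render a box

# Illustrative fonts: one Latin-only, one covering some Malayalam.
arial = Font("Arial", "abcdefghijklmnopqrstuvwxyz")
meera = Font("Meera", "മലയാളം")
```

Requesting a Malayalam character with a Latin-only font selected falls through to the fallback, which is exactly why the page still renders despite the CSS.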

sysko, December 6, 2010 at 7:18:07 PM UTC

It's Unicode encoding; maybe you don't have the right fonts for Malayalam?

sysko, December 6, 2010 at 6:34:41 PM UTC

What about now?