sysko's Wall messages

sysko, 2011-02-14 06:00:42 UTC

In this case one should tag it "@change" (or a similar tag), or ask a trusted user to do so. Moderators then have a special page where we can see sentences that have been tagged for more than 2 weeks; after this period, moderators will apply the change if the user has still not answered.
So to ease the process you can ask to be a trusted user :)

sysko, 2011-02-13 20:04:48 UTC

Thanks to Papabear we now have more than 1000 English audio recordings http://tatoeba.org/sentences/show/477529 \o/

sysko, 2011-02-13 06:30:08 UTC

Are you referring to me? :p
But I do agree it's annoying sometimes. We will change it in the new version so that it always keeps the language you were using before, and the interface language will change only if you do that manually using the form at the top.

sysko, 2011-02-13 05:26:10 UTC

Actually, in swac-recorder you can specify this (in the "linguistic region" part of their interface); it then becomes part of the meta-information of the audio file, so with that we would be able to do so. In the future we can imagine having the same thing as swac-collection.org, i.e. several audio recordings per sentence, with the metadata displayed directly (in one way or another).

sysko, 2011-02-13 04:54:48 UTC

Yep, as said in the chinese-forums thread, I will do that; it's just that I didn't think about it the first time :)

sysko, 2011-02-10 14:53:18 UTC

Actually, in the past we already had some spammers, but as users noticed them nearly as soon as they appeared, most of the time I deleted them before you could actually see them :)

I also agree that email notification will not generate that much spam, but Trang doesn't seem to think so ^^

sysko, 2011-02-08 18:12:58 UTC

and me, except if you find someone else to code it :p

At the moment I'm looking, in the new version, at organizing "news feeds" (like a "new sentences in French" feed etc.) so that a user will be able to follow some of them and have them displayed on their homepage. So we can imagine a "need translation in" feed (though I agree that at first the tag idea is a good workaround).
This will be filled through the API (not yet present, but it's in the wonderful "new version" package :p), so developing a plugin for a browser/IM can then be delegated to others (so anyone will be free to develop a plugin for Pidgin if they want to :))
But yeah, I like your idea.

sysko, 2011-02-07 07:24:36 UTC

done :)

sysko, 2011-02-05 23:35:11 UTC

Yep I can also do that for other languages, I just need to know which pair of languages one wants :).

sysko, 2011-02-03 22:26:34 UTC

It would actually not be so easy, as for the moment we only store the sentence as "one block", without word-by-word decomposition. So it would require developing a segmenter. OK, for English it's not so hard (you can make one which gives "quite good" results by splitting on spaces and punctuation), and for Chinese I already have one, but for other languages it will be more complicated: languages with a lot of inflection (French, German), agglutinative languages (Turkish, Japanese), etc. Moreover, we would then need to work out how to store that in a way that can easily be used by the system you want. And even once you have that, without even speaking about context, your sentence will contain a lot of words (adverbs, verbs, etc.), so how do you tell the system that "beach" is relevant but "softly" is not? It requires some kind of "machine learning" algo.
So feasible? Hmm, yes, nothing is impossible in absolute terms, though I would not put this feature among the "easy to code" ones ^^
How much effort? Quite a lot, as we have nothing at all in the current system which can be used as a basis (except my Chinese segmenter). But developing tools to analyze the sentences is what I plan to do after the release of the new version of Tatoeba, so even though I hadn't thought about your idea, which is a good one, at that point I will try to see if I really can do it or not :)
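
To give an idea of what the "easy" English case could look like, here is a minimal sketch of a segmenter that splits on spaces and punctuation. The function name and the regular expression are only assumptions for illustration; this is not code from Tatoeba, and it does nothing about the harder "which words are relevant" question.

```python
import re

# Hypothetical helper, not part of Tatoeba: a naive English segmenter.
# A token is a run of letters/digits, optionally followed by apostrophe
# parts, so "didn't" stays in one piece. Everything else (spaces,
# punctuation) acts as a separator.
_WORD = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)*")


def segment_english(sentence):
    """Return the lowercase word tokens of an English sentence."""
    return [match.group(0).lower() for match in _WORD.finditer(sentence)]


if __name__ == "__main__":
    print(segment_english("She walked softly along the beach, didn't she?"))
```

This only covers ASCII letters, so it would already need more work for French or German, which is the point made above.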

sysko, 2011-01-27 10:48:18 UTC

I agree too, but just so you know, you can also activate it for input fields by going to about:config and changing the value of layout.spellcheckDefault to 2 :)

sysko, 2011-01-26 16:36:15 UTC

It can, I've written one, but it's not as easy as it seems, because we have strong performance constraints (both in time and space), as I'm sure you don't want Tatoeba to be slow as hell for even 7 hours.
So we need to do it directly in the database. The script is not long, but it needs to update a lot of things, and for some weird reason there's a bug in the script that I can't easily reproduce, as it seems to be a bug in MySQL itself, and it happens only when there's concurrency (as I can't stop tatoeba.org to run the script). So that's why. I've already spent too much time trying to figure out "why"; moreover this problem will disappear with the new version, so I would rather spend time trying to finish this new version asap than continue to search for how to fix the script.

sysko, 2011-01-26 03:53:03 UTC

To "please don't feel like I'm always asking for more and more features" I will answer "please don't feel like I'm not implementing them right now because you're annoying me" :p We really appreciate it when users tell us what they would like to see in Tatoeba. We're not doing this for money anyway, so the main reason we develop stuff comes down to "will users enjoy it?", and by telling us what you want you're making the task easier :p So feel free to share all the ideas you have with us, even the craziest ones, otherwise I will not know whether I'm able to code them or not :)

sysko, 2011-01-21 21:47:21 UTC

But for sure, as we support some languages Google doesn't (Shanghainese, Lojban, etc.), our data can help them for these languages, since right now they just can't recognize them.

sysko, 2011-01-21 21:46:06 UTC

Actually, as far as I know, Google already uses a "corpus" with billions of words to train its detection system, so on a far larger scale than ours. It's just that they don't have access to some data we have (for example, I'm really unlikely to add a sentence in Spanish, though sometimes a very, very short French sentence can look like a Spanish one, and we know that the size of an input will be between 1 and, let's say, 100 words, etc.; Google can't do that).

sysko, 2011-01-21 21:41:35 UTC

So I think that with a "single sentence" optimized algo, with some heuristics (e.g. putting a higher "weight" on languages in which you've already contributed), we should be able to provide a lower "false positive" rate.
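
As a rough illustration of the "higher weight for languages you've already contributed in" heuristic, here is a small sketch. Everything in it is an assumption made for the example: the function name, the `prior_strength` parameter, the scoring formula and the sample numbers. It takes raw per-language scores from some detector and boosts each one by the share of the user's past contributions in that language.

```python
def pick_language(scores, user_counts, prior_strength=0.5):
    """Pick a language from detector scores, biased by the user's history.

    scores:      dict of language code -> raw detector score (higher is better)
    user_counts: dict of language code -> number of sentences the user
                 has already contributed in that language
    prior_strength: how much the contribution share may boost a score
                    (hypothetical tuning knob)
    """
    total = sum(user_counts.values())
    best_lang, best_score = None, float("-inf")
    for lang, score in scores.items():
        # Share of the user's past contributions written in this language.
        share = user_counts.get(lang, 0) / total if total else 0.0
        weighted = score * (1.0 + prior_strength * share)
        if weighted > best_score:
            best_lang, best_score = lang, weighted
    return best_lang


if __name__ == "__main__":
    # A very short French sentence that the raw detector slightly
    # mistakes for Spanish (made-up scores).
    raw = {"spa": 0.52, "fra": 0.48}
    print(pick_language(raw, {"fra": 900, "cmn": 100}))
    print(pick_language(raw, {}))
```

With the made-up numbers above, a user who mostly writes French gets "fra", while a user with no history gets the raw detector's "spa".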

sysko, 2011-01-21 21:39:58 UTC

Yep, but it does work for 90% of the cases, so it's a "better than nothing" solution. Don't worry, as soon as we have time we will manage to replace it with home-brewed code, as Google's algos are made for big pieces of text, not for "10 words or less" input.

sysko, 2011-01-21 21:37:50 UTC

The CakePHP version will still live for a month at least, I think, so if you already have it, or if it's something not so complicated to do, yep, you can submit the patch.

For the new version, yep, as with the current one: I really believe in open source, even for the code of websites, so it will be open to everyone under the same licence as the current one (AGPL). And as I really think we're not going to move to another framework (except if an ASM framework exists :p, though I'm not sure it could be faster than a gcc-optimized binary ^^), as soon as we get something stable and documented, part of our "duty time" will move from "coding" to "setting up tools to allow collaborative work on the code itself too, not only on the data", since I myself wished I could have had some "open website" to study in order to learn how "real" websites are made. I really hope that in the near future Tatoeba will be not only a place to build an open corpus, but also a place to build open tools to exploit the corpus (and the website is part of them), for the greater good of common knowledge.

sysko, 2011-01-21 12:05:55 UTC

you'll see ;-)

sysko, 2011-01-20 19:37:32 UTC

Ludoviko asked me in a thread about the progress of the new version, and yep, it's true I don't communicate a lot about it (except always saying "well, it will be possible in the new version"), so basically:

The sentence database is more or less finished and debugged (memory leaks etc.) thanks to the help of Qdii. So what's already possible with it:
* view all the translations of one sentence (even the 20th-degree translations), so this will no longer be a problem in the next version of Tatoeba
* real-time detection of duplicates (even when correcting a sentence), so here again this will no longer be a problem
* performance improvement: if we only talk about sentence+translation retrieval (not HTML generation etc.) it's damn fast now, something like 10,000 times faster for the complex queries we make in Tatoeba, so here again I hope it will fix the "well, I don't plan to add this or that feature because the server is already too busy" problem
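
For the curious, the first two points in the list above can be pictured with a toy in-memory model. This is only a sketch under my own assumptions: the class and method names are made up, it is written in Python rather than in the real database's language, and it only checks duplicates when a sentence is added, not when one is corrected. Translations of any degree are found with a breadth-first search over the translation links, and duplicates are caught with a lookup keyed on language plus whitespace-normalized text.

```python
from collections import deque


class SentenceStore:
    """Toy in-memory sentence store; all names here are hypothetical."""

    def __init__(self):
        self._by_key = {}   # (lang, normalized text) -> sentence id
        self._links = {}    # sentence id -> set of directly linked ids
        self._next_id = 1

    def add(self, lang, text):
        """Add a sentence, or return the id of an existing duplicate."""
        key = (lang, " ".join(text.split()))  # collapse extra whitespace
        if key in self._by_key:
            return self._by_key[key]
        sentence_id = self._next_id
        self._next_id += 1
        self._by_key[key] = sentence_id
        self._links[sentence_id] = set()
        return sentence_id

    def link(self, a, b):
        """Record that sentences a and b are direct translations."""
        self._links[a].add(b)
        self._links[b].add(a)

    def translations_by_degree(self, start):
        """Map every reachable sentence id to its translation degree."""
        degree = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for other in self._links[current]:
                if other not in degree:
                    degree[other] = degree[current] + 1
                    queue.append(other)
        del degree[start]  # the sentence is not a translation of itself
        return degree


if __name__ == "__main__":
    store = SentenceStore()
    en = store.add("eng", "Hello.")
    fr = store.add("fra", "Bonjour.")
    de = store.add("deu", "Hallo.")
    store.link(en, fr)
    store.link(fr, de)
    print(store.translations_by_degree(en))
    print(store.add("eng", "Hello.") == en)
```

In this example the German sentence is a 2nd-degree translation of the English one, and adding "Hello." a second time returns the existing id instead of creating a duplicate.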

Framework / website itself

I've said it on my Twitter: we're moving from CakePHP to a C++ framework. For some time we thought about Django, but well, it seems biptaste and I are not made for it, and Django would not have solved the performance issue. It took us some time to become confident with the new framework and to set up the general architecture. But now it's starting to work fine and we're going to reimplement the pages (that's one of the reasons I didn't talk too much about it: for the moment the progress we've made was mainly code stuff, so nothing "visual", though it represents a huge part of the work).
For the geeks among us who are wondering whether writing Tatoeba in C++ with an "obscure" framework is some stupid decision that only increases the development time, I will say no, in fact not so much.
1 - We were going to learn a new framework anyway, and I think an important part of the development time is spent understanding the framework rather than "typing code". And even if the community for this framework is very small, it has one subtle advantage: the main/only developer is really accessible, so we can ask him questions directly about our own problems, he's very responsive, and we can be sure of the reliability of his answers (after all, it's his project :p)
2 - As I've said, one of the problems of the current Tatoeba is performance. We're not making money on Tatoeba, I'm still a student and Trang used to be one, so we don't have money to spend on renting servers etc. As we're developing it for free, for fun, how much time we spend developing it is not an issue; the real issue is hardware. So doubling the performance means we can handle twice as many users without needing new hardware or spending more money (though for the moment we're kindly hosted by the French FSF, but they don't have unlimited resources, and we don't want to abuse them). So writing it in C++ now will ensure we will not need to do that in the future, and I think in the long term it will save us time/money.

I also need to add that from now on every feature we add will have an API counterpart, which will ease the development of third-party applications using Tatoeba.

So that's where we are so far: not much "visual" stuff to present, but the engine is well on its way, and I think it was the most difficult / "unrewarding" part. Oh, I forgot to say we also spent some time setting up some collaborative tools, such as a Redmine, a git repository, etc., on my server.

For the moment, I think the lack of "no limit in depth translation", "real-time duplicate detection" and "speed" are the 3 current major problems of Tatoeba. So the first release of the new version will maybe bring nothing else new (except some small improvements here and there), so don't expect huge differences or brand-new features. But after that we will be able to have a more frequent release cycle, which will introduce new features one by one and integrate all the requests all of you made on the wall / by email.
So real-time search / the problem with tag autocompletion will maybe be part of the first new release, but if they're not, they will appear in the weeks following this first release.

For more technical details, I think I really need to start a "what's behind Tatoeba" blog, to talk about geek stuff ^^
