
Thanks to Jisoo Yi, we now have audio recordings in Korean, making it the 13th language to have audio recordings.

[Some messages in this thread have been moved to private messages, as that is the place to file complaints about other users' behaviour.]

For example, someone collecting all the audio file links and trying to download them all at once.
Or trying to make a copy of Tatoeba by automatically downloading every webpage from the first sentence to the last at a very high pace.
Actually, any kind of "lots of requests in a short amount of time".

I can try, yes. In the future I'm thinking of offering a page with plenty of stats about Tatoeba, updated say once a week, with fairly advanced statistics, to give a better grasp of the project.
In principle it could even happen fairly soon; right now I'm still improving tatowiki for a first beta.

Friendly reminder:
If you need to do some massive requests on tatoeba.org:
1- first contact me, so that I can see with you whether there's a better way to do this, or whether I can produce an output file formatted so that you can work with it offline
2- keep in mind we're not Google or Facebook: put a delay in your script between each request (2 seconds or more would be very nice).
Thank you.
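
To illustrate point 2, here is a minimal sketch of a script that waits between requests. It is only an example under assumptions: the function names (http_get, fetch_politely) are hypothetical, and the 2-second default simply mirrors the delay suggested above. The fetch and sleep functions are parameters so the loop can be tried without touching the site.

```python
import time
import urllib.request
from typing import Callable, Iterable, List


def http_get(url: str) -> bytes:
    """Fetch one URL and return the raw response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
    delay_seconds: float = 2.0,  # assumption: the 2 seconds suggested above
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch the URLs one at a time, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Requests are sent strictly one after another, never in parallel, so the server sees at most one request per delay period from the script.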

Your idea is in line with mine :) See my post on the other thread.
I know I wrote it quite fast, and I'll try to rewrite it in a better and clearer way.

I didn't see the other thread,
so to answer here about the "split Tatoeba in two" idea, which my "rating" proposal relies on:
On the interface you will see nearly no difference: same website, same forms. It's just that your sentences would automatically be classed as "not reliable" or "maybe reliable" depending on your "rank" (which you first set yourself, and which administrators can force in case of abuse).
1 - The difference comes afterwards: the not-reliable sentences would never appear in search results or random sentences unless you explicitly look for them. (We could still imagine an option saying "I want to see them in my random/search results even when I haven't asked for them", plus a little icon showing which category a sentence belongs to: not reliable/medium/premium.)
2 - You would also have a way to tell them apart in the export files.

On that specific point, I don't think WE (Tatoeba / Tatoeba's users) should decide what is a language and what is not, where to split, etc., because either you don't speak the language and so you can't decide, or you're a native speaker and you may be wrong about what are dialects and what are different languages.
A typical example is China: most people would put all the Chinese languages in the group "dialects of Mandarin", though the Shanghainese dialects, for example, do not belong to it.
Conversely, you would have people from two nearby cities listing their dialects as two different ones though they belong to the same language.
And even if you find the tiny percentage of users who are knowledgeable enough in both linguistics and that language, the "where to split" question will IMHO lead to endless discussion. And even if at some point people find a compromise on how to split, what if two weeks later a new user arrives and proposes, with insightful reasons, a new split?
For that reason I think it's better to rely on a standard: ISO 639-3 (alpha-3) for languages, and later ISO 639-6 (alpha-4) for dialects (note that in that case dialects will simply be meta information; they will not replace the current language code but rather complement it).
Using a standard for the splitting also has the huge advantage of making Tatoeba's data easily usable by other projects.
While thinking about these problems, I had something in mind.
First, a table in the user information listing the languages they speak, with these fields for each one:
language1 level native? (I put native as a separate field and not as the highest level, since someone with a very low level of education can be native and yet care less about grammar than a foreigner who has worked on that language for 30 years)
That way we would be able to run computations on it. (Of course moderators could "force and lock" someone's level if that person overestimates it.)
When your level is below a given threshold, your sentences will go into some kind of waiting room, visible only to registered users and not exported, so that it will not harm the quality of the corpus while still letting people work in pairs.
For example, I don't speak very good Shanghainese, but it's easier for me to try to translate a sentence and then ask a native Shanghainese speaker to help me improve it than to find a native Shanghainese speaker who speaks French. Right now that would harm the quality of the corpus, because while I wait to find a correct translation, my approximate translation would be visible/exported.
Of course advanced contributors who are native speakers, or who have a level higher than X, would be able to do a first validation to move that sentence into the main database.
For the other users, sentences would go directly into the main database (basically the one we have right now), which would be of medium reliability. That one would get exported, as it's reliable enough for most uses. If at any time a sentence seems weird, it could be moved to the "not reliable" database.
After that, corpus maintainers would be able to take a sentence from the main database and, with one button, put it in a "certified" part. Of course we can discuss whether it needs one corpus maintainer, or two, or three, or more complex rules to "validate" a sentence.
Don't worry about the "separate" databases; I already know how to implement that easily in the back office.
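
To make the routing above more concrete, here is a rough sketch. It is not the actual back-office implementation; every name, the 0-5 level scale and both thresholds are assumptions made up for the example, and placing a sentence by level alone is just one reading of the rule described above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOW = "not reliable"   # waiting room: registered users only, not exported
    MEDIUM = "medium"      # main database, exported
    HIGH = "certified"     # promoted by corpus maintainers


# Hypothetical thresholds on an assumed 0-5 self-assessed scale.
MIN_LEVEL = 3        # below this, sentences start in the waiting room
VALIDATOR_LEVEL = 4  # the "level higher than X" needed to validate


@dataclass
class LanguageSkill:
    language: str                       # ISO 639-3 code, e.g. "wuu"
    level: int                          # self-assessed, may be locked by moderators
    native: bool                        # separate flag, not the top level
    advanced_contributor: bool = False


def initial_tier(skill: LanguageSkill) -> Tier:
    """Decide where a new sentence by this user lands."""
    return Tier.LOW if skill.level < MIN_LEVEL else Tier.MEDIUM


def can_validate(skill: LanguageSkill) -> bool:
    """Advanced contributors who are native or above level X may validate."""
    return skill.advanced_contributor and (
        skill.native or skill.level > VALIDATOR_LEVEL
    )


def is_exported(tier: Tier) -> bool:
    """Only the waiting room is kept out of the export files."""
    return tier is not Tier.LOW
```

Keeping native as its own flag means the level can still be compared with a threshold, while validation rights can depend on either one.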
I think this is a good compromise, as
1 - we will be able to distinguish 3 levels of reliability
2 - we will still be able to keep the "rare and hard to prove" sentences
3 - it still lets people collaborate while keeping the quality intact
4 - it avoids the problem of "idiocracy", as it would be qualified users who decide (and of course promotion to advanced contributor is given to people you trust enough not to be big liars, and even if they are not flawless, the middle database is not advertised as "100%" reliable anyway)
5 - corpus maintainers will be able to slowly create a very "pure/premium" database at their own pace
So we would have
1 - a low-reliability database (huge amount of data, may be useful for statistical computation etc.)
2 - a middle-reliability one (not that huge but reliable enough for "learners"; I mean, I don't care if I see a missing "s" in a French sentence as long as it lets me see the translation of a Chinese sentence; it may also contain "rare" sentences for which proof is hard to find)
3 - a high-reliability one (small amount of data but at least you're sure about it; useful for teachers etc. creating material, for example, keeping in mind that you may not find "rare / rare dialectal" constructions in it)
Also, for the last "level of reliability", it's open to discussion whether you trust by default (i.e. a corpus maintainer can put a sentence in it alone, and you need to prove it's not correct to remove it), which is maybe suitable as we're supposed to have high trust in corpus maintainers,
or whether it's "guilty by default", and the corpus maintainer needs to prove the existence of the sentence (books, grammar references etc.) for others to validate it.
What do you think?

As Sacredceltic said, with the new Tatoeba the goal would be to allow any kind of metadata to be added to all the sentences, so in the future the JMDict indices may fall into this category.

Actually, here is the situation, with some context:
At the very beginning, the Tanaka corpus (the original 150,000 English/Japanese pairs) was handled by another project, and one of their tasks was also to maintain that index for the JMDict/Edict project.
In order to be sure Tatoeba always had the latest corrections made to that corpus (for that part I'm no longer sure who took the initiative, as it was something handled by Trang), they (or we, or both) said that it would be better if Tatoeba became the "home" of that corpus and responsible for the corrections. The people doing the corrections at that time were also in charge of the Japanese indices, so of course they did not want their work split across two interfaces/websites.
So Trang imported the indices into Tatoeba and coded the tools needed to edit them.
So these indices are actually already there, and as you can see, it never had any impact on Tatoeba itself. They continued to do their job, and we ours; it's just that now, for the "sentence" database, we're working on the same one.

So is this something I can handle on my side by changing the CSS / using CSS3 and providing a font?

that's quite a good piece of news!

Actually your example with the HTML entity is not the same as the previous ones, as you use a thin space (U+2009) and the other sentences use a narrow no-break space (U+202F).
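
A quick way to check which space a sentence really contains is to list its code points. This is just a small sketch using Python's standard library; the function name and the sample strings are made up for the example.

```python
import html
import unicodedata


def unusual_spaces(text):
    """List every space separator other than the plain ASCII space."""
    found = []
    for index, char in enumerate(text):
        if char != " " and unicodedata.category(char) == "Zs":
            found.append((index, f"U+{ord(char):04X}", unicodedata.name(char)))
    return found


# The HTML entity &thinsp; decodes to the thin space ...
from_entity = html.unescape("Quoi&thinsp;?")
# ... which is a different character from the narrow no-break space.
with_narrow = "Quoi\u202f?"

print(unusual_spaces(from_entity))
print(unusual_spaces(with_narrow))
```

The two strings look almost the same on screen, but the first one reports U+2009 and the second U+202F.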

If they are native apps, they can certainly embed their own fonts.
Otherwise, as was pointed out above, with CSS3 (which was not yet a standard when the Tatoeba code was developed) it is now also possible to "embed" the font in the CSS code. That should solve the problem, by embedding for French a font capable of rendering this space.

On Android too, because the operating system has a much more limited set of fonts, and as they are made by US companies, they didn't pay attention to that...

Sure, that slipped my mind while posting the answer :)

Actually that can be done with something like this (on my side, I'll run some tests on the new Tatoeba code):
<span class="xxx">I'm eating an apple</span>
with xxx being the language's ISO code, so that it would be possible to override the font on a per-language basis. I think that's the most elegant solution.
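
As an illustration only, this is how such markup and the matching CSS rule could be generated on the server side. It is a sketch, not the Tatoeba code: the function names are hypothetical and "ExampleFont" is a placeholder font name.

```python
import html


def wrap_sentence(text, lang_code):
    """Wrap a sentence in a span whose class is the language's ISO code."""
    if not (lang_code.isascii() and lang_code.isalpha() and lang_code.islower()):
        raise ValueError(f"unexpected language code: {lang_code!r}")
    # Escape <, > and & so the sentence cannot break the markup.
    return f'<span class="{lang_code}">{html.escape(text, quote=False)}</span>'


def font_override_rule(lang_code, font_family):
    """Build a CSS rule that overrides the font for one language."""
    return f'span.{lang_code} {{ font-family: "{font_family}", sans-serif; }}'


print(wrap_sentence("I'm eating an apple", "eng"))
print(font_override_rule("fra", "ExampleFont"))  # placeholder font name
```

With one class per language, a single extra CSS rule is enough to change the font for, say, French without touching the other languages.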

Sure. Of course I'm not a bot, and I will not blindly delete every single message as soon as it contains a username (well, it's actually "move to private messages", so that you can amend your post). But I think this way the rule is clearer and easier to remember, and after all I'm the one applying it, so I hope people trust me enough to know I'll do my best not to apply it excessively.
For example, of course I'm not going to remove a message from the wall if people are talking about a userscript and one user says "It seems to me the user XXXX found a workaround for this problem", etc.

Well, let me explain.
I don't plan to ban a user just because some people have a problem with him or her. Over the 5 years I've now been on Tatoeba, more than a dozen users have been reported to me as having problems discussing or collaborating with others. I prefer to take the time to find a solution through dialogue. Unfortunately the wall is not made for that, because it creates an "I need to reply fast" reflex.
The discussions that got censored would have needed my opinion as administrator (or the administrators' opinion, if in the future there are several of us), and I'm not able to follow the pace of the wall discussions. So I'm not forbidding these discussions on Tatoeba; I'm forbidding them in the place where they are not appropriate.

I don't think 90% of users want the wall full of messages that concern only two people talking to each other, especially when it's going nowhere, neither of the two being ready to be convinced by what the other is saying.