Wall (6,292 threads)
I think it would be better to use <textarea> instead of <input> when editing sentences, because of very long sentences.
The word-wrapping would be welcome, and using the same font for both the display and editing would be great.
And, at least in Firefox on Ubuntu, the spell checker is activated only on the textareas.
I agree too, but just to say that you can also activate it for inputs by going to
about:config and changing the value of layout.spellcheckDefault to 2 :)
Dear Esperantists! I have a question about numbers:
Does one write, e.g.,
"Estas la 9-a horo" or
"Estas la 9a horo" (without a hyphen)?
I found both variants in Tatoeba, even from the same author.
Sizinle tanıştığıma çok memnun oldum.
The meaning of this Turkish sentence in English:
I am delighted to have met you.
You can add sentences here:
Welcome to Tatoeba, dilbilimciler. Please go to "Contribute" > "Add sentences" (in the navigation bar near the top) to contribute sentences to Tatoeba.
I am pleased to have met you too, it's really very good. I am Turkish too, like you. My native language is Turkish.
For what reason can't duplicates be removed automatically by a script?
It can, and I've written one, but it's not as easy as it seems, because we have strong performance constraints (both in time and space), as I'm sure you don't want Tatoeba to be slow as hell for even 7 hours.
So we need to do it directly in the database. The script is not long, but it needs to update a lot of things, and for some weird reason there's a bug in the script that I can't easily reproduce; it seems to be a bug in MySQL itself, and it only happens when there's concurrency (as I can't stop tatoeba.org to run the script). So that's why. I've already spent too much time trying to figure out "why"; moreover, this problem will disappear with the new version, so I would rather spend time trying to finish this new version ASAP than keep searching for how to fix the script.
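To make the "it needs to update a lot of things" point concrete, here is a minimal in-memory sketch in Python of what merging duplicates involves. It is not the actual Tatoeba script: the data layout (a dict of sentence id to (lang, text), plus a set of translation links as id pairs) and the name `merge_duplicates` are assumptions made purely for illustration.

```python
from collections import defaultdict


def merge_duplicates(sentences, links):
    """Merge sentences sharing the same (lang, text).

    sentences: dict of id -> (lang, text)        (hypothetical layout)
    links: set of (id_a, id_b) translation pairs (hypothetical layout)
    Returns (kept_sentences, remapped_links).
    """
    # Group ids by their (lang, text) key.
    groups = defaultdict(list)
    for sid, key in sentences.items():
        groups[key].append(sid)

    # Keep the lowest id in each group; map every other id onto it.
    remap = {}
    for ids in groups.values():
        keeper = min(ids)
        for sid in ids:
            remap[sid] = keeper

    kept = {sid: key for sid, key in sentences.items() if remap[sid] == sid}

    # Re-point every link at the keepers, dropping self-links.
    new_links = set()
    for a, b in links:
        a, b = remap[a], remap[b]
        if a != b:
            new_links.add((min(a, b), max(a, b)))
    return kept, new_links


sentences = {
    1: ("eng", "Hello."),
    2: ("fra", "Bonjour."),
    3: ("eng", "Hello."),  # duplicate of 1
    4: ("deu", "Hallo."),
}
links = {(1, 2), (3, 4)}
kept, new_links = merge_duplicates(sentences, links)
print(sorted(kept))       # [1, 2, 4]
print(sorted(new_links))  # [(1, 2), (1, 4)]
```

A real script would presumably have to apply the same remapping to everything else that references a sentence id, and do so while the site stays online, which is where the concurrency trouble described above comes in.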
All right, if it's fixed in the next version anyway, then go ahead. I just thought it is nonsense to find and remove duplicates manually, as some members did.
** Feature suggestion: Sentence linker **
Most of my contributions are translations from English and Danish into Icelandic. I regularly happen upon sentences that seem somewhat familiar.
For example, I will come across a sentence in Danish that I've already translated out of English. Since only direct translations are filtered out in the current language lists, these potential duplicates will occur.
I assume it would be fairly simple (and may even be on the agenda) to implement in the new database version, a list of non-direct links between sentences to allow contributors to link these.
Tatoeba day #2
(...and I'm posting the content below for those who are not able to view the post)
Tatoeba day #2 (Jan 23rd, 2011)
We have decided on a date for our next Tatoeba day, and it will be January 23rd, 2011. Just like the last Tatoeba day, it will start at 0:00 and will end at 23:59 (France time).
There will be 2 objectives for that day:
- Banners for Tatoeba
- Quality of the corpus
# Banners for Tatoeba
For those who are not sure what I'm talking about, what we call a "banner" is basically an image that represents a website. Let's say you are a fan of Tatoeba and have a personal blog. Because you are very supportive, you would like to put a link to Tatoeba on your blog, and perhaps you would like the link to be graphical rather than simple text. Well, we don't really have any standard image for people to use in such situations and we'd like to create some.
So we're organizing a little contest for our contributors with artistic/design skills (or who just want to give it a try): create banners for Tatoeba!
Everyone can participate and if you want to, here's what to do or to know:
1) Make 2 images with the following sizes: 88x31 and 392x72. You may re-use our current logo in it, but don't hesitate to make another (better) logo if you are inspired.
2) Send your 2 images to firstname.lastname@example.org with the title "Tatoeba banners" and indicate your Tatoeba username in the email. I will reply to confirm that I have received them.
3) The current deadline is January 23rd, 13:00 (France time). However, we will need at the very least 5 submissions, otherwise it's not very interesting :P I do hope there will be more than 5, but if there haven't been enough submissions, we will extend the deadline to the next Tatoeba day: February 20th, same time.
4) Shortly after the deadline I will publish the banners that were sent to me. Then Tatoeba users will have one week to vote for their favorite banners. I'm not sure yet how we will do the votes but I will write about it in due time. NOTE: I don't want people to be influenced by "who made the banner" during the vote so I will not indicate this information when I first publish the banners. I will ask you as well to keep your work "secret" the whole time (don't show it to anyone and don't say "I did this one").
5) Once the votes are over, I will reveal the participants and announce the winner who will then be venerated forever by everyone for his/her talent :)
# Quality of the corpus
Since Tatoeba is open for everyone to contribute, one of its biggest problems is quality. Contributors aren't necessarily professionals and we inevitably have many sentences that contain mistakes or don't sound right. For our 2nd Tatoeba day, we will be focusing on quality. The goal of the day will be to check, correct and improve as many sentences as possible.
We've got plenty of sentences tagged "@Needs Native Check", "@change" and "@check", and it would be really nice to remove as many of these tags as possible to replace them with the 'OK' tag. We've also got plenty of orphan sentences that desperately need parents.
If you want to participate, don't be shy and join our IRC channel #tatoeba on January 23rd (cf. our help page to learn how to use IRC, in case you are not familiar with IRC). This way you can discuss in real time with other members about what to do with a sentence (among other things)!
The next day, I will be publishing the following stats:
- The number of sentences modified.
- The number of comments posted.
- The number of sentences tagged 'OK'.
- The approximate number of sentences adopted.
I'll be honest though, things might be a little disorganized at first. I don't know yet how many people intend to participate and I don't know yet how we will coordinate with each other to work efficiently together. But this second Tatoeba day will be the occasion to experiment and hopefully figure something out :)
NOTE: You may want to read these articles to learn a bit more about how we handle quality, even though they are not up-to-date anymore:
So Tatoeba day has started :) There has been an update for this purpose: http://blog.tatoeba.org/2011/01...22nd-2011.html
You'll mostly want to use these links for this day:
=> http://tatoeba.org/activities/adopt_sentences [Adopt sentences]
=> http://tatoeba.org/activities/improve_sentences [Improve sentences]
* Adopt sentences *
Wonderful to have this page! Now we have e.g.
166984 sentences in English, 89934 unadopted ones, so 77 050 adopted sentences.
153739 Japanese, 148574 unadopted, so 5 165 adopted sentences.
69662 Esperanto, 321 unadopted ones, so 69 341 adopted ones.
57269 French, 13516 unadopted ones, so 43 753 adopted ones.
43179 German, 10 unadopted ones, so 43 169 adopted ones.
25547 Spanish, 205 unadopted ones, so 25 342 adopted ones.
Sum of unadopted sentences now: 253 866, so about 1306 unadopted sentences in other languages - no real problem.
Probably far fewer unadopted sentences this evening :-)
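For anyone who wants to re-check the subtraction, here is a small Python sketch that recomputes the adopted counts from the figures quoted in this post. The variable names are made up for illustration; only the numbers come from the post.

```python
# (language, total sentences, unadopted sentences), as quoted in the post
counts = [
    ("English", 166984, 89934),
    ("Japanese", 153739, 148574),
    ("Esperanto", 69662, 321),
    ("French", 57269, 13516),
    ("German", 43179, 10),
    ("Spanish", 25547, 205),
]
total_unadopted = 253866  # site-wide figure quoted in the post

# adopted = total - unadopted, per language
adopted = {lang: total - orphans for lang, total, orphans in counts}

# Whatever is not covered by the six listed languages belongs to the others.
listed_unadopted = sum(orphans for _, _, orphans in counts)
other_unadopted = total_unadopted - listed_unadopted

for lang, n in adopted.items():
    print(f"{lang}: {n} adopted")
print(f"Unadopted in other languages: {other_unadopted}")
```

The results agree with the figures given in the post, including the 1306 unadopted sentences left for the other languages.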
252 327 orphan sentences now, -1539,
229 orphan Esperanto sentences now, -92.
So we made a little progress.
In Esperanto there are now 5 orphan sentences left, http://tatoeba.org/epo/activiti...ces/epo/page:1 ; perhaps you are willing to adopt one of the last sentences and modify it somehow.
Currently there are 70 679 sentences in Esperanto; minus 5 orphans, so 70 674 adopted sentences.
Overall, the number of orphan sentences has dropped to 251 178. In English there are now 167 325 sentences, of which 88 181 are orphans, so 79 144 adopted sentences; 8500 more than Esperanto.
Just to give an idea of the pace at which the problem of orphan sentences is being solved, and of where we are now. At this moment there are:
247 487 orphan sentences (total),
148 532 Japanese,
84 937 English,
12 772 French (sum of these three: 246 241),
97 Spanish (sum of these first nine: 247 305; 182 others)
5 Norwegian (Bokmål)
1 Toki Pona,
0 Old East Slavic,
0 Egyptian Arabic,
about 95 orphan sentences in other languages.
If you want to adopt sentences and, perhaps, correct them: http://tatoeba.org/epo/activities/adopt_sentences/ and select your language.
Alright, so Tatoeba day #2 is over :)
Unfortunately, only 3 brave people sent me their banners so the deadline is delayed to next Tatoeba day on Feb 20th. (Where are the artists? :P)
In any case, thank you to everyone who participated today and dropped by our IRC channel! I'll be publishing the stats tomorrow :)
Welcome to Tatoeba, koitsu.
Why do the "adopt sentences" and "improve sentences" pages not have the same features as the "browse by language" page? It would be great to have the filtering options there too.
I would also suggest that when opening the "adopt sentences" page, it should suggest sentences in the languages you (probably) speak, and not always suggest Chinese sentences to non-Chinese speakers. As users currently can't define which languages they speak (which I think would be nice for the future), one could at least suggest sentences in the language of the user interface one is using (the dropdown in the upper right).
Maybe we could "guess" the languages and their level of proficiency by the number of sentences added or owned by a user.
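As a rough illustration of the "guessing" idea, here is a minimal Python sketch that counts a user's sentences per language and suggests the languages above some threshold. The function name `suggest_languages`, the input format and the threshold of 50 are all assumptions for the sake of the example, not anything Tatoeba actually does.

```python
from collections import Counter


def suggest_languages(sentence_langs, known_langs, threshold=50):
    """Return languages worth suggesting for a user's profile.

    sentence_langs: language codes, one per sentence the user added or owns
    known_langs: languages already listed in the profile
    threshold: minimum sentence count before suggesting (arbitrary value)
    """
    counts = Counter(sentence_langs)
    return sorted(
        lang
        for lang, n in counts.items()
        if n >= threshold and lang not in known_langs
    )


owned = ["deu"] * 120 + ["epo"] * 60 + ["slv"] * 3
print(suggest_languages(owned, known_langs={"deu"}))  # ['epo']
```

A count only says that a user is active in a language, not how well they speak it, so this would work better as a prompt the user can accept or ignore than as an automatic setting.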
Both the "adopt sentences" page and tags pages allow you to filter by language. There is little use in filtering sentences based on available translations or audio.
Rather than guessing user's level of proficiency it might just be easier to store their preferences.
Well, I agree with you that one does not need filtering by audio for editing, but for learning one does.
Also, imagine someone like me who speaks e.g. Esperanto and wants to adopt as many sentences as possible, but only if there's a German (my native language) translation available to double-check; or someone who would like to directly translate each sentence he/she just adopted into English, which would be of little use if there is already a translation into English.
Filtering by language is so basic that I would not consider it a "filter" from the user's point of view, because without it it wouldn't be possible to work at all on sentences in rare languages: e.g. how would I find the two Sanskrit sentences out of 700'000 total?
The guessing thing would just be a suggestion to support the user in choosing the languages he/she speaks. Maybe it could display a "do you want to add Slovenian to your languages" note in the profile editor if the user has exceeded a certain number of sentences. This could encourage users not to leave their profile empty, which is annoying for other users working together on certain sentences. So of course this should be combined with advanced preferences.
PS: please don't feel like I'm always asking for more and more features. I understand that these have different priorities, but I feel the need to share my ideas so we can discuss them and find the most useful ones.
To "please don't feel like I'm always asking for more and more features" I will answer "please don't feel like I'm not implementing them right now because you're annoying me" :p We really appreciate it when users tell us what they would like to see in Tatoeba. We're not doing this for money anyway, so the main reason we develop stuff comes down to "will users enjoy it?", and by telling us what you want you're making the task easier :p So feel free to share all the ideas you have with us, even the craziest ones, otherwise I will not know whether I'm able to code them or not :)
My appreciation for your work! :D I wouldn't be here if I didn't like it. I also didn't think of the performance issue Trang mentioned. Also, sometimes it feels like some websites' innovations are just solutions to problems that didn't exist before, but they are nevertheless good and useful. I'll just keep sharing my ideas, as I trust your ability to judge what priority a feature has.
Keep going all of you three!
OK, we seem to lay a slightly different emphasis on the scope of Tatoeba. While it is a very valuable language learning resource, I think we should be keeping the contribution and query interfaces separate, and leave the learning aspect for later or even completely to external services through the API.
I believe that at present there is no need for such a feature. As for recommending languages for one's profile, this just seems like a feature for its own sake or a solution looking for a problem. I don't think setting one's profile is such a difficult task.
Feel free to keep throwing out ideas. :-)
Yes, learning should work more like a flashcard box. Right, a somehow standardized part for the language information would be cool. Have you already thought about how to design this?
What do you mean by a design for language information?
If you're talking about meta information on the sentences then, yes, I've actually given that a considerable amount of thought. The tags are a step in that direction, but it's still too far from an optimal solution. I consider this one of the most interesting steps toward making the database more useful.
This is a matter for a different topic, but my main thoughts are on how one can best tag sentence structures. Specific problems arise in complex, long or even multi-sentence sentences and when a single sentence fits multiple structures (e.g. when two grammatical forms are the same as is the case with "read").
Actually, filtering by audio on the adoption page would not be very useful because we don't record audio for orphan sentences :P
The translation filter may be useful, but I'm not sure if it's worth the cost. I mean, there would be performance issues if we implemented that filter. And I think that the absence of this filter can be compensated by the fact that once you adopt, you can see all the translations.
- You read a sentence, you feel it's fine, you adopt it.
- It shows the translations, you look at them.
- If there is no translation in the language you want, you can add a translation right away.
- If there are translations, you can check if they match with your sentence.
As for displaying by default sentences in the language of the interface, that can be done. It's actually what we are doing when you click on "Browse by language" so there's no reason we can't do this for "Adopt sentences" :)
Well, you're right. However, it's just one more click. ;)
Could someone who understands Japanese please look through these German sentences, which range from unnatural to incorrect German? Without knowing Japanese, one can unfortunately only guess which corrections the Japanese sentence allows:
Here perhaps "aufheben" instead of "behalten"?: http://tatoeba.org/deu/sentences/show/590121
Ludoviko asked me in a thread about the progress of the new version, and yep, it's true that I don't communicate a lot about the progress (except for always saying "well, it will be possible in the new version"), so here is basically where we are.
The sentence database is more or less finished and debugged (memory leaks etc.) thanks to the help of Qdii. Here is what's already possible with it:
* viewing all the translations of one sentence (even the 20th-degree translation), so this will no longer be a problem in the next version of Tatoeba
* real-time detection of duplicates (even when correcting a sentence), so here again this will no longer be a problem
* performance improvement: if we only talk about retrieving sentences and translations (not HTML generation etc.), it's very damn fast now, something like 10 000 times faster for the complex queries we make in Tatoeba, so here again I hope it will fix the "well, I don't plan to add this or that feature because the server is already too busy" problem
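For readers curious what the first two points amount to, here is a toy Python sketch. It is only an illustration under stated assumptions, not the new Tatoeba database (which, per this post, is written in C++): the class `SentenceStore` and all of its methods are hypothetical names. Indirect translations are found with a breadth-first walk over the translation links, and duplicates are caught at insert time with a lookup keyed on (language, text).

```python
from collections import deque


class SentenceStore:
    """Toy in-memory store; all names here are hypothetical."""

    def __init__(self):
        self._text = {}    # id -> (lang, text)
        self._index = {}   # (lang, text) -> id, for duplicate lookup
        self._links = {}   # id -> set of directly linked ids
        self._next_id = 1

    def add(self, lang, text):
        """Add a sentence, or return the existing id if it is a duplicate."""
        key = (lang, text)
        if key in self._index:
            return self._index[key]
        sid = self._next_id
        self._next_id += 1
        self._text[sid] = key
        self._index[key] = sid
        self._links[sid] = set()
        return sid

    def link(self, a, b):
        """Record that a and b are direct translations of each other."""
        self._links[a].add(b)
        self._links[b].add(a)

    def translations(self, sid, max_depth=None):
        """Breadth-first walk: map each reachable id to its link distance."""
        depth = {sid: 0}
        queue = deque([sid])
        while queue:
            current = queue.popleft()
            if max_depth is not None and depth[current] >= max_depth:
                continue
            for neighbour in self._links[current]:
                if neighbour not in depth:
                    depth[neighbour] = depth[current] + 1
                    queue.append(neighbour)
        del depth[sid]
        return depth


store = SentenceStore()
eng = store.add("eng", "Hello.")
fra = store.add("fra", "Bonjour.")
deu = store.add("deu", "Hallo.")
epo = store.add("epo", "Saluton.")
store.link(eng, fra)
store.link(fra, deu)
store.link(deu, epo)

print(store.add("eng", "Hello.") == eng)     # True: duplicate caught on insert
print(store.translations(eng))               # {2: 1, 3: 2, 4: 3}
print(store.translations(eng, max_depth=2))  # {2: 1, 3: 2}
```

Detecting duplicates when a sentence is corrected would need the same lookup on the edited text, plus a decision about how to merge the two entries; the sketch leaves that out.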
Framework / website itself
I've said it on my Twitter: we're moving from CakePHP to a C++ framework. For some time we thought about Django, but well, it seems biptaste and I are not made for it, and Django would not have solved the performance issue. It took us some time to become confident with the new framework and set up the general architecture. But now it's starting to work fine and we're going to reimplement the pages (that's one of the reasons I didn't talk much about it: for the moment the progress we've made was mainly code stuff, so nothing "visual", though it represents a huge part of the work).
For the geeks among us who are wondering whether writing Tatoeba in C++ with an "obscure" framework is a stupid decision that only increases development time, I will say no, in fact not so much.
1 - We were going to learn a new framework anyway, and I think an important part of development time is spent understanding the framework rather than "typing code". Even if the community for this framework is very small, it has one subtle advantage: the main/only developer is really accessible, so we can ask him questions directly about our particular problems; he's very responsive and we can be sure of the reliability of his answers (after all, it's his project :p).
2 - As I've said, one of the problems of the current Tatoeba is performance. We're not making money on Tatoeba, I'm still a student and Trang used to be, so we don't have money to spend on renting servers etc. As we're developing it for free, for fun, how much time we spend developing it is not an issue; the real issue is hardware. Doubling the performance means we can handle twice as many users without needing new hardware or spending more money (though for the moment we're kindly hosted by the French FSF, but they don't have unlimited resources and we don't want to abuse them). Writing it in C++ now will ensure we don't need to do that in the future, so I think in the long term it will save us time and money.
I also need to add that all the features we add from now on will have an API counterpart, which will ease the development of third-party applications using Tatoeba.
So that's where we are so far: not much "visual" stuff to present, but the engine is already well on its way, and I think it was the most difficult / least rewarding part. Oh, I forgot to say we also spent some time setting up some collaborative tools, such as a Redmine, a git repository etc., on my server.
For the moment, I think "no limit in translation depth", "real-time duplicate detection" and "speed" are the 3 major problems of Tatoeba. The first release of the new version will maybe bring nothing more than that (except some little improvements here and there), so don't expect huge differences or brand-new features. But after that we will be able to have a more frequent release cycle, which will introduce new features one by one and integrate all the requests all of you made on the wall and by email.
So real-time search and the problem with tag autocompletion will maybe be part of the first new release, but if they're not, they will appear in the weeks following this first release.
For more technical details, I think I really need to start a "what's behind Tatoeba" blog, to talk about geek stuff ^^
So that means tickets about things like optimization and "the small things" should be fixed in the C++ framework; is it still worth sending a patch for the CakePHP framework? Where will the SVN of the new framework be? >:D I like to learn new things, that is why I am asking.
The CakePHP version will still live for at least a month, I think, so if you already have a patch, or if it's something not so complicated to do, yep, you can submit it.
As for the new version, yep, as with the current one: I really believe in open source, even for the code of websites, so it will be open to everyone under the same licence as the current one (AGPL). I really think we're not going to move to another framework (except if an ASM framework exists :p, though I'm not sure it could be faster than a gcc-optimized binary ^^), so as soon as we get something stable and documented, part of our "duty time" will move from "coding" to "having tools that permit collaborative work on the code itself, not only on the data". I myself wished I had some "open website" to study to learn how "real" websites are made. I really hope that in the near future Tatoeba will not only be a place to build an open corpus, but also a place to build open tools to exploit the corpus (and the website is one of them), for the greater good of common knowledge.
So you'll be using a newly developed database and writing the webpage in C++?
Not to say it won't work, but I hope you'll be keeping good backups. ^_^;
you'll see ;-)
Thank you very much, sysko. I am very glad that rather soon we will be able to "view all the translations of one sentence (even the 20th degree translation)" and have "real time detection of duplicate (even when correcting a sentence)". Really just marvellous!