saeb
2014-05-18 16:26 - 2014-06-12 09:06
*Reintroduction, ramblings about Tatoeba, and my plans during GSOC*

Hey,

It's been quite a while. Some of you might still remember me, I hope (my IRC handle is lool0, btw). For those who don't: I joined Tatoeba back in 2010 and contributed a bit to the Arabic corpus.

Things were very different then. It was packed with students from all over the world who were really enthusiastic about free and open source software and data, about other cultures and languages, and about the possibilities that this project represented and stood for. Excitement was in the air. We'd have fun discussions in the comments or on IRC that ended up as sentences. We'd create series of them, and people would get busy translating those pieces of work into all the languages we knew. We paired up and added sentences for each other. We'd even have a "Tatoeba day" every once in a while, where we'd improve the entire corpus quite a bit and set new stat records. We had weekly updates, the FSF servers weren't slow and sucky, we didn't have hundreds and hundreds of sentences that needed to be corrected and changed (yeah, we had some relatively active moderators), and Trang and Sysko were part of *us*.

At some point, though, Trang disappeared, Sysko got pretty busy and semi-disappeared, and other users either graduated and got pretty busy, or got themselves into pretty bitter arguments with other users and decided Tatoeba wasn't really the place they thought it was and just wasn't worth it. The ownership and user moderation models really started to show their defects: a good chunk of the contributions that came in would just be locked in by the accounts of users who would never check back again or who left for good. There just weren't enough moderators to take care of all the corrections, and comments would get lost for months. Using tags for this purpose was pretty taxing, not everyone had access to them, and tagged sentences would just pile up and not get dealt with anyway. The development and deployment infrastructure also showed its weakness.
There were no updates for *months*. Things would break or crash and no one would be available to fix them for days on end. The quality of the English and Japanese corpuses was also never dealt with; the shadow of the Tanaka Corpus still haunts us to this day, it seems. I grew pretty disillusioned. Most of my corrections to the Arabic corpus at that time got lost and were never applied. I decided to retire my account at some point, but remained in the shadows.

There was a small group on IRC that was interested in Japanese, anime, politics, free software, etc. We weren't really affiliated with Tatoeba and did our own thing. They got me interested in Linux and programming, and we occasionally added sentences to Tatoeba through a bot that I wrote. We also had a great parser that qdii wrote, which I managed to easily make accessible for searching over IRC.

We only started getting involved with Tatoeba again quite recently, after the crash. I had a bit of an argument with Trang at the time about how barely anyone has access to the server, and how the codebase isn't documented well enough for anyone unfamiliar with it to step in and do some sort of maintenance on it. Trang and Sysko were kinda responsive, tbh. We were kinda able to get them to use their personal e-mails to sign up to the new dev mailing list and the newly migrated GitHub repo. We approached alanf and he was very happy to get more involved, even though I'm pretty sure he has a soul-crushing job.

Things *are* starting to look better. We applied to GSOC and got accepted. This is very significant. Not only will we actually get some sort of development going (and hopefully all the students will stick around), but we'll also have some funding and some visibility in the open source community. And hopefully we'll get a better server soon enough, and use the FSF server for something else (it has really been sucking these past couple of months: lots of stolen CPU cycles and terrible disk I/O).
And y'know what, Tatoeba deserves better, and if no one gets involved it will just keep on deteriorating like it did in the past. I can't stress this enough.
I've been pushing for a rewrite for quite some time. I wrote a couple of toy prototypes, discarded them, etc., but never really found enough motivation or time to carry it through. GSOC was a perfect opportunity. At least there will be enough public pressure, review, and support to push me through such a big project. You can read my proposal here: https://dl.dropboxusercontent.c...al-public.html
In a nutshell, the major problems that I think we should tackle at the moment are:
-The codebase should be tested, documented, and written in a higher level language and framework. This would at least lower the entry barrier for development.
-Posting corrections in the comments and having a handful of moderators apply them by hand just isn't gonna work. We need a new framework so corrections can be added in a special field and applied automatically if the owner does not respond at all.
-The current model of having a special person that hands out user statuses is also severely limiting. There should be a web of trust model where users can confirm a user's competence in a language and his competence in understanding the context and workflow of the project. So for example there would be a special button where trusted users and moderators can hand out votes and at a certain point that user would automatically become a trusted user in a certain language.
-The current database system and structure doesn't allow us to display more levels of translations, and it's already the part that's straining the server. We absolutely need a graph structure or database. Treating the data as flat tables just isn't gonna work or scale in the long run.
-Deduplication should be done on the fly. This is a technical implementation issue and should be possible with a hash of the sentence. This is currently a major maintenance burden and a source of countless inconsistencies in the database (strange disappearing links, audio, etc.).
-We really need an API and this should get fixed ASAP. If we ever hope to have a mobile app for tatoeba, we'll have to have an API.
-We should have a system in place to request sentences and help people pair up their efforts.
-We should have a system where trusted users and moderators can flag a certain user as a vandal and his account would be disabled.
-We should have some sort of experimental rating system in place, at least to gather some data at first. If a statistical Bayesian model worked for spam filters, I don't see why it can't work for Tatoeba.
-Audio contribution should be embedded in the interface. If we ever hope to have more audio contributions, this just has to be done.
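
To make the deduplication point a bit more concrete, here's a rough sketch of what "a hash of the sentence" could look like. This is not the actual pytoeba code: the names `sentence_key` and `SentenceStore`, the choice of SHA-1, and the whitespace/Unicode normalization are all assumptions made just for illustration.

```python
import hashlib
import unicodedata


def sentence_key(text, lang):
    """Return a stable hash identifying a sentence within one language.

    Hypothetical helper: SHA-1 and the normalization rules are assumptions.
    """
    # Normalize Unicode and collapse whitespace so trivially different
    # spellings of the same sentence map to the same key.
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split())
    payload = (lang + "\x00" + normalized).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class SentenceStore:
    """Toy in-memory store that checks for duplicates at insert time."""

    def __init__(self):
        self._by_key = {}   # hash -> sentence id
        self._next_id = 1

    def add(self, text, lang):
        """Return (sentence_id, created); reuse the id of an existing duplicate."""
        key = sentence_key(text, lang)
        if key in self._by_key:
            return self._by_key[key], False
        sentence_id = self._next_id
        self._next_id += 1
        self._by_key[key] = sentence_id
        return sentence_id, True
```

In a real database the hash would presumably live in an indexed, unique column, so the duplicate check happens when the sentence is added instead of in a separate cleanup job that has to merge links and audio afterwards.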

My proposal tries to deal with all that by setting out a blueprint and schedule for reimplementing current features and adding the ones I discussed above (and more). I know it's pretty ambitious, and I hope you'll follow the development. I'll be posting about what I managed to achieve at the end of every week, and you can follow the commits here: https://github.com/loolmeh/pytoeba. We'll also be using GerritHub for code review (you can probably get an account there, search for the repository, and add it to your watch list). liori and Trang will be mentoring me, and I hope I won't disappoint any of you. Hopefully I'll be able to deploy something in a couple of weeks and ask you guys to wreck it for me, so stay tuned :)

----------------------------
intro message:
http://tatoeba.org/eng/wall/sho...#message_19536
progress reports:
week 1: http://tatoeba.org/eng/wall/sho...#message_19654
week 2: http://tatoeba.org/eng/wall/sho...#message_19768
week 3: http://tatoeba.org/eng/wall/sho...#message_19821
-----------------------------
work log:
http://en.wiki.tatoeba.org/arti...work-log-lool0

patches pending review:
https://review.gerrithub.io/#/q...s:open+pytoeba

merged patches:
https://github.com/loolmeh/pytoeba

project template for testing:
https://github.com/loolmeh/pytoeba-dev
Trinkschokolade
2014-05-18 17:34
Welcome back, I guess? ☺
saeb
2014-05-18 17:46
Thank you. Glad to be (sorta) back.
marcelostockle
2014-05-18 20:47
Welcome back.
Did you know that the reason my avatar is a snake is that, together with you, Shishir and a couple of other members, we cover a significant part of the Chinese zodiac?
saeb
2014-05-19 06:14
Hah it sorta crossed my mind. Would be cool to get the 4 mythical creatures represented too :D
odexed
2014-05-18 23:35
Hello and welcome!

Don't leave our Arabic corpus please :)
Inego
2014-05-19 01:30
The list of problems you're going to solve looks so cool that it resembles an election program. And as such, it seems simply unbelievable :)
Inego
2014-05-19 01:33
The only dubious item here is the embedded audio contribution. I fear that will lead to an influx of bad quality audio to the site.
CK
2014-05-19 02:48 - 2014-05-19 02:49
+1

Other sites that have online recording often have poor audio quality and lots of background noise.

It's also quite easy to use the Shtooka Recorder to get good quality recordings, properly named, very quickly.

Anybody interested in helping the Tatoeba Project by adding audio should read http://bit.ly/shtooka and watch the videos showing you how quickly high-quality recordings can be made.
al_ex_an_der
2014-05-19 02:54
If it were simple, more people would use it.
Inego
2014-05-19 03:34
al_ex_an_der is right. The problem is not with Shtooka itself (which is a very convenient and efficient piece of software), but with the in/out "interface" of audio recordings in Tatoeba. To use Shtooka to its full extent, one has to be a sort of geek. It is difficult for a "normal user" to prepare the list of sentences for pasting into Shtooka, then to convert Flac or Wav or Ogg files to MP3, then to upload them somewhere, then to ask admins to import the files to the project.
Maybe solving *these* problems will help more than making embedded audio recording.
saeb
2014-05-19 06:18
Audio quality is a separate issue imho and could be solved by having all the audio that gets added show up on a pending list, where trusted users in that particular language and moderators can add votes or whatever to them so they would get approved at a certain threshold and become available on the site.
Inego
2014-05-19 06:55
If so, you will have to implement an option for advanced users to use Shtooka + uploading instead of the basic embedded audio recording. I doubt that embedded recording offers advanced editing instruments like noise reduction.
saeb
2014-05-19 07:28
Actually, I'm not totally sure how I wanna handle the quality issue at the moment. Maybe we can find or come up with a minimal JavaScript-based solution for basic cleanup. Maybe we can provide an option to upload audio through the interface. Maybe we need to hack the C++ swac-recorder so it can upload directly to the website through the API. The thing I wanna focus on during GSOC, at least, is to have the basis in the database and the API for getting audio posted in directly. We can figure out the harder parts later.
saeb
2014-05-19 07:32
And yes, the current way of adding audio is pretty terrible even for admins behind the scenes: records have to be added manually through SQL scripts, and the files converted and copied to the static directory. There's not even a web interface for batch insertion.
neron
2014-05-19 11:26
Hi, saeb
I am very glad about what you are doing here. All those problems became evident to me after just a few days on Tatoeba. Nice to see someone with much more experience is about to start to actually do something about it. As a PHP developer, I am tempted to join you, but I can't at the moment. However, you should know: if you just start fixing the foundation of this beautiful building, you will soon get an army of developers, since I consider Tatoeba one of the world's best projects, and it deserves much better than what it is able to be right now. Trang can't do almost everything alone. I say this with huge respect and admiration for Trang.
saeb
2014-05-21 06:56
Thanks for the kind words.

> Nice to see someone with much more experience is about to start to actually do something about it.

I'm actually not experienced or anything ^^", but I'll try to do my best (after all, GSOC was made for students to work on real projects that may or may not be at their level).

> As a PHP developer, I am tempted to join you, but I can't at the moment.

It's a Python project actually ^^"", but you're welcome to contribute small patches to the current codebase. There's lots of low-hanging fruit that you can probably work on without too much of a time commitment: https://github.com/Tatoeba/tatoeba2/issues
saeb
2014-05-21 22:16
I set up a work log for more transparency:
http://en.wiki.tatoeba.org/arti...work-log-lool0

You can also check out the review process here:
https://review.gerrithub.io/#/q...s:open+pytoeba

Changes that got merged in will appear here:
https://github.com/loolmeh/pytoeba

You can also help test it by using this repo:
https://github.com/loolmeh/pytoeba-dev
Hybrid
2014-05-22 05:53
"So for example there would be a special button where trusted users and moderators can hand out votes and at a certain point that user would automatically become a trusted user in a certain language."

I like this idea. There could be a button like this: "Vouch for Tom's competence in German!"
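
For what it's worth, the mechanics behind such a button could be as small as the sketch below. Everything in it is an assumption for illustration only: the `TrustRegistry` name, the threshold of 3 vouches, and the rule that only users already trusted in that language may vouch. Moderator overrides, revoking votes and so on are left out.

```python
from collections import defaultdict

VOUCH_THRESHOLD = 3  # hypothetical number of vouches needed per language


class TrustRegistry:
    """Toy model of per-language trust earned through vouches."""

    def __init__(self, threshold=VOUCH_THRESHOLD):
        self.threshold = threshold
        self._trusted = set()              # {(user, lang)}
        self._vouches = defaultdict(set)   # (user, lang) -> set of vouchers

    def grant(self, user, lang):
        """Seed trust directly, e.g. for the first moderators of a language."""
        self._trusted.add((user, lang))

    def is_trusted(self, user, lang):
        return (user, lang) in self._trusted

    def vouch(self, voucher, user, lang):
        """Record one vouch; promote the user once the threshold is reached."""
        if voucher == user:
            raise ValueError("users cannot vouch for themselves")
        if not self.is_trusted(voucher, lang):
            raise PermissionError("voucher is not trusted in this language")
        # A set means repeated clicks by the same voucher count only once.
        self._vouches[(user, lang)].add(voucher)
        if len(self._vouches[(user, lang)]) >= self.threshold:
            self._trusted.add((user, lang))
        return self.is_trusted(user, lang)
```

With this, pressing "Vouch for Tom's competence in German!" would boil down to a call like `registry.vouch("Hybrid", "Tom", "deu")`, and Tom becomes trusted in German once enough distinct trusted users have done the same.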
al_ex_an_der
2014-05-22 08:08 - 2014-05-22 09:02
De omnibus dubitandum est. Everything must be doubted.

Let's be realistic. A user can really be trusted, if at all, only in a language he or she has been constantly using in everyday life for many years. For most of us, this means our native language. I don't believe we can change this by pressing a button.
Selena777
2014-05-22 09:28
That works for most people, but not for all. Some people have more than one "native" language that they use in their everyday life. Some people don't use their native language in their communication anymore. Also, a user can create an account and declare any language as "native" (for example, the language of their ethnic origin, which they identify with even if they aren't really fluent in it), etc. So having a list of "trustworthy users" in every language is not a bad idea.
The only thing is, I think it should be created by advanced contributors and verified by the corpus maintainers for each language.
saeb
2014-05-23 18:52
> I don't believe we can change this by pressing a button.

Does this really matter? If enough people who know German really well think that your German is too atrocious for any of your contributions to be included in the main corpus, we shouldn't allow any of your past or future German contributions to be included. Period. However, they can still help you get some sentences included in the main corpus one at a time.
Eldad
2014-05-23 21:08
Welcome back, saeb. It's quite a pleasure to see you back among us.
saeb
2014-05-23 22:28
Thank you. Glad you're still around too :)
sacredceltic
2014-05-24 18:33
>If we ever hope to have a mobile app for tatoeba, we'll have to have an API.

Why not just make a responsive site that fits smartphones and tablets?
A few modern frameworks enable just that.
Geeks love APIs. Users just love consistent use across devices...and nowadays, mobile devices are just the norm, not the exception.
qdii
2014-05-24 20:34
OK here are some reasons:
- interface: mobile applications' interfaces are often more practical than websites. Even mobile versions of websites are terrible. I don’t know any website I like to surf on my smartphone.
- offline version of Tatoeba
- bandwidth: we usually pay more for internet data on a mobile phone than on a wifi connection, so it helps if we can spare users the cost of downloading all the website's assets when they just want to do a search.
- responsiveness: click on search, wait x seconds, have your result. That's a website. Mobile apps can show results almost instantaneously.
- updates: you could choose when you want to update your sentence database; few people really need the latest version. On a mobile app, you could have an "update" button you can click when you are connected to wifi.

Also, by creating a simple API, we let people who have too little knowledge of PHP/MySQL help us. Mobile app developers, desktop developers, etc. can join: Geeks love APIs… and we need geeks.

I have not even talked about the discrepancies between the different versions of the browsers and how they interpret CSS and render it on a mobile phone screen. Coding a one-size-fits-all website is a nightmare.
sacredceltic
2014-05-24 20:47
> I don’t know any website I like to surf on my smartphone.

Well then, you know nothing about the web. Start with Twitter...

> offline Tatoeba

Only the Pygmies are still offline these days. I don't know what backwater you live in, but everywhere else the network is available and access speeds multiply every year...
qdii
2014-05-25 18:16
Actually, the question is: where do YOU live, that you can always be online? I get disconnected all the time: whenever I am on a plane, or when I am on a train and it enters a tunnel. Even walking down a street, I get disconnected from 3G and fall back to EDGE. The drop in quality of service makes pages take seconds to load.

Sure enough, mobile bandwidth has increased over the years, but so has the volume of data you download from the internet: people no longer just read websites, they stream YouTube videos on their phones; they not only send emails, they also place video calls to their relatives through Skype. And as far as I know, there is no country in Europe that offers a true, unlimited, fast data plan. Not in the US either. I live in Barcelona, I pay about 80€ a month for my data plan, and I still go over it.

Fun fact: about two years ago, a Microsoft employee used the same argument as you did: he said on Twitter that nowadays everybody was always connected to the internet, so it was only logical that the Xbox should be too. A couple of days later, Microsoft had to issue a public apology (link: http://www.ign.com/articles/201...e-deal-with-it).

Offline is obviously more convenient: loading from disk is a lot faster than loading from the network. Especially when you are on a PLANE or on a TRAIN.
sacredceltic
2014-05-25 19:04
If you're paying 80€ a month for your data plan, you're just a sucker...
I travel in Europe every week and that takes me a maximum of 4 hours twice a week. That might sound like a lot, but that's just 2.40% of my week and, even then, if it's by train, wifi is available...
The rest of the time, I'm on 3G/4G and that costs me under 60€...

You're actually overblowing the need for offline to a ridiculous point.
By the time you develop your brilliant APIs, our children will have 100G...
Either you think connected or you're a thing of the past.
Impersonator
2014-05-25 19:44
Desktop applications are not the only thing an API can be useful for.

An API can also be used to embed Tatoeba data into other online applications.
sacredceltic
2014-05-25 20:16
I don't care. I use Tatoeba... What else?
qdii
2014-05-25 21:08
Alright sacredceltic, you live in a world where you are connected to a 4G connection all the time and where trains have wifi. I just don’t.

We need an API because it lets different sorts of developers join the project. And we need more people who actually code and fewer people who live in perfect worlds.
sacredceltic
2014-05-25 21:42
I would happily code for Tatoeba, but certainly not on a last-century framework and with a last-century mentality...
saeb
2014-05-25 23:45
What's so last century about having an API? Why do you think companies like Google attract lots of developers around their tools and products, while companies like Amazon struggle with the platforms they have, like the Kindle? It's pretty simple: you wanna ease a developer's life and make creating apps as easy as possible. You get them hooked on your stuff. And you do that by doing most of the work for them, so they don't have to implement all this functionality from scratch. What's the point of making every developer who wants to use a small part of Tatoeba reimplement the entire stack inside his app? It's a waste of his time and a loss for us.

Also, see all of these single-page dynamic JS interfaces? You can forget about them. Server-generated pages just aren't made for them, and you will suffer to make them work without a clean API.

You can't predict how people will use your site or how your site will grow over time. Suppose you wanted to get more servers and keep them all in sync at some point. You can of course choose to suffer and do load balancing on every part of your stack: the database, the application server, the webserver, etc. Oh wait, you have different schemas on some servers? Too bad, you'll have to completely drop them or suffer to keep them in line with the others. Oh, one of your webservers is out of order? Too bad, you lost that node. Oh, you don't know how to make your load balancer ignore that node when it's not available? Too bad, you have to do it manually. Compare that to having an API that can (by definition) readily take incoming data from other instances of itself (they all speak the same unified language). You'd only have to load balance this one part of your stack. People can go ahead and make some monstrosity of a schema that you never even imagined could exist, and they can still be compatible with your server.

Suppose I only wanted to display some example sentences in a dictionary I'm working on. Is it fair for me to have to take Tatoeba's data every week, filter it, import it into a database schema, implement complete search functionality on top of that, and maintain all this mess weekly? What if I wanted to allow my users to add their sentences too? Is it fair for me to have to manually filter out the new sentences and send them by mail to Tatoeba's devs for mass import every week? This is not scalable and not maintainable; indeed, it's the very definition of "last century".
sacredceltic
2014-05-26 10:32
We don't need all that sophistication just to insert, update and read sentences from the service.
If these functions had been designed as web services to be reused by other applications, that would probably be great, but I don't think that's what most users are waiting for.
What we need is just a no-nonsense UI that fits both mobile and non mobile devices. Simple.
HTML5 and CSS3 are suited well enough for that.

I have been here for years hearing about wonderful new versions with fantastic functionalities that never materialised. I can't see why this would change.
saeb
2014-05-26 13:15
> We don't need all that sophistication just to insert, update and read sentences from the service.

It's not particularly hard or terribly sophisticated to make an API, as long as you don't make a spaghetti mess in your controller and you factor out most of the functionality into helper functions. Then an API is all about just exposing those helpers. Besides, there are great API frameworks now that give you a free API just by writing the database schema...
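
To show roughly what I mean by "exposing those helpers", here's a toy sketch using only the Python standard library. It is not the pytoeba API: the `/sentences` path, the `lang` and `q` parameters, the `find_sentences` helper and the hard-coded sentence list are all made up for the example.

```python
import json
from urllib.parse import parse_qs

# Stand-in for the database; purely illustrative data.
SENTENCES = [
    {"id": 1, "lang": "eng", "text": "Hello world."},
    {"id": 2, "lang": "fra", "text": "Bonjour le monde."},
    {"id": 3, "lang": "eng", "text": "Good morning."},
]


def find_sentences(lang=None, query=""):
    """Helper that server-rendered pages and the API can both call."""
    query = query.lower()
    return [
        s for s in SENTENCES
        if (lang is None or s["lang"] == lang) and query in s["text"].lower()
    ]


def api_app(environ, start_response):
    """Minimal WSGI app: GET /sentences?lang=eng&q=hello returns JSON."""
    if environ.get("PATH_INFO") != "/sentences":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    params = parse_qs(environ.get("QUERY_STRING", ""))
    results = find_sentences(
        lang=params.get("lang", [None])[0],
        query=params.get("q", [""])[0],
    )
    body = json.dumps({"results": results}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

You could try it locally with `wsgiref.simple_server.make_server("", 8000, api_app).serve_forever()`. The point is only that the endpoint is a thin wrapper: all the real logic stays in the helper, so the HTML views and the API can't drift apart.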

> What we need is just a no-nonsense UI that fits both mobile and non mobile devices. Simple.

Right. And having an API simplifies making such a UI as well. Ajax requests that improve usability just don't mix well with rigid server pages; they need API endpoints to make your life easier.

> HTML5 and CSS3 are suited well enough for that.

Simple enough? Yes. Usable? No. When it comes to usability, you either write javascript or watch your users flee.

>I have been here for years hearing about wonderful new versions with fantastic functionalities that never materialised. I can't see why this would change.

Y'know, if we sponsored trang or sysko to work on these wonderful new versions full-time they would've been out a long time ago. It's never been about competence, time or complexity of the task.

I'm sponsored to do this. This is a job. I get evaluated every week. If I don't deliver, tatoeba doesn't get any money and I don't get any money, and tatoeba will probably not get sponsored next year. So there's lots of ass on the line too.
sacredceltic
2014-05-24 20:51
>Coding a one-size-fits-all website is a nightmare.

Because you're using frameworks from the last century. There are ways to abstract all of that away. That's what I do...
saeb
2014-05-24 20:35
Hi,
Half of my proposal is just about this: an Angular UI (it uses a reimplementation of Bootstrap in native AngularJS directives). It sure is a convenient cross-platform solution (although it won't ever have a native feel the way apps written for the platform would, and it sure won't have the same access to hardware capabilities). But again, we can't predict every use case of Tatoeba, and we should give people the ability to create native apps that fit their requirements on any platform without too much complexity.