saeb, 17 July 2014 04:02:22 UTC (edited 17 July 2014 04:32:21 UTC)

** GSOC Progress Report weeks 4-8 **

I'm sorry again that I haven't been writing more reports and keeping you
guys in the loop. Basically, I was in a position where I had broken everything
pretty badly, and I didn't want to write a report every week that just said
"I'm a complete failure and nothing works". Anyway, the backend is almost done
and I'm in a better position now, so let's talk about pytoeba's current status
and future, shall we?

Done:
- User model centralization
- User private messaging
- Wall posts
- User voting
- User status/language web of trust
- python-social-auth integration
- haystack integration
- xapian-haystack indices for sentences, users, messages, comments
- tastypie integration for models
- basic tastypie integration for the python api (still needs work)
- major sql optimizations to the python api
- major numpy optimizations to the graphing backend on CPython
- alternative dependencies that are all pure python with no C required.
So we're now basically compatible with any python flavor or platform
with no extra compilation.
- iplus1 integration

Not Done:
- views (dropped during week 4 completely)
- js interface (there's still time for this, and it will be the top priority
for the rest of the time I have during gsoc)
- internationalization (well, no interface yet)
- multilingual stemming for haystack/xapian indices/queries
(unfortunately this needs an invasive patch into haystack's/xapian haystack's
code, so this won't be covered during gsoc)

So let's compare this with the original goals in my proposal which I'll copy
verbatim and mark with ✔ for complete support, ✘ for no support and ~ for
partial support:

Replicate the current functionality of the website:
✔ CRUD operations on sentences, comments, and wall posts
✔ logging of all operations on sentences
~ user profiles, status, and permissions
~ browsing of sentences, tags, and users
✔ messaging
✔ searching for sentences
✘ integration of transliteration
✘ internationalization of the entire website
Expand the website's functionality:
✔ a corrections framework
~ a corrections page with latest corrections that have been accepted, rejected, forced, or are overdue
~ uploading of audio
✔ advanced queries
✔ fetching translations up to the nth depth
✔ a user proficiency web of trust
✔ a user status web of trust
~ speed blocking/deletion of malicious users
~ a full fledged forum that can also be viewed as a wall
~ requesting words and translations
✘ sentence subscription system
✘ a user following system
✘ a whole new customisable user page that acts as a notification system for the latest actions on subscribed sentences, latest actions from followed users, etc...
✘ rss and atom feeds for most pages that are essentially list views
✔ Enhance the database schema by using graph algorithms and/or integrating a graph database
Fully cache database queries and templates using memcached, and integrate an outside caching system such as varnish for server generated pages
✔ Build a set of python functions to manipulate the database and perform all functionalities that constitute an inner api, and use it in all view code and in future modules that extend the website
~ Build a RESTful api on top of the python api and django orm through tastypie
✘ Rewrite the UI to be crossplatform, dynamic, client-side, API compliant code.
~ Write a battery of tests that cover all the codebase
~ Provide administrative scripts for tasks such as importing/exporting db fixtures, importing/exporting csvs, adding new languages, extracting/updating mo/po files, search indexing, deployment on a development or production machine using vagrant/ansible
✘ Provide Help bot scripts that clean up corrections, among other things
✘ Provide an interface to all the admin scripts and help bot scripts where they can be executed manually or a cron job can be added and tweaked using the interface for them

This was done in ~8 weeks and it's still not close to covering everything. So I'm
essentially 2 weeks behind, actually 3, considering I'm still working on the
backend this week too. Testing stopped being a top priority, and code review
kind of died since liori was on vacation for about 2 weeks and had a family thing
for another week recently. The whole "huge patch" thing also made him less likely to give me comprehensive feedback, so what I got on the recent patches was a bird's eye view of certain things in the codebase that he doesn't like. To really understand what happened, I'll have to walk you through all the things I decided to do after week 3, which was basically obsessing over adding more features, optimizing everything to death, and doing more interesting things like reading the codebases of third-party libraries instead of just making a website. Which brings us to...

- The Road to Hell (featuring lool0, not dante):

The road to hell is interesting. At the gates you're told to abandon all hope,
and in the deepest parts you watch Satan eating Brutus, Cassius, and Judas. And
at each level, a sin with a poetic punishment.

- The Road to Hell and the feature craze:

Very early on I decided that testing was making everything too slow, and decided
to keep adding as many features as possible with no regard for whether anything
even worked. You can imagine this led to major breakage, and it did, to the
point that I made so many changes to the schema that every feature I developed
wasn't even manually testable: to manually test anything I had to isolate the
feature and use the last stable commit from week 3. I only started recovering
from this last week, but most tests still need to be readapted before they
start passing. Only 6 of them pass as of the time of this report.

- The Road to Hell and the optimization craze:

You'd expect that having libraries that do everything for you would speed up
development so much that you'd be done with everything in a week and then
you'd be sleeping for the next 11 weeks. But noooo... enter the optimization
craze. First, let's dispel a great myth called the 'reusable app'. In
the python/django world, there's this idea that it's great to invest time
in making these pythonic, generic, pluggable apps that can work with any other
django app out of the box. There are tons of these around and they can probably
do anything you can think of. You just need a few lines of configuration
and you're done; you get all of their features for free. This is in fact the
basis of the contrib package in django's core: the auth system, the admin
system, etc...
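As a rough sketch of what "a few lines of configuration" means in practice (the app name and setting below are made up for illustration, not from pytoeba or any real package), plugging a reusable app into a Django project usually amounts to something like:

```python
# settings.py (sketch; 'some_reusable_app' is a hypothetical package name)
INSTALLED_APPS = [
    'django.contrib.auth',          # ships with django: users, groups, permissions
    'django.contrib.contenttypes',  # required by many reusable apps
    'some_reusable_app',            # the third-party app being plugged in
]

# many reusable apps expose a handful of settings like this one
SOME_REUSABLE_APP_OPTION = True
```

That really is the whole integration story in the happy case, which is exactly why the pattern is so seductive.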

- The Road to Hell and Reusability, Abstractions, and the ORM:
But let me tell you why python and django are always accused of not
being able to scale for shit. It can be summed up in one phrase: 'leaky abstractions'.
Django tries too hard to be pythonic and ends up wrapping the database in
a very pythonic abstraction known as the "ORM". What this does in the long
run is that the more generic an app is, the worse the queries it will make for
you. At the heart of the problem is this thing called the ContentType system.
It allows things like "generic foreign keys", where you can have
a relation to a table dynamically without ever having to write and generate
the schema for it. This means more joins for every query, more work for your
database, worse response times, and less scalability. It's the same problem
you get with multi-table inheritance (which, btw, some reusable apps I looked
at during the last 4 weeks actually used): death by joins.
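To make that cost concrete, here's a small stdlib-only simulation (sqlite3, with made-up table names, not pytoeba's schema) of how a generic foreign key resolves: one query against a content_type registry to discover which table the row points at, then a second query against that table. Django's real ContentType system works along these lines, though with caching and more machinery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# a registry of models, mirroring django's content_type table
c.execute("CREATE TABLE content_type (id INTEGER PRIMARY KEY, table_name TEXT)")
c.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT)")
# the generic relation: (content_type_id, object_id) instead of a real FK
c.execute("CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT,"
          " content_type_id INTEGER, object_id INTEGER)")

c.execute("INSERT INTO content_type VALUES (1, 'sentence')")
c.execute("INSERT INTO sentence VALUES (42, 'Hello world.')")
c.execute("INSERT INTO tag VALUES (1, 'greeting', 1, 42)")

def resolve_generic_fk(cur, tag_id):
    # query 1: join against the registry to learn which table this tag targets
    cur.execute("SELECT t.object_id, ct.table_name"
                " FROM tag t JOIN content_type ct ON ct.id = t.content_type_id"
                " WHERE t.id = ?", (tag_id,))
    object_id, table = cur.fetchone()
    # query 2: fetch the actual row from the discovered table
    cur.execute(f"SELECT text FROM {table} WHERE id = ?", (object_id,))
    return cur.fetchone()[0]

print(resolve_generic_fk(c, 1))  # two round trips for one logical lookup
```

A plain foreign key would be a single indexed join; the generic version pays an extra join (or an extra query) on every access, which is exactly the overhead that piles up at scale.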

It also happens to be the core reason behind all of tatoeba's timeouts: inefficient queries involving joins over too many rows, and the goddamn awful mysql, which sucks at using indices and sucks at joins to the point that 3 selects are better than a join (which really defeats the point of a relational database completely). But anyway, going back to the story...

Instead of just using the damn apps and going to sleep, I was very curious,
and for every app I integrated or tried to integrate, I went ahead and read all
of its source code, or at least the parts I cared about most. And of course,
you can imagine the horrors... The end result was: rip out or reimplement
core models, and/or override classes if you can and include the package as is.
That was the mantra. It took a week to look at userena, django-messages,
django-guardian, python-social-auth, etc., adapt them, and be happy with
the results, sort of. It does mean that anyone else trying to improve this code
will need to know the dependencies quite well, though the python api will still
be usable without any of that knowledge.

- The Road to Hell and python the turtle:
The last 2 weeks went to optimizing the scipy backend. You see, I was getting
to know scipy's graph code better, and along the way learned quite a lot about
numpy. And then I realized... python gets in the way and defeats a lot of the
optimizations in scipy. So I went ahead, learned numpy properly, and then
rewrote the whole backend using numpy structures instead of python structures.
To say the least, the results were amazing: 1 ms to redraw the entire graph
(3 million nodes and 6.6 million links), and 4 µs to add or remove a link.

I don't think I was ready for this, so I tried to see if it was possible to
eliminate the Link table completely, and was very disappointed that I couldn't,
because I still needed to bookkeep nodes and links and build the graph, which,
after trying to optimize it to death, took 5 secs and 70 MB of RAM. So I
concluded it was possible, but I'd need to make a serious design change to what
I store in the Link table so I can avoid link bookkeeping entirely, because it
slowed things down too much. I won't have time to implement this during gsoc,
but it should be possible to store only information about the connected
components in a subgraph, instead of links, and eliminate link bookkeeping
entirely from the code. Of course, we could also eliminate rebuilding the graph
every time we need to calculate distances by using a graph database like
tatodb or orientdb just to map out the nodes. They build the graph once and use
traversal algorithms, instead of what I'm doing, which is building the subgraph
every time and calculating the shortest distances (a really big hack on top of
a relational db).

- The Road to Hell and raw sql on django:
Another thing I worked on for about a week was bulk operations. To give you
an overview, this is where we insert/update/delete lots of rows at the same
time instead of one query each. It's basically the holy grail of relational
databases, since they can do this quite well; they really scale with bulk operations.

The first obstacle was the realization that django's orm doesn't have any low-level abstraction for building sql queries in a flexible way. This of course came after going through the Query, QuerySet, and sql compiling code in django's core, which also revealed another interesting problem: the bulk operations offered by the orm at the moment are broken. .delete() on the queryset does 2 queries, a select to get the ids and a subsequent DELETE on those. It, along with bulk_insert, is batched with a non-configurable batch size of 500 or so, so multiple queries will be issued that break your updates into 500 rows at a time. The .update() on the queryset is broken in a more interesting way: it updates one field in the exact same way on all the rows in the queryset, so there's no way to have true bulk updates where you need to change multiple rows in multiple different ways, a change per row or whatever.

So I finally broke down and did the unthinkable: build the sql query and parametrize it dynamically in python, then issue it on the raw cursor connection to the db. This means you're on your own; mapping things back to objects will be hard, and django certainly won't do it for you. At the end of the week I had bulk_insert, bulk_update, and bulk_delete functions that did what I wanted and more, and took a few milliseconds to affect hundreds of thousands of rows.

- Future Roadmap:

Ok, now that you've put up with this wall of text, let's talk about pytoeba's future. There are officially 4 weeks left in gsoc, plus 1 week for reviewing stuff and 1 week for the final evaluation. Since tatoeba got some money already, I don't really care about the gsoc timeline or evaluations and will just keep working through those weeks, so that's 6 weeks left. I need a week to iron out some major problems in the api and a week to get more familiar with a shitton of javascript libraries, 4-6 weeks to finish the javascript part, 2 weeks to deal with other features we really need (like multilingual stemming) plus tests for everything, and 1 week for proper high-level documentation. So, let's set a date for an alpha release where you can all break pytoeba, and a beta release where we slowly transition to pytoeba as the main platform: alpha release on september 15, 2014, beta release on september 29, 2014. After GSOC is over, I'll move development from gerrit to github permanently and start adding proper tickets for everything that needs to be done, so anyone can join the coding sprint towards the release dates.

P.S. Almost everyone else is done with the core of their gsoc projects. I'm negotiating integration with harsh and will need to take a closer look at pallav's scripts to integrate them as well. Also, pytoeba's backend compiles on android! So we can hope to support offline features once the angular app is ported to mobile using phonegap or appcelerator titanium, something I'm very interested in doing. It should also work on ios and windows phone, as python has been compiled successfully on those platforms too, though I haven't actually tried. Here's to a brighter future for tatoeba, cheers :)

----------------------------
intro message:
http://tatoeba.org/eng/wall/sho...#message_19536
progress reports:
week 1: http://tatoeba.org/eng/wall/sho...#message_19654
week 2: http://tatoeba.org/eng/wall/sho...#message_19768
week 3: http://tatoeba.org/eng/wall/sho...#message_19821
weeks 4-8: http://tatoeba.org/eng/wall/sho...#message_20001
-----------------------------
work log:
http://en.wiki.tatoeba.org/arti...work-log-lool0

patches pending review:
https://review.gerrithub.io/#/q...s:open+pytoeba

merged patches:
https://github.com/loolmeh/pytoeba

project template for testing:
https://github.com/loolmeh/pytoeba-dev

gillux, 17 July 2014 13:30:23 UTC

You learned the leaky abstraction problem the hard way. I think it's a problem every developer (or anyone creating things to ease others' work) should be aware of. Here is a nicely written article about leaky abstractions: http://www.joelonsoftware.com/a...tractions.html

Good luck for the future of pytoeba.

saeb, 17 July 2014 14:41:41 UTC

Thanks. Also that was a pretty entertaining read.

Hybrid, 19 July 2014 16:06:52 UTC (edited 19 July 2014 16:23:45 UTC)

Thank you for your work. Do you think that you will be able to fix the Bad Gateway problem?

saeb, 19 July 2014 17:49:04 UTC

For the current website, no. Fixing it for good involves sinking a lot of time into optimizing queries and getting the current pages to be easier to cache, but I'm focused on pytoeba atm. I wrote about what needs to be done here:

http://tatoeba.org/eng/wall/sho...#message_19961

So if anyone has the time and necessary knowledge to make it happen it would be great.

Hybrid, 22 July 2014 02:38:59 UTC (edited 22 July 2014 15:55:37 UTC)

Thank you. It's too bad, because today Tatoeba was down for 7-8 hours (and I'm not sure it's over yet...). Edit: I see sentences from 24 hours ago at the bottom of the latest contributions, so that's about how long it has been down now.

saeb, 23 July 2014 01:04:41 UTC (edited 23 July 2014 01:05:45 UTC)

Long timeouts are out of our control (no amount of optimization will fix this); someone will just have to talk to the FSF and convince them to move us to another machine that isn't heavily used by them.

Hybrid, 23 July 2014 02:21:52 UTC (edited 23 July 2014 02:36:00 UTC)

Thank you. Do you think that maybe Google could give you a new machine? I know they're helping you out, and I think they have a lot of servers to make Google work. Maybe they could spare one for Tatoeba?

saeb, 23 July 2014 03:33:03 UTC

We don't have any official relationship with Google, so we can't ask for anything like that. GSoC is a sort of short-term internship for students; that's the only connection we have with them, through their community project manager.

Hybrid, 23 July 2014 04:56:03 UTC (edited 23 July 2014 05:03:06 UTC)

I understand. I guess that we should be thanking the FSF because without them Tatoeba wouldn't exist. Thank you FSF!

JimBreen, 26 July 2014 04:37:22 UTC

What sort of server capacity does Tatoeba need/use? I rent a 1GB server (Ubuntu) from Rackspace for about $50/mo and for most of the time it's very lightly loaded. Given the chronically limping performance recently perhaps it's time to think of an alternative home for Tatoeba.

saeb, 26 July 2014 06:59:51 UTC

$50/mo sounds crazy expensive, so I have no idea what else you're getting for your money. I have a 1 GB RAM SSD machine for $7/mo. Tatoeba at the moment has 3 GB of RAM and incessantly swaps an extra 1~1.5 GB, so a 4 GB RAM machine would be very comfy for the current codebase. That goes for $28/mo with my current provider. We'll hopefully do something about it early august.

JimBreen, 26 July 2014 07:31:20 UTC

OK. Sounds like you are seriously underpowered at present. I'd be glad to make a donation if the project has to start paying.

Silja, 26 July 2014 13:09:10 UTC

+ 1

tatoebix, 20 July 2014 08:33:45 UTC

On the database side, why still use mysql? Use Firebird; have a look at the latest release, 2.5.3, or the upcoming 3.0 release. We have databases with more than 60 million rows and up to 20 joined tables, and indexes work fine too!
For local testing we loaded Tatoeba data into Firebird tables, with separate views
for languages with more than 2000 sentences, and everything works nicely.
Maybe something to contemplate for the future development cycle...

saeb, 20 July 2014 08:50:17 UTC

The same reason we're not using postgresql instead, and I have taken the time to look into this: we just have lots of sql that would need to be ported because it's mysql specific. The easiest thing we can do is try to make use of mariadb's BNLH joins, since we wouldn't have to port any code. I've written migration scripts for this, but it seems no one has had the time to see them through on the server or benchmark the resulting setup properly. Of course, if you have time to port things to firebird or postgresql in the meantime, that would be great too. As far as pytoeba is concerned this won't be a problem in the future; the code is portable across any database backend that django supports, or can be made to support with a third-party app or a custom django backend.

sacredceltic, 20 July 2014 17:35:33 UTC

That's why database-engine-agnostic frameworks were invented... the framework should generate the SQL, not the programmers. I develop on MySQL and deploy on PostgreSQL. Vive la liberté !

saeb, 20 July 2014 22:49:00 UTC

It comes with a price though. I think rails has a less leaky ORM abstraction than django, but still, there are probably better ways of writing queries that the ORM can't express. I might port some of the raw queries in pytoeba to peewee or sqlalchemy in the future, since they have low-level abstractions that tightly fit sql.