saeb's messages on the Wall (total 221)

saeb, May 9, 2016 at 4:12:03 AM UTC

I remember mentioning in a ticket not too long ago that a dialog box warning the user about their intention would eventually be needed for the editing case ^^"

saeb, May 9, 2016 at 4:10:01 AM UTC

ah I see what you mean now

saeb, May 9, 2016 at 4:01:06 AM UTC, edited May 9, 2016 at 4:05:04 AM UTC

Awful race condition: Bill edited the sentence without unlinking it first, which pulled in a different subgraph with all its links, and now your careless unlinking broke the Hungarian sentence off from the other subgraph. What a mess.

saeb, August 4, 2015 at 11:25:17 AM UTC

Just a small note. If we wanted to use a Bayesian model on this data, it would be impossible to interpret without knowing explicitly what each answer is about; otherwise you wouldn't know what bias is being represented. So there are two flaws with the current design: are the questions understood by the answering users in the intended way, and are the questions explicit enough to be interpreted meaningfully? Whatever data we collect will be virtually useless unless we answer those two questions well first. So if we want to capture community bias on the prescriptivist/descriptivist spectrum, the spoken/written spectrum, or the in-group/out-group spectrum, each would need its own set of very explicit questions.

saeb, June 1, 2015 at 9:44:57 AM UTC

It wasn't, actually. I was fiddling around with the incremental option yesterday; it runs regularly as of today.

saeb, August 22, 2014 at 8:38:47 PM UTC

This sort of reply is unacceptable. You're on a collaborative project and posting on a public wall. Expect criticism of, and improvements to, whatever you propose. There's no need to antagonize or alienate other users here.

saeb, August 22, 2014 at 8:34:56 PM UTC

I'll just reiterate what demetrius already said. This is sort of useless at the level of a sentence. I think we should treat this the way we currently treat alternative scripts and phonetic transcriptions: we can train a part-of-speech tagger and an inflector for whatever language we want to support, autogenerate some annotations in a per-word field, and allow users to fix errors etc. Adding this field to an index would then make it searchable, so you could even search for patterns that mix all kinds of cases and grammatical structures.
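To make the idea concrete, here's a minimal sketch (not from the original post; the sentence data, tags, and the `lemma:`/`pos:` query prefixes are all invented for illustration) of how per-word annotations could be indexed alongside surface forms and then searched in mixed patterns:

```python
from collections import defaultdict

# Hypothetical per-word annotation layer: each sentence stores, per token,
# a surface form, a lemma, and a part-of-speech tag. A real system would
# autogenerate these with a tagger/inflector and let users fix errors.
sentences = {
    1: [("cats", "cat", "NOUN"), ("sleep", "sleep", "VERB")],
    2: [("she", "she", "PRON"), ("sleeps", "sleep", "VERB")],
}

# Inverted index over both surface forms and annotations, so one query can
# mix literal words with grammatical patterns like "lemma:sleep" or "pos:NOUN".
index = defaultdict(set)
for sid, tokens in sentences.items():
    for surface, lemma, pos in tokens:
        index[surface].add(sid)
        index["lemma:" + lemma].add(sid)
        index["pos:" + pos].add(sid)

def search(*terms):
    """AND-search across surface words, lemmas, and POS tags."""
    results = [index.get(t, set()) for t in terms]
    return set.intersection(*results) if results else set()

print(search("lemma:sleep"))              # {1, 2}: both inflections match
print(search("lemma:sleep", "pos:NOUN"))  # {1}: mixes a lemma with a POS tag
```

In a real deployment the annotation field would live in the search engine's index rather than an in-memory dict, but the query shape is the same.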

saeb, August 20, 2014 at 11:29:04 PM UTC

Short answer: no.

Long answer: term-based indexing just isn't compatible with wildcard searching. There's no search engine software I know of that has this enabled by default or has a sane implementation of it. Theoretically, though, if you know where your search engine keeps its term list and in what format, you can run some sort of regex search over it, limit the results to 10 terms or so, and insert the matching terms into the search with OR as a separator.
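A rough sketch of that workaround, assuming you've already dumped the engine's term list into memory (the term list here and the `expand_wildcard` helper are illustrative, not any real engine's API):

```python
import re

# Stand-in for a search engine's term dictionary; in practice you'd read it
# from wherever your engine stores its term list, in its own on-disk format.
term_list = ["run", "runner", "running", "rung", "rust", "rutabaga"]

def expand_wildcard(pattern, limit=10):
    """Expand a 'run*'-style wildcard against the term list, cap the
    expansion, and rewrite it as an OR query the engine can execute."""
    regex = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")
    matches = [t for t in term_list if regex.match(t)][:limit]
    return " OR ".join(matches)

print(expand_wildcard("run*"))  # run OR runner OR running OR rung
```

Capping the expansion matters: an unbounded wildcard over a large term list would blow up into an enormous OR query.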

saeb, August 14, 2014 at 11:06:14 AM UTC

It seems to be working fine right now. I went ahead and switched nginx's config from unix sockets, which don't scale that well for us, to TCP sockets.
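For reference, the kind of change described looks roughly like this in an nginx config (a sketch only; the socket path, port, and upstream name are assumptions, not Tatoeba's actual settings):

```nginx
# Before: proxying the app backend over a unix socket
# upstream app {
#     server unix:/var/run/app.sock;
# }

# After: same upstream, but over a local TCP socket
upstream app {
    server 127.0.0.1:8000;
}

server {
    location / {
        proxy_pass http://app;
    }
}
```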

saeb, August 9, 2014 at 6:14:15 AM UTC, edited August 9, 2014 at 6:20:47 AM UTC

I tried looking at this in the morning to see if a small bash script could just generate and insert the necessary HTML, at least in index.ctp, but quickly realized that I have no clue how to do this: the header is in default.ctp, which all views use, so the current path and the list of interface languages would have to be passed in some variable to some element that can be used in default.ctp.

anyway, here are the relevant tickets so far:

https://github.com/Tatoeba/tatoeba2/issues/243
https://github.com/Tatoeba/tatoeba2/issues/393
https://github.com/Tatoeba/tatoeba2/issues/394

saeb, July 26, 2014 at 6:59:51 AM UTC

$50/mo sounds crazy expensive, so I have no idea what else you're getting for your money. I have a 1 GB RAM SSD machine for $7/mo. Tatoeba at the moment has 3 GB RAM and incessantly swaps an extra 1-1.5 GB, so a 4 GB RAM machine would be very comfy for the current codebase. That goes for $28/mo with my current provider. We'll hopefully do something about it in early August.

saeb, July 23, 2014 at 3:33:03 AM UTC

We don't have any official relations with Google and as such can't demand anything like that. GSoC is a sort of short-term internship for students; our only connection with them is through their community project manager.

saeb, July 23, 2014 at 1:04:41 AM UTC, edited July 23, 2014 at 1:05:45 AM UTC

Long timeouts are out of our control (no amount of optimization will fix this); someone will just have to talk to the FSF and convince them to move us to another machine that isn't heavily used by them.

saeb, July 20, 2014 at 10:49:00 PM UTC

It comes at a price though. I think Rails has a less leaky ORM abstraction than Django, but still, there are probably better ways of writing queries that the ORM can't express. I might port some of the raw queries in pytoeba to peewee or SQLAlchemy in the future, since they have low-level abstractions that tightly fit SQL.

saeb, July 20, 2014 at 8:50:17 AM UTC

The same reason we're not using PostgreSQL instead, and I have taken the time to look into this: we just have lots of SQL that would need to be ported because it's MySQL-specific. The easiest thing we can do is try to make use of MariaDB's BNLH joins, since we won't have to port any code. I've written migration scripts for this, but it seems no one has had the time to see them through on the server or benchmark the resulting setup properly. Of course, if you have time to port things to Firebird or PostgreSQL in the meantime, that would be great too. As far as pytoeba is concerned this won't be a problem in the future: the code is portable across any database backend that Django supports, or can be made to support with a third-party app or a custom Django backend.

saeb, July 20, 2014 at 2:11:15 AM UTC, edited July 20, 2014 at 2:13:53 AM UTC

I've answered this here:

http://tatoeba.org/eng/wall/sho...#message_19961

The page you linked happens to be one of those with a heavy query, and the server was crapping out pretty badly today.

saeb, July 19, 2014 at 5:49:04 PM UTC

For the current website, no. Fixing it for good involves sinking a lot of time into optimizing queries and making the current pages easier to cache, but I'm focused on pytoeba atm. I wrote about what needs to be done here:

http://tatoeba.org/eng/wall/sho...#message_19961

So if anyone has the time and the necessary knowledge to make it happen, that would be great.

saeb, July 17, 2014 at 2:41:41 PM UTC

Thanks. Also that was a pretty entertaining read.

saeb, July 17, 2014 at 4:02:22 AM UTC, edited July 17, 2014 at 4:32:21 AM UTC

** GSOC Progress Report weeks 4-8 **

I'm sorry again that I haven't been writing more reports and keeping you
guys in the loop. Basically, I was in a position where I had broken everything
pretty badly and didn't want to write a report every week that said
"I'm a complete failure and nothing works". Anyway, the backend is almost done
and I'm in a better position now, so let's talk about pytoeba's current status
and future, shall we?

Done:
- User model centralization
- User private messaging
- Wall posts
- User voting
- User status/language web of trust
- python-social-auth integration
- haystack integration
- xapian-haystack indices for sentences, users, messages, comments
- tastypie integration for models
- basic tastypie integration for the python api (still needs work)
- major sql optimizations to the python api
- major numpy optimizations to the graphing backend on CPython
- alternative dependencies that are all pure python with no C required.
So we're now basically compatible with any python flavor or platform
with no extra compilation.
- iplus1 integration

Not Done:
- views (dropped during week 4 completely)
- js interface (there's still time for this and will be the top priority for the
rest of the time I have during gsoc)
- internationalization (well, no interface yet)
- multilingual stemming for haystack/xapian indices/queries
(unfortunately this needs an invasive patch into haystack's/xapian haystack's
code, so this won't be covered during gsoc)

So let's compare this with the original goals in my proposal which I'll copy
verbatim and mark with ✔ for complete support, ✘ for no support and ~ for
partial support:

Replicate the current functionality of the website:
✔ CRUD operations on sentences, comments, and wall posts
✔ logging of all operations on sentences
~ user profiles, status, and permissions
~ browsing of sentences, tags, and users
✔ messaging
✔ searching for sentences
✘ integration of transliteration
✘ internationalization of the entire website
Expand the website's functionality:
✔ a corrections framework
~ a corrections page with latest corrections that have been accepted, rejected, forced, or are overdue
~ uploading of audio
✔ advanced queries
✔ fetching translations up to the nth depth
✔ a user proficiency web of trust
✔ a user status web of trust
~ speed blocking/deletion of malicious users
~ a full fledged forum that can also be viewed as a wall
~ requesting words and translations
✘ sentence subscription system
✘ a user following system
✘ a whole new customisable user page that acts as a notification system for the latest actions on subscribed sentences, the latest actions from followed users, etc...
✘ rss and atom feeds for most pages that are essentially list views
✔ Enhance the database schema by using graph algorithms and/or integrating a graph database
Fully cache database queries and templates using memcached, and integrate an outside caching system such as varnish for server generated pages
✔ Build a set of python functions to manipulate the database and perform all functionalities that constitute an inner api, and use it in all view code and in future modules that extend the website
~ Build a RESTful api on top of the python api and django orm through tastypie
✘ Rewrite the UI to be crossplatform, dynamic, client-side, API compliant code.
~ Write a battery of tests that cover all the codebase
~ Provide administrative scripts for tasks such as importing/exporting db fixtures, importing/exporting csvs, adding new languages, extracting/updating mo/po files, search indexing, deployment on a development or production machine using vagrant/ansible
✘ Provide Help bot scripts that clean up corrections, among other things
✘ Provide an interface to all the admin scripts and help bot scripts where they can be executed manually or a cron job can be added and tweaked using the interface for them

This was done in ~8 weeks and still isn't close to covering everything. So I'm
essentially 2 weeks behind, actually 3, considering I'm still working on the
backend this week too. Testing stopped being a top priority, and code review
kind of died since liori was on vacation for like 2 weeks and had a family thing
for another week recently. Also, the whole "huge patch" thing made him less likely to give me any comprehensive feedback, so the feedback I got on the recent patches was a bird's-eye view of certain things I'm doing in the codebase that he doesn't like. But to really understand what happened, I'll have to walk you through all the things I decided to do after week 3, which were basically obsessing over adding more features, optimizing everything to death, and doing more interesting things like reading the codebases of third-party libraries instead of just making a website. Which brings us to...

- The Road to Hell (featuring lool0, not Dante):

The road to hell is interesting. At the gates you're told to abandon all hope,
and in the deepest parts you watch Satan eating Brutus, Cassius, and Judas. And
at each level, a sin with a poetic punishment.

- The Road to Hell and the feature craze:

Very early on I decided testing was making everything too slow, and decided
to keep adding as many features as possible with no regard for whether anything
even worked. You can imagine this led to major breakage, and it did, to the
point that I made so many changes to the schema that every feature I developed
wasn't even manually testable; to manually test anything I had to isolate the
feature and use the last stable commit from week 3. I only started recovering
from this last week, but most tests still need to be readapted before they
start passing. Only 6 of them pass as of the time of this report.

- The Road to Hell and the optimization craze:

You'd expect that having libraries that do everything for you would speed up
development so much you'd be done with everything in a week and then
be sleeping for the next 11 weeks. But noooo... Enter the optimization
craze. Let's first dispel a great myth called the 'reusable app'. In
the python/django world, there's this idea that it's great to invest time
in making these pythonic, generic, pluggable apps that can work with any other
django app out of the box. There are tons of these around and they can probably
do anything you can think of. You just need a few lines of configuration
and you're done; you get all of their features for free. This is in fact the
basis of the contrib package in django's core: the auth system, the admin
system, etc...

- The Road to Hell and Reusability, Abstractions, and the ORM:
But let me tell you why python and django are always accused of not
being able to scale for shit. It can be summed up in two words: 'leaky abstractions'.
Django tries too hard to be pythonic and ends up wrapping the database in
a very pythonic abstraction known as the "ORM". What this does in the long
run is: the more generic the app, the worse the queries it will make for
you. At the heart of the problem is this thing called the ContentType system.
It allows things like "generic foreign keys", where you can have
a relation to a table dynamically without ever having to write and generate
the schema for it. This means more joins for every query. More work for your
database. Worse response times, and less scalability. It's the same problem
you get with multi-table inheritance (which, btw, some reusable apps I looked
at during the last 4 weeks actually used): death by joins.
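A toy illustration of the "death by joins" point, using sqlite3 with invented table names (this mimics, but is not, Django's actual ContentType schema): the generic relation needs an extra join through the content-type table before it can even reach the target row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT);

-- Direct FK: fetching a comment's sentence is a single join.
CREATE TABLE comment (
    id INTEGER PRIMARY KEY, sentence_id INTEGER, body TEXT);

-- Generic FK: the target table is itself data, so every lookup must also
-- resolve the content type before it can join to the right table.
CREATE TABLE content_type (id INTEGER PRIMARY KEY, model TEXT);
CREATE TABLE generic_comment (
    id INTEGER PRIMARY KEY, content_type_id INTEGER,
    object_id INTEGER, body TEXT);
""")
db.execute("INSERT INTO sentence VALUES (1, 'hello')")
db.execute("INSERT INTO comment VALUES (1, 1, 'nice')")
db.execute("INSERT INTO content_type VALUES (7, 'sentence')")
db.execute("INSERT INTO generic_comment VALUES (1, 7, 1, 'nice')")

# Direct FK: one join.
direct = db.execute("""
    SELECT s.text FROM comment c
    JOIN sentence s ON s.id = c.sentence_id
""").fetchone()

# Generic FK: an extra join through content_type for the same answer.
generic = db.execute("""
    SELECT s.text FROM generic_comment g
    JOIN content_type ct ON ct.id = g.content_type_id
                        AND ct.model = 'sentence'
    JOIN sentence s ON s.id = g.object_id
""").fetchone()

print(direct, generic)  # ('hello',) ('hello',)
```

Same result either way; the generic version just makes the database do strictly more work on every query, which is the cost the post is complaining about.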

It also happens to be the core reason behind all of Tatoeba's timeouts: inefficient queries involving joins over too many rows, and the goddamn awful MySQL that sucks at using indices, and sucks at joins to the point that 3 selects are better than a join (which really defeats the point of a relational database completely). But anyway, back to the story...

Instead of just using the damn apps and going to sleep, I was very curious,
and for every app I integrated or tried to integrate, I went ahead and read
all of its source code, or at least the parts I cared about most. And of
course, you can imagine the horrors... So the end result was: rip out or
reimplement core models, and/or override classes if you can and include the
package as is. That was the mantra. It took a week to look at userena,
django-messages, django-guardian, python-social-auth, etc., adapt all that,
and be happy with the results, sort of. This means anyone else trying to
improve this code will need to know the dependencies quite well, though the
python api will still be usable without any of this knowledge.

- The Road to Hell and python the turtle:
The last 2 weeks went to optimizing the scipy backend. You see, I was getting
to know scipy's graph code better and then learned quite a lot about
numpy. And then I realized... Python gets in the way and defeats lots of the
optimizations in scipy. So I went ahead and learned numpy properly and then
rewrote the whole backend using numpy structures instead of python structures.
To say the least, the results were amazing: 1 ms to redraw the entire graph
(3 million nodes and 6.6 million links), and 4 µs to add or remove a link.
I don't think I was ready for this, so I tried to see if it was
possible to eliminate the Link table completely, and was very disappointed
that I couldn't, because I still needed to bookkeep nodes and links and build
the graph, which, after I tried to optimize it to death, took 5 secs and 70 MB
of RAM. So I concluded it was possible, but I'd need to make a serious design
change to what I store in the Link table so I can avoid link bookkeeping
entirely, because it slowed shit down too much. I won't have time to implement
this during gsoc, but it should be possible to store only information about
the connected components in a subgraph instead of links, and eliminate link
bookkeeping from the code entirely.

Of course, we could also eliminate rebuilding the graph every time we need to calculate distances by using a graph database like tatodb or orientdb just to map out the nodes. They build the graph once and use traversal algorithms, instead of what I'm doing, which is building the subgraph every time and calculating the shortest distances (a really big hack on top of a relational db).
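The "store connected components instead of links" idea can be sketched with a plain union-find structure (illustrative code, not pytoeba's actual backend): each node carries a component label, so answering "are these two sentences in the same translation subgraph?" needs no link bookkeeping and no graph rebuild.

```python
# Union-find over sentence ids: component membership is tracked directly,
# so same-subgraph queries are near-constant time.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the component representative for node x."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Record a link: merge the components of a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

uf = UnionFind()
for a, b in [(1, 2), (2, 3), (4, 5)]:  # hypothetical translation links
    uf.union(a, b)

print(uf.find(1) == uf.find(3))  # True: same subgraph
print(uf.find(1) == uf.find(4))  # False: different subgraphs
```

Note that the hard case is unlinking, since splitting a component cheaply is exactly what union-find doesn't give you for free; that's part of why the post calls this a serious design change rather than a drop-in fix.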

- The Road to Hell and raw sql on django:
Another thing I worked on for about a week was bulk operations. To give you
an overview, this is where we insert/update/delete lots of data at the same
time instead of with one query each. It's basically the holy grail of
relational databases, as they can do this quite well; they really scale with
bulk operations.

The first obstacle was realizing that django's orm doesn't have any low-level abstraction for building sql queries in a flexible way. This of course came after going through the Query, QuerySet, and sql compiling code in django's core, which also revealed another interesting problem: the bulk operations the orm currently offers are broken. .delete() on the queryset does 2 queries: a select to get the ids and a subsequent DELETE on those. It, along with bulk_insert, is batched with a non-configurable batch size of 500 or so, so multiple queries get issued that break your changes into 500 rows at a time. The .update() on the queryset is broken in a more interesting way: it updates one field in the exact same way on all the rows in the queryset, so there's no way to have true bulk updates where you need to change multiple rows in multiple different ways, a different change per row.

So I finally broke down and did the unthinkable: build the sql query and parametrize it dynamically in python, then issue it on the raw cursor connection to the db. This means you're on your own; mapping things back to objects will be hard, and django certainly won't do it for you. At the end of the week I had bulk_insert, bulk_update, and bulk_delete functions that did what I wanted and more, and took a few milliseconds to affect hundreds of thousands of rows.
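The core trick behind such a hand-rolled bulk_update can be sketched like this (shown on sqlite3 for portability; pytoeba's real functions aren't reproduced here, and the CASE/WHEN shape is one common way to do it, not necessarily the one the post used): a single parametrized statement that applies a different change to each row.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, text TEXT)")
db.executemany("INSERT INTO sentence VALUES (?, ?)",
               [(i, "old") for i in range(1, 4)])

# A different value per row, in ONE statement, instead of one UPDATE per row
# or the ORM's "same value for every row" .update().
changes = [(1, "first"), (2, "second"), (3, "third")]  # (id, new_text)

cases = " ".join("WHEN ? THEN ?" for _ in changes)
placeholders = ",".join("?" for _ in changes)
sql = (f"UPDATE sentence SET text = CASE id {cases} END "
       f"WHERE id IN ({placeholders})")

# Parameters: (id, new_text) pairs for the CASE arms, then the id list.
params = [p for row in changes for p in row] + [i for i, _ in changes]
db.execute(sql, params)

print(db.execute("SELECT text FROM sentence ORDER BY id").fetchall())
# [('first',), ('second',), ('third',)]
```

Because the statement is built dynamically but still fully parametrized, it avoids both per-row round trips and SQL injection; the same shape scales to thousands of rows per statement (subject to the database's parameter limits).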

- Future Roadmap:

Ok, now that you've put up with this wall of text, let's talk about pytoeba's future. There are officially 4 weeks left in gsoc, 1 week for reviewing stuff, and 1 week for the final evaluation. Since tatoeba got some money already, I don't really care about the gsoc timeline or evaluations and will just keep working through those weeks, so that's 6 weeks left. I need a week to iron out some major problems in the api and a week to get more familiar with a shitton of javascript libraries; 4-6 weeks to finish the javascript part, 2 weeks to deal with other features we really need, like multilingual stemming and tests for everything, and 1 week for proper high-level documentation. So let's set a date for an alpha release, where you can all break pytoeba, and a beta release, where we slowly transition to pytoeba as the main platform: alpha release on September 15, 2014, and beta release on September 29, 2014. After gsoc is over, I'll move development from gerrit to github permanently and start adding proper tickets for everything that needs to be done, so anyone can join the coding sprint towards the release dates.

P.S. Almost everyone else is done with the core of their gsoc projects. I'm negotiating integration with harsh and will need to take a closer look at pallav's scripts to integrate them as well. Also, pytoeba's backend compiles on Android! So we can hope to support offline features once the angular app is ported to mobile using phonegap or appcelerator titanium, something I'm very interested in doing. This should also work on iOS and Windows Phone, as python has been compiled successfully on those platforms too, though I haven't actually tried. Here's to a brighter future for tatoeba, cheers :)

----------------------------
intro message:
http://tatoeba.org/eng/wall/sho...#message_19536
progress reports:
week 1: http://tatoeba.org/eng/wall/sho...#message_19654
week 2: http://tatoeba.org/eng/wall/sho...#message_19768
week 3: http://tatoeba.org/eng/wall/sho...#message_19821
weeks 4-8: http://tatoeba.org/eng/wall/sho...#message_20001
-----------------------------
work log:
http://en.wiki.tatoeba.org/arti...work-log-lool0

patches pending review:
https://review.gerrithub.io/#/q...s:open+pytoeba

merged patches:
https://github.com/loolmeh/pytoeba

project template for testing:
https://github.com/loolmeh/pytoeba-dev

saeb, July 16, 2014 at 9:37:45 AM UTC

It has to be stored first, but yes it should be possible.