tl;dr: I made Patchwork a lot faster by looking at what database queries were being generated and optimizing them either by making Django produce better queries or by adding better indexes.
Introduction to Patchwork
One of the key bits of infrastructure a bunch of maintainers of Open Source Software use is a tool called Patchwork. We use it for a bunch of OpenPOWER firmware development, several Linux subsystems use it as well as freedesktop.org.
The purpose of Patchwork is to supplement the patches-to-a-mailing-list development work flow. It allows a maintainer to see all the patches that have been posted on the list, How many Acked-by/Reviewed-by/Tested-by replies they have, delegate responsibility for the patch to a co-maintainer, track (and change) the state of the patch (e.g. to “Under Review”, “Changes Requested”, or “Accepted”), and create bundles of patches to help in review and testing.
Since patchwork is an open source project itself, there’s several instances of it out there in common use. One of the main instances is https://patchwork.ozlabs.org/ which is (funnily enough) used by a bunch of people connected to OzLabs for projects that are somewhat connected to OzLabs. e.g. the linuxppc-dev project and the skiboot and petitboot projects. There’s also a kernel.org instance, which is used by some kernel subsystems.
Recent versions of Patchwork have added some pretty cool features such as the ability to integrate with CI systems such as Snowpatch which helps maintainers see if patches submitted are likely to break things.
Unfortunately, there’s also been some complaints that recent version of patchwork have gotten slower than previous ones. This may well be the case, or it could just be that the volume of patches is much higher and there’s load on the database. Anyway, after asking a few questions about what the size and scope was of the patchwork database on ozlabs.org, I went “hrm… this sounds like it shouldn’t really be a problem… perhaps I should look into this”.
Attacking the problem…
Every so often it is revealed that I know a little bit about databases.
Getting a development environment up for Patchwork is amazingly easy thanks to Docker and the great work of the Patchwork maintainers. The only thing you need to load in is an example dataset. I started by importing mail from a few mailing lists I’m subscribed to, which was Good Enough(TM) for an initial look.
Due to how Django forces us to design a database schema though, the suggested method of getting a sample data set will not mirror what occurs in a production system with multiple lists. It’s for this reason that I ended up using a copy of a live dataset for much of my work rather than constructing an artificial one.
Patchwork supports both a MySQL and PostgreSQL database backend. Since the one on ozlabs.org is backed by PostgreSQL, I ended up loading a large dataset into PostgreSQL for most of my work, although I also did some testing with MySQL.
The current patchwork.ozlabs.org instance has a database of around 13GB in side, with about a million patches. You may think this is big, my database brain goes “no, this is actually quite small and everything should be a lot faster than it is even on quite limited hardware”
The problem with ORMs
It turns out that Patchwork is written in Django, an ORM (Object-Relational Mapping) framework in Python – and thus something that pretty effectively obfuscates application code from the SQL being run.
There is one thing that Django misses that could be a pretty big general performance boost to many applications: it doesn’t support composite primary keys. For some databases (e.g. MySQL’s InnoDB engine) the PRIMARY KEY is a clustered index – that is, the physical layout of the rows on disk reflect primary key order. You can use this feature to your advantage and have much higher cache hits of your database pages.
Unfortunately though, we cannot do that with Django, so we lose a bunch of possible performance because of it (especially for queries that are going to have to bring in data from disk). In fact, we’re forced to use an ID field that’ll scatter our rows all over the place rather than do something efficient. You can somewhat get back some of the performance by creating covering indexes, but this costs in terms of index maintenance and disk space.
It should be noted that PostgreSQL doesn’t have a similar concept, although there is a (locking) CLUSTER statement that can (as an offline operation for the table) re-arrange existing rows to be in index order. In my testing, this can give a bit of a boost to performance of some of the Patchwork queries.
With MySQL, you’d look at a bunch of statistics on what pages are being brought in and paged out of the InnoDB buffer pool. With PostgreSQL it’s a bit more complex as it relies heavily on the OS page cache.
My main experience is with MySQL like environment, so I’ve had to re-learn a bunch of PostgreSQL things in this work which was kind of fun. It may be “because of my upbringing” but it seems as if there’s a lot more resources and documentation out in the wild about optimizing MySQL environments than PostgreSQL ones, especially when it comes to documentation around a bunch of things inside the database server. A lot of credit should go to the MySQL Documentation team – I wish the PostgreSQL documentation was up to the same standard.
Another issue is that fetching BLOBs is generally an expensive operation that you want to avoid unless you’re going to use them. Thus, fetching the whole “object” at once isn’t always optimal. The Django query generation appears to be somewhat buggy when it comes to “hey, don’t fetch these columns, I don’t need them”, so you do have to watch what query is produced not just what query you expect to be produced. For example, [01/11] Improve patch listing performance (~3x).
Another issue with Django is how you go from your Python code to an actual SQL query, especially when the produced SQL query is needlessly complex or inefficient. I’ve seen Django always produce an ORDER BY for one table, even when not needed, I’ve also seen it always join tables even when you’re getting no columns from one of them and there’s no way you’re asking for it. In fact, I had to revert to raw SQL for one of my performance improvements as I just couldn’t beat it into submission: [10/11] Be sensible computing project patch counts.
An ORM can be great for getting apps out quickly, or programming in a familiar way. But like many things, an understanding of what is going on underneath is key for extracting maximum performance.
Also, if you ever hear something like “ORM $x doesn’t scale” then maybe that person just hasn’t looked at how to use the ORM better. The same goes for if they say “database $y doesn’t scale”- especially if it’s a long existing relational database such as MySQL or PostgreSQL.
Speeding up listing current patches for a project
Fortunately though, the Django development environment lets you really easily dive into what queries are being generated and (at least roughly) where they’re being generated from. There’s a sidebar in your browser that shows how many SQL queries were needed to generate the page and how long they took. The key to making your application go faster is to run fewer queries in less time.
I was incredibly impressed with how easy it was to see what queries were run, where they were run from, and the EXPLAIN output for them.
By clicking on that SQL button on the right side of your browser, you get this wonderful chart of what queries were executed, when, and how long they took. From this, it is incredibly obvious which query is the most problematic: the one that took more than four seconds!
In the dim dark days of web development, you’d have to turn on a Slow Query Log on the database server and then grep through your source code or some other miserable activity. I am so glad I didn’t have to do that.
This particular query was a real hairy one, the EXPLAIN output from PostgreSQL (and MySQL) was certainly long and involved and would most certainly not feature in the first half of an “Introduction to query optimization” day long workshop. If you haven’t brushed up on various bits of documentation on understanding EXPLAIN, you should! The MySQL EXPLAIN FORMAT=JSON is especially fantastic for getting deep details as to what’s going on with query execution.
The big performance gain here was to have the database be able to execute the query in a much more efficient way by creating two covering indexes for part of the query. To work out what indexes to create, one has to look at the EXPLAIN output and work out why the database is choosing to do either a sequential scan of a large table, or use an index that doesn’t exclude that many rows. In this case, I tweaked the code to slightly change the query that was generated as well as adding a covering index. What we ended up with is something that is dramatically faster.
You’ll notice that it appears that the first query there takes a lot more time but it doesn’t, it just takes a lot more time relative to the main query.
In fact, this particular page is one that people have mentioned at being really, really slow to load. With the main query now about 350 times faster than it was originally, it shouldn’t be a problem anymore.
A future improvement would be to cache the COUNT() for the common case, as it’s pretty easily computed when new patches come in or states change.
The patches that help this particular page have been submitted upstream here:
- [01/11] Improve patch listing performance (~3x)
- [05/11] Add covering index for /list/ query
- [06/11] check distinct(user) based on just user_id
- [08/11] Add covering index to patchwork_submissions for /list/ queries
Making viewing a patch with comments faster
Now that we can list patches faster, can we make other pages that Patchwork has quicker?
One such page is viewing a patch (or cover letter) that has a lot of comments on it. One feature of Patchwork is that it will display all the email replies to a patch or cover letter in the Web UI. But… this seemed slow
On one of the most commented patches I could find, we ended up executing one hundred and seventy seven SQL queries to view it! If we dove into it, a bunch of the queries looked really really similar…
The problem here is that the Patchwork UI is wanting to find out the name of each person who submitted a comment, and is doing that by querying the ID from a table. What it should be doing instead is a SQL JOIN on the original query and just fetching all that information in one go: make the database server do the work, it’s really good at it.
My patch [02/11] 4x performance improvement for viewing patch with many comments Â does just that by using the select_related() method correctly, as well as being explicit about what information we want to retrieve.
With that patch, we’re down to a constant number of queries and around a 3x-7x faster time executing them depending if we have a warm cache or not.
The one time I had to use raw SQL
When viewing a project page (such as https://patchwork.ozlabs.org/project/qemu-devel/ ) it displays the number of patches (archived and not archived) for the project. By looking at what SQL queries are executed to collect these numbers, you’ll notice two things. First, here are the queries:
First thing you’ll notice is that they took a loooooong time to execute (more than a second each). The second thing, if you look closer, is that they contain a join which is completely unneeded.
I spent a good long while trying to make Django behave, and I just could not. I believe it’s due to the model having some inheritance in it. Writing the query by hand ended up being the best solution, and it gave a significant performance improvement:
Arguably, a better way would be to precompute the count for the archived/non-archived patches and just display them. I (or someone else who knows more about Django) may want to look at that for a future improvement.
Conclusion and final thoughts
There’s a few more places where there could be some optimizations, but currently I cannot get any single page to take more than between 40-400ms in the database when running on my laptop – and that’s Good Enough(TM) for now.
The next steps are getting these patches through a round or two of review, and then getting them into a Patchwork release and deployed out on patchwork.ozlabs.org and see if people can find any new ways to make things slow.
If you’re interested, the full patchset with cover letter is here: [00/11] Performance for ALL THE THINGS!
The diffstat is interesting, as most of the added code is auto-generated by Django for database migrations (adding of indexes).
.../migrations/0027_add_comment_date_index.py | 23 +++++++++++++++++ .../0028_add_list_covering_index.py | 19 ++++++++++++++ .../0029_add_submission_covering_index.py | 19 ++++++++++++++ patchwork/models.py | 21 ++++++++++++++-- patchwork/templates/patchwork/submission.html | 16 ++++++------ patchwork/views/__init__.py | 8 +++++- patchwork/views/cover.py | 5 ++++ patchwork/views/patch.py | 7 ++++++ patchwork/views/project.py | 25 ++++++++++++++++--- 9 files changed, 128 insertions(+), 15 deletions(-) create mode 100644 patchwork/migrations/0027_add_comment_date_index.py create mode 100644 patchwork/migrations/0028_add_list_covering_index.py create mode 100644 patchwork/migrations/0029_add_submission_covering_index.py
I think the lesson is that making dramatic improvements to performance of your Django based app does not mean you have to write a lot of code or revert to raw SQL or abandon your ORM. In fact, use it properly and you can get a looong way. It’s just that to use it properly, you’re going to have to understand the layer below the ORM, and not just treat the database as a magic black box.
Pingback: pwnm-sync: Synchronizing Patchwork and Notmuch | Ramblings