Add caching to speed up your site: Even if you think it isn’t needed
In this post, we will explore solving the n+1 problem using caching. The tools used are the Django ORM, which makes the problem easy to run into, and the Cachalot library.
Writing systems at a start-up
As Donald Knuth famously stated,
Premature optimization is the root of all evil.
Along these lines, a startup has daily-changing business requirements, so it is pointless to optimize code today that might be rewritten (sometimes completely) tomorrow.
Reason for this blog post
This blog post explores whether a plug-and-play caching solution is worth adopting, and whether it can provide a real performance advantage, while a product is still at a nascent stage of development.
My first thought in the context of LeanAgri was: will caching even make sense? To give an idea, LeanAgri’s backend is modelled as a normalized MySQL DB in order to have data consistency. Further, it is a permissioned model where users can only view and modify their own data. My idea was that it would be fairly pointless to think about caching in a scenario where data is read so sparsely.
As we go ahead, we will see how the data proved me wrong.
Badly written queries
What I didn’t account for in the above assumption was that our code was badly written. We wanted to get off the ground as quickly as possible, so we never thought about optimizing the DB queries we make.
We are hitting the n+1 problem in various places. Let’s try to understand what the n+1 problem is with an example. We have the following models, where the Schedule model is related to the Crop and Fertilizer models.
from django.db.models import Model, ForeignKey, ManyToManyField, CASCADE

class Crop(Model):
    pass

class Fertilizer(Model):
    pass

class Schedule(Model):
    crop = ForeignKey(Crop, on_delete=CASCADE)
    fertilizers = ManyToManyField(Fertilizer)
With the above models, we want to get all related fertilizer data for all schedules. How can we do that? In Django ORM syntax, we simply do this (the code should be fairly similar for any other ORM as well):
for schedule in Schedule.objects.all():
    # One extra query per schedule to fetch its fertilizers
    for fertilizer in schedule.fertilizers.all():
        print(fertilizer.get_data())
Now let’s try to analyze the above code by putting our “Optimization hats” on:
- We make 1 query to get all schedules
- We make 1 query per schedule to get all of its fertilizers!!!
Thus, for N schedules we end up making N+1 queries, which is where the problem gets its name. Getting data for each schedule is going to be extremely costly for our database.
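If you want to see this for yourself, here is a small sketch (my own illustration, not part of the original measurements) that counts the queries Django makes. It assumes DEBUG=True, since Django only records executed queries in debug mode:

from django.db import connection, reset_queries

reset_queries()
for schedule in Schedule.objects.all():
    list(schedule.fertilizers.all())  # one extra query per schedule

# With DEBUG=True, Django logs every executed query:
print(len(connection.queries))  # prints 1 + N for N schedules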
The correct way to fix it: Throw away ORMs
ORMs are often considered leaky abstractions, but they really do help speed up the entire process of software development, which is extremely crucial for a startup like ours.
Fixing the above example using JOINs is fairly easy, but in a non-intuitive way: since fertilizers is a many-to-many field, we use prefetch_related (select_related only follows foreign keys), which batches all the fertilizers into a single JOINed query.
for schedule in Schedule.objects.prefetch_related('fertilizers'):
    for fertilizer in schedule.fertilizers.all():  # served from the prefetch, no extra query
        print(fertilizer.get_data())
The above takes just 2 queries to fetch everything: one for the schedules and one JOINed query for all their fertilizers, regardless of how many schedules there are.
There are various advantages of using the above:
- Get all the data in a fixed, small number of queries, no matter how many rows there are. Relational DBs are extremely efficient at this kind of work.
- Duplicate queries don’t get executed. For example, different schedules can share the same fertilizers, and the JOINed fetch removes any duplicate queries (the sketch below verifies the query count).
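Here is a quick, hypothetical way to verify that the query count stays constant, using Django’s CaptureQueriesContext test utility:

from django.db import connection
from django.test.utils import CaptureQueriesContext

with CaptureQueriesContext(connection) as ctx:
    for schedule in Schedule.objects.prefetch_related('fertilizers'):
        list(schedule.fertilizers.all())  # no query; data came from the prefetch
print(len(ctx.captured_queries))  # 2, no matter how many schedules exist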
Are we going to throw away our ORM?
Absolutely not. At least at this stage, where we are rapidly creating more and more business logic, it makes no sense to sit and optimize each and every query when doing so is practically going to add very little value.
Is the above bad code? Yes, it is.
Is it inevitable? Again, yes it is. If you are working with something like Django REST framework to create CRUD APIs from ORM models, it is going to generate queries like the above everywhere.
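As an illustration (a hypothetical sketch using the models above, not our actual serializers), a nested ModelSerializer recreates exactly the n+1 pattern unless the view’s queryset prefetches the relation:

from rest_framework import serializers

class FertilizerSerializer(serializers.ModelSerializer):
    class Meta:
        model = Fertilizer
        fields = '__all__'

class ScheduleSerializer(serializers.ModelSerializer):
    # Serializing a list of schedules runs one fertilizer query per
    # schedule, unless the queryset used prefetch_related('fertilizers').
    fertilizers = FertilizerSerializer(many=True, read_only=True)

    class Meta:
        model = Schedule
        fields = '__all__'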
Important disclaimer: This is not an excuse to write bad SQL queries, but it does become hard to optimize everything when you are working on a lean development cycle.
Solving the n+1 problem using Caching
Now we want to analyze whether we can optimize the above queries with a simple plug-and-play caching solution, one that adds nothing but a middle layer between our application and the database.
We chose the Cachalot library based on the interestingly detailed research its author presents in the documentation.
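For context, wiring Cachalot up really is plug and play. Here is a minimal sketch of the settings (the Memcached location is a placeholder for wherever your cache actually runs):

# settings.py
INSTALLED_APPS = [
    # ... our existing apps ...
    'cachalot',  # caches ORM queries and invalidates them on writes
]

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',  # placeholder address
    },
}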
A caching system seemed like it would perform great in our case, especially with correlated and duplicate queries. A note on the numbers below: these are in no way a proper benchmark, since the Django + uWSGI setup, the SQL server, Memcached, and the Chrome instance loading the APIs were all running on my i7 laptop. However, looking at htop while testing, all cores were underutilized and a lot of RAM was free, so these numbers should be fairly close to real-world.
Test setup
My test setup has 2 queries which are correlated, i.e. they share some related foreign keys. This setup is run under the following conditions:
| Setup | Query A | Query B run after Query A (correlation) | Query A run again (duplicate queries) | Remarks |
| --- | --- | --- | --- | --- |
| No cache (unoptimized) | 7s (550 queries) | 7s (530 queries) | NA | Extremely bad, isn’t it! |
| No cache (optimized using JOINs) | 450ms (26 queries) | 450ms (26 queries) | NA | Wow, this blew through the roof. Maybe using ORMs is evil after all! |
| Cache (unoptimized) | 3s (180 queries) | 2.5s (150 queries) | 850ms (0 queries) | This is where it gets interesting. All queries improve dramatically against using no cache. |
| Cache (optimized using JOINs) | 450ms (26 queries) | 500ms (24 queries) | 166ms (0 queries) | As expected, caching brings no improvement to JOINed queries other than for duplicate queries. |
Conclusions
From the above table, it is very clear that nothing beats writing the right queries, but adding a plug-and-play ORM cache does make life considerably better, cutting queries to roughly a third (550 → 180) and improving speed by more than 50% (7s → 3s).
But wait, why does Query A improve even without the cache being warm?
Remember the duplicates problem? With the cache in place, we avoid repeating queries for the related foreign keys within Query A itself. In the same way, Query B’s query count also drops a lot simply because Query A’s objects are already in the cache. In a case such as LeanAgri’s, where queries are going to be sparse but correlated, this is good news.
Actually, aren’t the repeat queries worse in this case?
Yes, they are, for a very simple reason: we are making many queries to our caching system instead of just a few. We are substituting 500 SQL queries with 500 cache queries. That is faster, but not as fast as making 25 cache queries.
Can we optimize it further?
There is another plug-and-play mechanism which we can explore for speeding up the duplicates problem; I will briefly explain it here. We know that our cached objects are likely to be highly reusable in the future too: take fertilizers in the above example, where every schedule is likely to contain the same fertilizers. Instead of going to the cache multiple times, we can maintain details of all fertilizers in application memory. This is called layered caching, with the first layer being application memory and the second a (possibly remote) Memcached or Redis layer. A simple implementation is here by Ben Kamens from Khan Academy. I won’t go there yet, since we don’t need it at LeanAgri currently.
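To make the idea concrete, here is a minimal, hypothetical sketch of a two-layer lookup (my own illustration, not Ben Kamens’ implementation). Note it does no invalidation, so it only suits rarely-changing data like our fertilizers:

from django.core.cache import cache

_local = {}  # layer 1: application (process) memory

def layered_get(key):
    # Cheapest lookup first: in-process memory.
    if key in _local:
        return _local[key]
    # Fall back to layer 2: the (possibly remote) Memcached/Redis cache.
    value = cache.get(key)
    if value is not None:
        _local[key] = value
    return value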
Let’s finally conclude, shall we?
Sure!