Or you can use a database where join performance isn't pathological, and scale out by adding read slaves.
Caching is hard. Caching as an automated 'layer' is much harder. If it were possible to cache in a general way, databases would do it already. Adding a 'caching layer' is opening a gate into hell. The thing is, at first it will be fine. Thousands of engineer hours and hundreds of subtle bugs later, you'll (if you're wise) realize that opening that door let out many demons, you just didn't have the eyes to see them at the time.
"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton
This, so much. Caching should never be a "5 minute" thing and should only be considered once all other options are exhausted. Once teams start to go down this road, they should also begin working on a longer-term strategy for denormalization and scalable data storage.
MySQL already caches queries automatically, via its internal query cache. However, the scale of that caching is limited, in part because it's difficult to scale MySQL horizontally in the same way as Memcached or other distributed caching systems.
Adding read slaves is a great option on many levels. Adding slaves, however, is not without overhead, and stale data and programming complexity exist there too.
The caching that MySQL does is effectively useless unless all of your queries deserve equal priority (weighted by their individual footprint) for memory. That's almost never the case. I don't want relatively low-impact, rare reads to take up space that I'd prefer to use for high-impact, common reads, for instance.
It's nice that MySQL caches queries, but it doesn't solve the same problem that an application-level cache does.
The timing of this post is funny, as I just got finished reworking our fork of django-cache-machine. As the post points out, the limitation of Cache Machine as it is currently built is that only objects which are already within a queryset can invalidate that queryset. This is fine for selects on primary keys, but beyond that the invalidation logic is incomplete.
My changes ( https://github.com/theatlantic/django-cache-machine/ ) inspect the ORDER BY clauses (if there's a limit or offset) and the WHERE constraints in the query, and save the list of "model-invalidating columns" (i.e., columns which, when changed, should broadly invalidate queries on the model) in a key in the cache. They also associate these queries with a model-flush list. Then, in a pre_save signal, the code checks whether any of those columns have changed and marks the instance to invalidate the associated model-flush list when it gets passed into the post_save signal. We have these changes live on a few of our sites and, if all goes well, we're looking to move invalidation to the column level to make the cache-hit ratio even higher.
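The core of that pre_save-style check can be sketched in plain Python. This is an illustrative sketch only, not the fork's actual API: the function name `mark_for_flush` and the example column set are hypothetical, and the real list of invalidating columns would be derived from the cached queries' WHERE/ORDER BY clauses.

```python
# Sketch of the pre_save check described above: if any column that
# broadly invalidates cached queries on a model has changed, mark the
# instance so the model's flush list is invalidated after the save.

# Hypothetical example data: columns whose changes should invalidate
# cached querysets on this model.
INVALIDATING_COLUMNS = {"status", "published_at"}

def mark_for_flush(old_values, new_values, invalidating_columns):
    """Return True if any model-invalidating column differs between
    the stored row (old_values) and the instance about to be saved."""
    return any(
        old_values.get(col) != new_values.get(col)
        for col in invalidating_columns
    )

old = {"status": "draft", "published_at": None, "title": "A"}
new = {"status": "published", "published_at": "2012-01-01", "title": "A"}
print(mark_for_flush(old, new, INVALIDATING_COLUMNS))  # True: status changed

retitled = {"status": "draft", "published_at": None, "title": "B"}
print(mark_for_flush(old, retitled, INVALIDATING_COLUMNS))  # False: only title differs
```

In the real signal handlers, a True result would set a flag on the instance in pre_save, and post_save would then delete the model-flush list from the cache.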
Took a look at the Django Cache Machine they mention, specifically at the invalidation scheme (http://jbalogh.me/projects/cache-machine/#cache-manager). It stores a "flush list" linking objects to their originating, cached queries. Interesting, though it looks like something that could get out of hand quickly - are they storing the "flush list" itself in the cache (else how do all nodes learn of the invalidation)? That's interesting, though a little creepy (the list gets very large, as they appear to be keying on the SQL itself?). Then they have the invalidation flow along all the relationships - maybe that's OK, but maybe it leads to a lot of excessive invalidation. They also have a notion of how to avoid a certain kind of race condition there, caching ``None`` instead of just deleting, but it's not clear to me how that helps - you really need a version ID there if you want to prevent that particular condition (else thread 1 puts None, thread 2 sees None and puts a new value there, and thread 3, which started before both of them, never sees the None and puts stale data in).
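The version-ID fix described above can be sketched as a check-and-set: a writer may only store a result tagged with the version it read before computing that result, so a stale thread's write is rejected once an invalidation has bumped the version. This is a toy in-process sketch under assumed names (`invalidate`, `put`, etc.); a real deployment would get the same effect from memcached's gets/cas operations.

```python
# Sketch: instead of caching None on invalidation, bump a version token.
# Writes carrying an out-of-date version are rejected, which closes the
# race where a slow thread overwrites fresh data with stale data.

cache = {}     # key -> cached value
versions = {}  # key -> current version token

def invalidate(key):
    """A row changed: drop the cached value and bump the version."""
    versions[key] = versions.get(key, 0) + 1
    cache.pop(key, None)

def read_version(key):
    return versions.get(key, 0)

def put(key, value, version_seen):
    """Store value only if no invalidation happened since version_seen."""
    if version_seen == versions.get(key, 0):
        cache[key] = value
        return True
    return False

# Thread 3 starts first: reads version 0, then stalls mid-query.
v_stale = read_version("q")
# Thread 1 invalidates (a row changed); version is now 1.
invalidate("q")
# Thread 2 reads fresh data at version 1 and stores it.
v_fresh = read_version("q")
put("q", "fresh-result", v_fresh)
# Thread 3 wakes up and tries to store its stale result: rejected.
print(put("q", "stale-result", v_stale))  # False
print(cache["q"])  # fresh-result
```

Caching ``None`` alone can't distinguish "invalidated after I started" from "never cached", which is exactly the gap the version token closes.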
Really, if you're caching SQL queries and such, you should be doing little to no modification of cached data - this library makes it seem "easy", which it's not.
Your point is well taken; however, as the writer of the article, I'd say this: our business requirements dictate that slightly inconsistent data is acceptable in certain circumstances. We would not retrieve objects through the caching flow in situations where absolute data integrity is required.
That point makes all the difference, in my mind. To your point about excessive invalidation, that depends on your read/write workload.
Well, if I have a system where inconsistent data is OK, I just let the normal cache expiration logic handle that. Turning expiration times down to a few minutes, or even 30 seconds, can still take a lot of load off particular "hot" objects that might be fetched multiple times in quick succession (such as by a series of AJAX requests), while making it unlikely that large inconsistencies will show up on the screen.
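The expiration-only approach above can be sketched as a tiny TTL cache. This is a toy in-process illustration (the `TTLCache` class and injectable clock are assumptions for the example, not any real library's API); in practice you'd just pass a short timeout to memcached or Django's cache backend.

```python
import time

class TTLCache:
    """Toy expiration-based cache: entries simply age out, so small
    inconsistencies disappear within the TTL window without any
    explicit invalidation logic."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for deterministic tests
        self._store = {}            # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]    # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)

# Fake clock so the example is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: now[0])
cache.set("hot-object", {"id": 1, "views": 100})
print(cache.get("hot-object"))  # {'id': 1, 'views': 100}: served within the TTL
now[0] += 31                    # 31 "seconds" later
print(cache.get("hot-object"))  # None: entry expired, next read hits the DB
```

The trade-off is exactly the one described: a burst of reads within the TTL window all hit the cache, and staleness is bounded by the TTL rather than by invalidation correctness.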
The top commenter in OP gave a great rundown of these projects and their evaluation of them at YCharts at a Django NYC meetup a month-ish ago; I'm sure his slides are available on the nyc django site somewhere.
All of these projects "automatically" manage the cache for querysets, but they do it in different ways, and each can be susceptible to poor performance under different usage patterns.
From what I can tell, JC adds the lowest amount of overhead to cache misses and hits, and uses the simplest (it's mildly sophisticated, but still straightforward) management algorithm. It's the only one that works fine when using UPDATE queries that do not mention row ids, and (as a result) is the one that most greedily invalidates on writes.
The others are fine projects run by smart people, and depending on your site's situation, I'd recommend some of them over johnny-cache. It's a good idea to evaluate them all, as they certainly did at YCharts (his section on JC was very accurate), and as OP seems to have done.