Working around MySQL’s horrible “ORDER BY rand()” in Django
Recently I’ve been working on a web 2.0ish community site written in Django. As is frequently the case with such sites I often need to create lists or collections of psuedo-randomly selected items. For example on a user’s profile page there may be a box showing a few of the user’s friends, another box with some of the user’s
pictures, or a box with random comments the user has made.
Ideally in this type of situation I would simply preform a query with what ever rules I need and a random “order by” statement such as the following:
pictures = User.picture_set.filter(private=False).order_by('?')[:6]
Which generates SQL similar to:
SELECT * FROM `pictures` WHERE private = 0 AND user_id = 5 ORDER BY rand() LIMIT 6
In an ideal world this works great, I only get back as much data as I need and the site stays interesting because what the user sees is constantly changing/rotating. Unfortunately unless you are using PostgreSQL or have less than a few hundred rows in the pictures table you are not in an ideal world. Under both MySQL 4 and 5 random ordering is horrifyingly inefficient. Adding that “ORDER BY rand()” can make a query that took a few milliseconds to run instead take tens of seconds. The problem gets even worse when you are working with a table containing millions of rows and thousands of rows which meet the criteria of your query. What is one to do?
The common solutions to this problem generally involve generating a list of random numbers and selecting the rows with matching primary key. This is of no use since there is no way to know in advance which primary keys meet our criteria. Another solution is to simply have the query return all of the values which meet our criteria. Again this is of no use since it is too inefficient and there would be way too much data being sent back from the database. A third solution would be to drop the “order by random” and take what ever the database gives us. This will work but why make your site worse just because of a deficiency in MySQL?
What I’ve come up with for this situation is a compromise between all of these solutions. First I figure out how many items I wish to display, for this example say let’s say six. Next I select a significantly larger number, this number will be the size of the pool which I will randomly select from. From here I will either do a select with a limit of my pool size. Finally I randomly select the number of objects I need from my pool using python’s random.sample function. This way I am able to efficiently get the data I need while keeping the site lively and interesting. Below is an example of doing this in Django:
# Get the required module
import random
# Get the pool of objects
picture_pool = user.picture_set.filter(private=False).order_by('?')[:100]
# Randomly select objects from the pool
pictures = random.sample(picture_pool, 6)
Finally if I need to do this with particularly large or complex objects I will select a list of ids for my pool using a custom query, use random.sample to reduce the pool to the number of objects I need, and then I select the full objects with this id list. See below for an example:
# Get the needed modules
import random
from django.db import connection
# Perform the custom SQL query
cursor = connection.cursor()
cursor.execute("""
SELECT DISTINCT id
FROM pictures
WHERE user_id = %d
AND private = 0
LIMIT 100""" % user.pk)
rows = cursor.fetchall()
limited_rows = random.sample(rows, 6)
# Format the ids into a useful list
picture_ids = []
for row in limited_rows:
picture_ids.append(row[0])
# Get the full objects
pictures = user.picture_set.filter(id_in=picture_ids)
So there it is, a way to finally work around that nasty ORDER BY rand() in MySQL with Django. Know a better solution? See a way to improve this one? Leave a comment and let me know!

Is it ok to use pictures = User.picture_set.filter(private=False).order_by(‘?’)[:6] with postgres?
You missed a chance to use my favourite Python-ism: list comprehensions. Last few lines could be written like this:
rows = cursor.fetchall()
picture_ids = [row[0] for row in random.sample(rows, 6)]
pictures = user.picture_set.filter(id__in=picture_ids)
I made a blog post about pretty much the same thing a few days ago. Might be an interesting read
http://blog.buffis.com/?p=64
Martin: As far as I know Postgres is smarter/faster at doing randomly ordered selects but do some experimentation. If you get the performance you are looking for then great, otherwise this solution may be for you.
David: Very cool, there are actually a few places in my code where I do that, just didn’t happen to come across it when I grabbed this example. Thanks for the tip.
Bjorn: Also a cool solution, nice work.
Sean O'Connor said this on January 27, 2008 at 11:02 pm |