Introduction to Aggregates in Django

Sept. 11, 2009
Alex Gaynor
10 p.m.

With the 1.1 release of Django came several major new features, prominent among these is aggregation, which was among the most oft-requested for Django's ORM. The addition and manner of implementation of this feature highlights several fundamental philosophies within Django, as well as provides new possibilities. Fundamentally aggregates support in Django means a way to answer queries such as "show me all the authors who have published more than 1 book" or "show me the most expensive book".

One of the most important things to understanding how the aggregates API works is to understand that Django's ORM does not try to replace SQL, or provide an API to SQL in Python. The purpose of Django's ORM is to represent some persistent datastore, and have a way to put objects in and get them out. As such nothing in the API is SQL specific, and you will never see a discussion of how to get a GROUP BY or HAVING clause in your query. Instead the discussion around aggregates is centered around answering two different kinds kinds of questions, one is "What is the maximal[or minimal, or average, or count, etc.] value of some field in this group of objects", the other is "For each of the objects in this group, what is the maximal[or minimal, or average, or count, etc.] of some other group of objects that is related". We can think of these types of queries in terms of what they return, the first returns a scalar value, while the second returns a group of objects(in our case a QuerySet) that carry with them some additional information. In the case of the second type of query, since all that additional data exists in our backend we can do all the type of operations we would expect, such as further filtering, ordering, or even computing another aggregate over those values.

Let's consider the first type of query, since they are slightly simpler. For the purpose of our examples we are going to work with two models:

from django.db import models

class Author(models.Model):
    name = models.CharField(max_length=100)

    def __unicode__(self):
        return self.name

class Book(models.Model):
    authors = models.ManyToManyField(Author)
    title = models.CharField(max_length=200)
    price = models.DecimalField(decimal_places=2, max_digits=6)

    def __unicode__(self):
        return self.title

These are two relatively simple models with which we can demonstrate most of the features of aggregation. For example, a simpler query might be how much does the most expensive book cost?:

>>> from django.db.models import Max
>>> Book.objects.aggregate(max_price=Max('price'))
{'max_price': Decimal('56.49')}

There are a few important things to note here. First, aggregate() is a sentinel method on a QuerySet, that means that unlike a lot of other methods on a Queryset you can't chain more methods afterwords. Secondly, aggregate() returns a dictionary, mapping the alias provided in aggregate to the result value. Aggregates itself takes any number of keyword arguments with the keyword being the alias, and the value being the aggregate itself. You can also give aggregate positional arguments, in which case the alias is a default one constructed based on what field is being aggregated on, and what type of aggregate is being preformed.

Out of the box Django provide's support for seven types of aggregates, sum, maximum, minimum, average, count, standard deviation, and variance. It is also possible to create your own aggregation classes, which modules like GeoDjango (django.contrib.gis) take advantage of. Each aggregate class is instantiated with a string that refers to the field the aggregation should be preformed over, with full support for the "__" syntax to refer to related fields, as seen elsewhere in Django's ORM.

The second type of query we can do preform operations for each item in the QuerySet. So for example we might ask how many author's each book has:

>>> from django.db.models import Count, Max
>>> books = Book.objects.annotate(num_authors=Count('authors'))
>>> books
[<Book: Pro Django>, <Book: Practical Django Projects>, <Book: The
Definitive Guide to Django>]
>>> [book.num_authors for book in books]
[1, 1, 2]

We might only want books that have more than one author::

>>> Book.objects.annotate(num_authors=Count('authors')).filter(num_authors__gt=1)
[<Book: The Definitive Guide to Django>]

Or we might want to know what the greatest number of authors any book has::

>>> Book.objects.annotate(num_authors=Count('authors')).aggregate(max_authors=Max('num_authors'))
2

There are several features that distinguish the annotate() method from aggregate() one. First, it returns another QuerySet, second, each object it returns is a normal Model instance, except it has extra attributes corresponding to the aggregates were requested. However, like the aggregate() method it takes any amount of keyword or positional arguments which are handled in the same way. The QuerySet can be further manipulated, but fundamentally what annotate() does is give us access to an extra value on each object.

Strictly speaking these were all computations that we could have solved before in pure Python, however there are a number of distinct advantages to doing these at the datastore level. First, it's going to be faster, compared to our datastore Python is going to be very slow to do these calculations, our datastore is built to do these computations with large numbers of records, so we should let it do its job, second, it saves bandwidth, to do a calculations like preform an annotate, sort by its result, and take a subset of that data in Python we would need to pull in every single record, depending on the size of our dataset this could mean pulling in millions of records, which is unfeasable.

The important point to take away is that when trying to figure out how to write a query using the Django ORM it is most important not to think of the query in terms of what the SQL would look like, but instead in terms of what question are we trying to answer, and from there try to figure out what aggregations or annotations we need, and what filtering or ordering we need to preform, and what slicing we need to do. By following these steps it becomes much easier to put together queries that answer the questions that we need answered.

Alex Gaynor is a computer science student at Rensselaer Polytechnic Institute. Right now he works for Eldarion. You can find him around the internet on Twitter or Github.