Shivani Sharma — September 1, 2021
Beginner Data Engineering Programming Python SQL Structured Data

This article was published as a part of the Data Science Blogathon

Objective

In this article, I am going to discuss DjangoORM and topics included as:

  • let’s look at the data schema (raw SQL);

  • Let’s describe this schema using Django models;

  • Let’s get acquainted with a few tricks for easy debugging;

  • and explore examples of queries.

During our introduction to queries, we will pay close attention to the raw query that the libraries generate because it is the final SQL code that determines the performance of queries and their number.

Schema

Let’s start with the data schema we are going to work with.

Suppose we have an application where users (User) can ask questions (Question) within different topics (Topic). The theme has a picture (Image). The user’s access to the topic is determined through the many-to-many relation TopicUser, which, in addition to the ForeignKey (FK) for the user and the FK for the topic, contains additional information: the user’s role in the topic.

Data schema. | SQL in DjangoORMImage 1

You can create a database with the following SQL:

CREATE TABLE "image"
(
    "id"   serial NOT NULL PRIMARY KEY,
    "name" text   NOT NULL
);
CREATE TABLE "topic"
(
    "id"       serial  NOT NULL PRIMARY KEY,
    "title"    text    NOT NULL,
    "img_id" integer REFERENCES "image" ("id") 
);
CREATE TABLE "user"
(
    "id"   serial NOT NULL PRIMARY KEY,
    "name" text   NOT NULL
);
CREATE TABLE "topic_user"
(
    "id"       serial  NOT NULL PRIMARY KEY,
    "role"     text    NOT NULL,
    "top_id" integer REFERENCES "topic" ("id"),
    "use_id"  integer REFERENCES "user" ("id") 
);
CREATE TABLE "question"
(
    "id"       serial  NOT NULL PRIMARY KEY,
    "text"     text    NOT NULL,
    "top_id" integer REFERENCES "topic" ("id") 
);

Django models

Now let’s describe this scheme inside the ORM:

from django.db import models
class Image(models.Model):
class Meta:
        db_table = 'image'
    name = models.TextField()
class Topic(models.Model):
class Meta:
        db_table = 'topic'
    title = models.TextField()
    image = models.ForeignKey(Image, on_delete=models.DO_NOTHING)
    users = models.ManyToManyField('User', through='TopicUser')
class Question(models.Model):
class Meta:
        db_table = 'question'
    text = models.TextField()
    topic = models.ForeignKey(Topic, on_delete=models.DO_NOTHING, related_name='questions')
class TopicUser(models.Model):
class Meta:
        db_table = 'topic_user'
    topic = models.ForeignKey(Topic, on_delete=models.DO_NOTHING)
    user = models.ForeignKey('User', on_delete=models.DO_NOTHING)
    role = models.TextField(max_length=64)
class User(models.Model):
class Meta:
        db_table = 'user'
    name = models.TextField()

All models inherit from models.Model. I Meta.db_tableadvise you to explicitly specify the names of the tables, otherwise Django will generate them for you on its own, and such names are not always what you need. models.ForeignKeywill spawns a title column _id. And you can always refer to a field with such a name, although we do not explicitly indicate it when describing the model.

Django has a powerful built-in migration tool. In most cases, it does exactly what you expect it to do. But it is better to always check the SQL query that will be executed by the migration.

// generate migration
python manage.py makemigrations // migration with number 0001 will appear
// see what SQL will be executed in the database when migrating
python manage.py sqlmigrate topics 0001

Preparation

In addition to the models that represent tables in the code of our application, we will need an additional configuration of the libraries so that we can see in the log all the SQL queries that will be sent to the database for execution. It is extremely useful to look at the SQL code obtained from ORM queries because it is not always possible without proper experience to immediately achieve exactly the behaviour that we expect from the library.

It is convenient to test our requests inside tests (there is no need to send requests to the application and it is convenient to use the debugger built into the IDE). For Django to display SQL queries, several manipulations are required.

First, add a logger to settings.LOGGING:

       'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console'],
            'propagate': False,
        },

Also, set the console to debug level to the handler DEBUG.

Secondly, you need to decorate the test:

from django.test import override_settings
@override_settings(DEBUG=True)
def test_i(self):

In the following examples, the given code will be just a block. I often use these settings when running a project locally during development to see if the queries are written optimally.

We are ready to start writing requests.

 

Example 1

We want to get the last ten questions from a specific topic:

SELECT *
FROM question
WHERE topic_id = 1
ORDER BY id DESC
LIMIT 10;
questions = Question.objects.filter(topic_id=self.t1_id).order_by('-id')[:10]

In the console, we will see the SQL, which they themselves wrote above, with some modifications introduced by the ORM. Notice how unusual (at least for me) it looks limit in this ORM.

Example 2

Now we want to find in the first topic such questions, the text of which either contains a sub-string factor or does not contain a mention dog, regardless of their case.

We can easily portray this in SQL:

SELECT *
FROM question
WHERE (
  topic_id = 1 AND (
      text ILIKE '%fart%' OR
NOT (text ILIKE '%dog%')
  )
);

Django provides all kinds of field lookups for these purposes, one of which we will now use:

questions = Question.objects.filter(
    Q(topic_id=1),
    Q(text__icontains='fart') | ~Q(text__icontains='dog')
)

But if you do not write anything else in the test, then you will not see the SQL representation of our query in the console, although all queries sent to the database should be displayed there. The point is that Django query sets are lazy. That is until the application needs real data from the database, the request will not be executed. And it is executed, for example, during iteration, taking a length or a slice (as in the previous example).

I also want to note that when using a debugger (for example, the built-in PyCharm), the result may differ, because the debugger can provoke an earlier request to the database. Due to this, it can show us the contents of the object. Don’t be fooled!

So, let’s add something to the example that will cause the request to be executed, and we will see in the console:

SELECT "question"."id", "question"."text", "question"."topic_id"
FROM "question"
WHERE (
    "question"."topic_id" = 1 AND
    (
UPPER("question"."text"::text) LIKE UPPER('%fart%') OR
NOT (UPPER("question"."text"::text) LIKE UPPER('%dog%'))
    )
);

Although the description of Icontains says that using it in the code will lead to using ILIKE, while writing this article, I got the option from time to time with UPPER. This is one of the many arguments for validating the queries that the ORM generates. In Python code, the negation of a condition through ~and linking of conditions through are worth mentioning.

 

Example 3

The third example brings us to the topic of query optimization. Let’s say we want to get data not only of the question itself but also of the related topic. In Django, you don’t have to put in any effort!

q = Question.objects.filter(id=self.q1_id).first()  # dj.3.1
print(q.topic.title)  # dj.3.2

But if you look in the log, you will see an additional request to the database for the topic. That is, and dj.3.1, and dj.3.2generate queries to the database. Often we know in advance that we will need a related entity. In this case, you can ask the ORM at the first request to “pull up” the topic using a JOIN.

q = Question.objects.select_related('topic').filter(id=self.q1_id).first()  # dj.3.3
print(q.topic.title)  # dj.3.4

Now only one request goes to the database. It contains a JOIN, through which the ORM receives the topic data.

SELECT "question"."id",
       "question"."text",
       "question"."topic_id",
       "topic"."id",
       "topic"."title",
       "topic"."image_id"
FROM "question"
INNER JOIN "topic" ON ("question"."topic_id" = "topic"."id")
WHERE "question"."id" = 1
ORDER BY "question"."id" ASC
LIMIT 1;

We will get a particularly noticeable performance gain when we process questions in a loop. Without it, select_relatedwe would have received N + 1 requests, where N is the number of questions instead of one.

 

Example 4

This example will deepen our knowledge of the impatient loading technique. Let us not only need a topic related to the question, but also an image of this topic. As we’ve seen, Django allows us to retrieve related entities without additional steps:

for q in Question.objects.all():
    print(q.topic.title)  # dj.4.1
    print(q.topic.image.name)  # dj.4.2
But at what cost? This code calls two additional requests at each step of the cycle: dj.4.1 - for the topic, dj.4.2 - for the image of the topic. To avoid this, use again select_related:
for q in Question.objects.select_related('topic__image').all():  # dj.4.3
    print(q.topic.title)
    print(q.topic.image.name)

So we will “load” both the topic and the corresponding image. We received a more complex request with two JOINs, but only one trip to the base instead of (1 + 2 * N).

SELECT "question"."id",
       "question"."text",
       "question"."topic_id",
       "topic"."id",
       "topic"."title",
       "topic"."image_id",
       "image"."id",
       "image"."name"
FROM "question"
INNER JOIN "topic" ON ("ques"."top_id" = "top"."id")
INNER JOIN "image" ON ("top"."img_id" = "img"."id");

It is worth taking a closer look at JOINs here. Both table joins use INNER JOIN. This is a consequence of the fact that FK is declared as not nullable. If, when describing a topic with an image, you clearly indicate the possibility of the absence of a picture:

image = models.ForeignKey(Image, on_delete=models.DO_NOTHING, null=True)

Example 5

Now let’s say we want to handle all the questions for each topic. Django helps us here with its magic, and we can go through the array of questions completely artlessly.

for t in Topic.objects.filter(id__in=(self.t1_id, self.t2_id)):  # dj.5.1
    print([q.text for q in t.ques.all()])  # dj.5.2

Note that in dj.5.2we refer to the questions specified as related_name when describing the topic model field Question. By default, such sets are named FOO_set, where FOO is the name of the table defining the ForeignKey. The log shows that dj.5.2generates an additional request to the database for all the questions of the topic. And this happens for every topic. This can be avoided:

for t in Topic.objects.filter(id__in=(self.t1_id, self.t2_id)).prefetch_related(‘questions’): # dj.5.3
print([q.text for q in t.ques.all()]) # dj.5.4

With this use, the prefetch_relatedline dj.5.3will generate two queries to the database at once. The first one is for the desired topics:

SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic"
WHERE "topic"."id" IN (1, 2);
The second is for their questions:
SELECT "question"."id", "question"."text", "question"."topic_id"
FROM "question"
WHERE "question"."topic_id" IN (1, 2);

Due to the second query when iterating over the list (string dj.5.4), we will not have additional queries to the database. The payoff is 2 instead of (1 + N). For clarity, in the second request, the identifiers for the IN clause are taken from the first request. That is, the second request will always use the IDs from the first request.

Example 6

Suppose now we need to load not all questions to the topic, but only those that meet a certain condition. However, some topics may have an empty list of questions. In Django, it prefetch_relatedcan accept not only a string but also a special object that describes which entities to load:

topics_with_filtered_questions = Topic.objects.prefetch_related(Prefetch(
    'questions',
    queryset=Question.objects.filter(text__icontains='fart'),
    to_attr='filtered_questions'
)).all()
for t in topics_with_filtered_questions:
    print(f'{t.title}')
for q in t.filtered_questions:
        print(f' --- {q.text}')
Here we have put the filtered list into a new separate attribute:
SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic";
SELECT "question"."id", "question"."text", "question"."topic_id"
FROM "question"
WHERE (UPPER("question"."text"::text) LIKE UPPER('%fart%')
AND "question"."topic_id" IN (1, 2));

Example 7

In this example, we’ll take a look at grouping and subqueries. To do this, we will find topics that have more than ten questions. In Django, not the usual methods group and having specified in query set-e. Instead, annotations and conditions in them are used. To find IDs of topics that have more than 10 questions, you need to compose a query like this:

topic_ids = Question.objects.values('topic_id').annotate(
  qs_count=Count('id'),
).filter(qs_count__gt=10)
values - fields by which we group;
annotate - aggregate functions;
filter - HAVING by the field formed by the aggregate function.
Now we use this subquery in the main query:
topics = Topic.objects.filter(id__in=topic_ids.values('topic_id'))
for t in topics:
    print(f'{t.title}')

Note that we had to reuse the main query values to explicitly indicate that the subquery should only leave one column by excluding qs_count.

SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic"
WHERE "topic"."id" IN (
SELECT U0."topic_id"
FROM "question" U0
GROUP BY U0."topic_id"
HAVING COUNT(U0."id") > 10
);

Note: In this example, you will see two prompts when using the debugger.

Example 8

Let’s look at queries for many-to-many relationships here. For each topic, we want to get a list of users who have access to it. With a naive approach, for each topic there will be an additional request to the base after the list of users:

for t in Topic.objects.all():  # dj.9.1
    names = ', '.join(u.name for u in t.use.all())  # dj.9.2
    print(f'Topic[{t.title}]: {names}')
SELECT "topic"."id", "topic"."title", "topic"."image_id" FROM "topic";  -- dj.9.1
SELECT "user"."id", "user"."name"
FROM "user" INNER JOIN "top_user" ON ("use"."id" = "topic_user"."user_id")
WHERE "top_user"."top_id" = 1; -- dj.9.2
SELECT "user"."id", "user"."name"
FROM "user" INNER JOIN "top_user" ON ("use"."id" = "top_user"."use_id")
WHERE "top_user"."top_id" = 2; -- dj.9.2

If you specify prefetch_related, then users for all topics will be pulled up by one request. There will be two queries in total (the first is for topics) instead of (1 + N).

for t in Topic.objects.prefetch_related('users'):  # dj.9.3
    name = ', '.join(u.name for u in t.use.all())
    print(f'Topic[{t.title}]: {names}')

In this case, both requests are caused by the line dj.9.3:

SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic";
SELECT ("topic_user"."topic_id") AS "_prefetch_related_val_topic_id", "user"."id", "user"."name"
FROM "user" INNER JOIN "top_user" ON ("use"."id" = "top_user"."use_id")
WHERE "topic_user"."topic_id" IN (1, 2);

If you need information from the intermediate table, then you can refer to the automatically created field for us topicuser_set, which stores information from this junction table.

u = User.objects.get(id=self.u1_id)  # dj.9.4
for topic_info in u.topicuser_set.all():  # dj.9.5
    print(f'{u.name} is {topic_info.role} in topic "{topic_info.topic.title}"')  # dj.9.6
SELECT "user"."id", "user"."name"
FROM "user"
WHERE "user"."id" = 1;  -- dj.9.4
SELECT "topic_user"."id", "topic_user"."topic_id", "topic_user"."user_id", "topic_user"."role"
FROM "topic_user"
WHERE "topic_user"."user_id" = 1;  -- dj.9.5
SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic"
WHERE "topic"."id" = 1;  -- dj.9.6
SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic"
WHERE "topic"."id" = 2;  -- dj.9.6

But, as always, you need to be careful and load all the necessary data in advance. In this case, the benefit lies in the fact that within the framework of one request the necessary topics for each post will be pulled up topic_user, and not by request for each topic, as in the first option. The rest of the options are identical: in both, the user and the list are requested from the staging table.

u = User.objects.prefetch_related('topicuser_set', 'topicuser_set__topic').get(id=self.u1_id)
for topic_info in u.topicuser_set.all():
    print(f'{u.name} is {topic_info.role} in topic "{topic_info.topic.title}"')
SELECT "user"."id", "user"."name"
FROM "user"
WHERE "user"."id" = 1;
SELECT "topic_user"."id", "topic_user"."topic_id", "topic_user"."user_id", "topic_user"."role"
FROM "topic_user"
WHERE "topic_user"."user_id" IN (1);
SELECT "topic"."id", "topic"."title", "topic"."image_id"
FROM "topic"
WHERE "topic"."id" IN (1, 2);

Example 9

Let’s get down to the other side of optimization. When inserting a large number of new records, it is recommended to use bulk operations to improve performance. They allow you to insert many rows in one query.

Topic.objects.bulk_create([
    Topic(title=f'topic {i}', image_id=self.i1_id)
for i in range(10)
], batch_size=2)

If there is really a lot of data, do not forget to specify the parameter batch_size. But keep in mind that even if you use a generator, its entire result will be loaded into memory.

INSERT INTO "topic" ("title", "image_id") VALUES
  ('topic 0', 1),
  ('topic 1', 1)
RETURNING "topic"."id";

Please note that RETURNING all data generated by the database itself will be included.

Example 10

The other side of bulk operations is bulk updates :

User.objects.bulk_update(
    [Use(id=1, name='new dj_1 name'), Use(id=2, name='new dj_2 name')],
    ['name'],
)

The second argument here specifies the fields to be updated. Let’s take a look at the generated request:

UPDATE "user"
SET "name" = (
CASE
WHEN ("use"."id" = 1) THEN 'new dj_1 name'
WHEN ("use"."id" = 2) THEN 'new dj_2 name'
ELSE NULL END
    )::text
WHERE "user"."id" IN (1, 2);

It looks strange and, most likely, is not optimal, because for each line you have to calculate CASE. We usually use UPDATE … SET … FROM …. But even this form will save you a lot of time. Perhaps this is because DjangoORM positions itself independent of the database used. If someone dug in this direction deeper, I will be glad to hear an explanation.

Conclusion

These queries, of course, do not exhaust the capabilities of DjangoORM, but the examples shown will be a good starting bag of knowledge and will allow you to move with greater confidence on the path of mastering this library.

Most importantly, don’t limit yourself to the “magic” provided by libraries. Understanding the essence of their work is very useful because even with the right result, you can miss out on suboptimal resource consumption.

Image Source

  1. Image 1 – https://www.google.com/url?sa=i&url=https%3A%2F%2Fsavepearlharbor.com%2F%3Fp%3D321872&psig=AOvVaw39EFi1xNtPAd5t79JrEMZV&ust=1630465561691000&source=images&cd=vfe&ved=2ahUKEwiKxaDWo9ryAhVMIrcAHQXoDecQjRx6BAgAEAk

 

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

About the Author

Our Top Authors

  • Analytics Vidhya
  • Guest Blog
  • Tavish Srivastava
  • Aishwarya Singh
  • Aniruddha Bhandari
  • Abhishek Sharma
  • Aarshay Jain

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *