Master SQLAlchemy Relationships in a Performance Friendly Way
March 09, 2020Pauline Huguenel6 min read
In a web application, performance is as important as any functional feature. In my current project, we experienced troubles with this aspect as we handled big databases with complex models and relationships. It worked fine when we tested the product with future users. But as soon as it got into production, the volume of data increased significantly. That's when the performance issues came out. We learned some useful stuff while fixing those issues. I will share them with you in this article.
First, each access to the database costs time, so you want to emit as few SQL statements as possible. The key to keeping this number stable when your data volume increases is to have a constant number of SQL statements emitted no matter the number of objects retrieved.
When handling objects with relationships, the query result is composed of nested objects and arrays. We learned the hard way that in those cases, serializing the query result to python object is an expensive operation. The easiest way to reduce the cost of serialization is to only retrieve the objects you need, nothing more.
Using the right loading technique for your relationships will help you reduce the number of SQL requests you emit. Also, it will allow you to retrieve only the relationships you need at the moment. Combined with better handling of query filters, it will allow you to reduce the serialization time of your queries.
1. Limit the number of queries emitted
SQLAlchemy provides several loading techniques to retrieve your relationships. You can specify the one you want to use when declaring a relationship, with the parameter
There are 6 different loading methods. Here are, in my opinion, the 2 more useful :
- Lazy loading: This technique is the default one. It will be used if you omit the
lazyparameter in your relationship or if you specify
lazy="select". This way of loading emits a separate query when first accessing the attribute.
class Parent(Base): __tablename__ = 'parent' id = Column(Integer, primary_key=True) child_id = Column(Integer, ForeignKey('child.id')) child = relationship("Child", lazy="select") # emits a SELECT statement to get the parent object parent = session.query(Parent).first() # emits a second SELECT statement to get the child object child = parent.child
- Eager loading: The eager loading technique, also called joined loading, specify that you always want to retrieve the relationship. When accessing the parent element, SQLAlchemy will also join the relationship table. This way, when accessing the relationship attributes, it doesn't emit another SQL statement. The relationship is loaded when accessing the parent object.
class Parent(Base): __tablename__ = 'parent' id = Column(Integer, primary_key=True) child_id = Column(Integer, ForeignKey('child.id')) child = relationship("Child", lazy="joined") # emits a SELECT statement to get the parent object and its children parent = session.query(Parent).first() # does not emit a second SELECT statement as child object is already loaded child = parent.child
Choosing the right loading technique depends on when and how you use your relationships. As I said earlier, you want the number of requests to remain constant when the data volume evolves. That is why using the default loading technique can be bad for your app performance. If you retrieve multiple objects and their relationships, it is better to use the eager loading technique, as you will emit only one SELECT statement, no matter the number of objects you retrieve.
The trick here is that when developing a feature, you often manipulate a small number of objects. The lazy loading technique gets slow only when reaching a certain amount of data. So, measure the execution time with a realistic data volume. Also, if you know that the number of objects you manipulate will grow in time, you would better use the eager loading technique so the number of queries stays constant.
Be careful though not to use the eager loading technique too much. As I said before, retrieving useless objects is expensive. That's why I rarely choose this loading technique directly on the relationship declaration. I prefer to choose it on a query level, only when I'm sure I need the relationship. I will explain in the next section how you can achieve that.
2. Do not retrieve useless objects
As I said before, the serialization of query results in python objects is expensive. We noticed that executing the same SQL query on the database and with the ORM did not take the same time. It was faster on the database than with the ORM. The difference between the two is that the result on the database is raw data, while with the ORM you get nested objects. So we deduced that this was indeed the serialization that was increasing the request time.
There are two main things you can do to limit that :
- Only retrieve object relationships when you need them. The deeper you go on the data you retrieve, the longer it will take to serialize the query result. You can do that by choosing your loading technique on a query level.
- Do not wait to have python objects to filter the one you need. Likely, you can directly filter the data in the query, even if you need to filter by a relationship attribute.
To illustrate those two points, I will use the following example. You can see two models with one-to-one relationships, including a self-referential relationship.
class Person(Base): __tablename__ = 'person' id = Column(Integer, primary_key=True) firstname = Column(Text) lastname = Column(Text) address_id = Column(Integer, ForeignKey('address.id')) address = relationship('Address') manager_id = Column(Integer, ForeignKey(person.id)) manager = relationship('Person') class Address(Base): __tablename__ = 'address' id = Column(Integer, primary_key=True) number = Column(Integer) street = Column(Text) city = Column(Text) country = Column(Text)
Choose the loading technique on a query level
If you look at the example, you will notice that I did not specify a loading technique on the relationships. As I mentioned earlier, it is better to declare the loading technique in the query than in the relationship. This way, you can use different techniques depending on your needs. It will prevent you to retrieve relationships when you don't need them just because you set eager loading as the default technique.
Imagine you usually manipulate the
Person object without their managers. The lazy loading is a good loading technique as you won't retrieve relationships you do not need. But then you want to return the list of all Persons with their managers for some purpose. You will want to eager load the managers with the persons so that you only make one
SELECT statement. Here is how to do it :
joinedload option here has the same effect as if it was set in the relationship declaration. The difference is just that is will take effect only in one query, so you can use only if you need to get the relationship along with the main object.
The load options you can use in a query are the same that you can use in a relationship declaration. This allows you to have your default loading technique in the relationship but choose another in a query if needed.
Filter your queries by relationship attributes
Let's use the example from above. Imagine you want to get all the
Person instances that live in Paris. You may get all the instances and then filter them with a python function. This is not the best way to do it. Filtering in the SQL query will be faster. First, because this language is optimized for filtering and other query operations. Also, the more objects you get, the longer it will take to serialize them to python objects.
Filtering by relationship attributes is quite simple. The only thing you have to do is to join your relationship before filtering, like this:
Person.query.join(Person.address).filter(Adress.city == 'Paris').all()
Now, you may encounter the need to filter on a self-referential relationship. To achieve that, you have to alias at least one side of your expression, so you can reference the table unambiguously. There are two ways of achieving that.
The first way is a bit verbose but clear and very flexible. You first define an alias for the table, here
Person. By using it on the
join, you will be able to explicitly tell which side of the relationship you are referring to when accessing attributes.
from sqlalchemy.orm import aliased Manager = aliased(Person) Person.query.filter(Person.firstname = 'Pauline').\ join(Manager, Person.manager).\ filter(Manager.firstname = 'Pierre').\ all()
The second option is less verbose, but I find it confusing. In the
join, you can use the property aliased. With that property, all use of the
Person entity after the join will refer to the alias. You lose some flexibility as the sequencing of the functions is important here.
Person.query.filter(Person.firstname = 'Pauline').\ join(Person.manager, aliased=True).\ filter(Person.firstname = 'Pierre').\ all()
To help you improve your app performances, you may want to spot where you are lazy loading some relationships. If you use the raise loading strategy, with
lazy="raise" in your relationship, SQLAlchemy will raise an error each time you try to retrieve a relationship without eager loading it first. This can be helpful to be sure you keep a constant number of SQL statements on big objects you manipulate.