A Detailed Guide on SQL Query Optimization
This article was published as a part of the Data Science Blogathon
Overview of SQL Query Optimization
SQL Query optimization is defined as the iterative process of enhancing the performance of a query in terms of execution time, the number of disk accesses, and many more cost measuring criteria. Data is an integral part of any application. Access to the data should be in the fastest way possible to enhance the user experience while using the application.
It is a very critical task. Even slight changes in SQL queries can improve performance drastically. There is no step-by-step guide for the same. In turn, we must use general guidelines for writing queries, which operators to use. Then check for execution plans and find out hat part of the query takes the most time and rewriting that part in some other way.
Table of contents
- Overview of SQL Query Optimization
- Introduction to SQL
- Query Processing
- Purpose of SQL Query Optimization
- Database for the tutorial
- Metrics for analyzing query performance for SQL Query Optimization
- SQL Query Optimization Techniques
- Frequently Asked Questions
Introduction to SQL
SQL(Structured Query Language) is a standard language for accessing, storing, creating, deleting databases, and perform all other data-related operations. There are four types of language in SQL:
1. DDL(Data Definition Language)
This includes commands for creating and altering the schema of the database. Example: CREATE, DROP, ALTER, TRUNCATE.
2. DML(Data Manipulation Language)
This includes commands for updating the data in the database. It alters the instance of the relation. For example, SELECT, UPDATE, INSERT, DELETE.
3. DCL(Data Control Language)
This includes commands for access to the database. For example, GRANT, REVOKE.
4. TCL(Transaction Control Language)
This includes the commands used to manage transactions in the database. For example, COMMIT, ROLLBACK.
Before going towards the main topic of SQL Query optimization, let us first understand the actual processing of query:
Query processing is defined as the group of phases associated with the extraction of data from the database. It includes conversion of queries written in a high-level language such as SQL into a form that can be understood by the physical level implementation of the database, SQL query optimization techniques, and the original evaluation of the query.
There are three major steps involved in query processing:
1. Parser and translator: The first step in query processing is parsing and translation. Parser just like a parser in compilers checks the syntax of the query whether the relations mentioned are present in the database or not. A high-level query language such as SQL is suitable for human use. But, it is totally unsuitable to system internal representation. Therefore, translation is required. The internal representation can be extended form of relational algebra.
2. Optimization: A SQL query can be written in many different ways. An optimized query also depends on how the data is stored in the file organization. A Query can also have different corresponding relational algebra expressions.
So, the above query can be written in the two forms of relational algebra. So it totally depends on the implementation of the file system which one is better.
3. Execution plan: A systematic step-by-step execution of primitive operations for fetching data from the database is termed a query evaluation plan. Different evaluation plans for a particular query have different query costs. The cost may include the number of disk accesses, CPU time for execution of the query, time of communication in the case of distributed databases.
Purpose of SQL Query Optimization
The major purposes of SQL Query optimization are:
1. Reduce Response Time: The major goal is to enhance performance by reducing the response time. The time difference between users requesting data and getting responses should be minimized for a better user experience.
2. Reduced CPU execution time: The CPU execution time of a query must be reduced so that faster results can be obtained.
3. Improved Throughput: The number of resources to be accessed to fetch all necessary data should be minimized. The number of rows to be fetched in a particular query should be in the most efficient manner such that the least number of resources are used.
Database for the tutorial
We will be using the AdventureWorks database in this tutorial for showing various commands and their optimized solutions. You can download the database from here.
AdventureWorks database is a sample database provided by Microsoft SQL Server. This is a standard database used for showing day-to-day transaction processing for a business. Scenarios include sales, customer management, product management, and human resources.
For more information on the tables, relationships of the database, you can visit this link.
Metrics for analyzing query performance for SQL Query Optimization
There are several metrics for calculating the cost of the query in terms of space, time, CPU utilization, and other resources:
1. Execution Time: The most important metrics to analyze the query performance is the execution time of the query. Execution time/Query duration is defined as the time taken by the query to return the rows from the database. We can find the query duration using the following commands:
SET STATISTICS TIME ON SELECT * FROM SalesLT.Customer;
By using STATISTICS TIME ON, we can see the parse time, compile-time, execution time, and completion time of the query.
Parse and Compile Time: The time taken to parse and compile the query to check the syntax of the query is termed Parse and Compile time.
Execution Time: The CPU time used by the query to fetch the data is termed Execution time.
Completion time: The exact time at which the query returned the result is termed Completion time.
By analyzing these times, we can get a clear picture of whether the query is performing up to the mark or not.
2. Statistics IO:
IO is the major time spend accessing the memory buffers for reading operations in case of query. It provides insights into the latency and other bottlenecks for executing the query. By setting STATISTICS IO ON, we get the number of physical and logical reads performed to execute the query.
Logical reads: Number of reads that were performed from the buffer cache.
Physical reads: Number of reads that were performed from the storage device as they were not available in the cache.
3. Execution Plan:
An execution plan is a detailed step-by-step processing plan used by the optimizer to fetch the rows. It can be enabled in the database using the following procedure. It helps us to analyze the major phases in the execution of a query. We can also find out which part of the execution is taking more time and optimize that sub-part.
SELECT p.Name, Color, ListPrice FROM SalesLT.Product p INNER JOIN SalesLT.ProductCategory pc ON P.ProductCategoryID = pc.ProductCategoryID;
As we can see above, the execution plan shows which tables were accessed, what index scans were performed for fetching the data. If joins are present it shows how these tables were merged.
Further, we can see a more detailed analysis view of each sub-operation performed during query execution. Let us see the analysis of the index scan:
As we can see above, we can get the values of the number of rows read, the actual number of batches, estimated operator cost, estimated CPU cost, estimated subtree cost, number of executions, actual rebinds. This gives us a detailed overview of the several cost involved in query execution.
SQL Query Optimization Techniques
Till now, we have seen how a query is executed and different measures to analyze the query performance. Now we will learn the techniques to optimize the query performance in SQL. There are some useful practices to reduce the cost. But, the process of optimization is iterative. One needs to write the query, check query performance using io statistics or execution plan, and then optimize it. This cycle needs to be followed iteratively for query optimization. The SQL Server itself also finds the optimal and minimal plan to execute the query.
An index is a data structure used to provide quick access to the table based on a search key. It helps in minimizing the disk access to fetch the rows from the database. An indexing operation can be a scan or a seek. An index scan is traversing the entire index for matching criteria whereas index seek is filtering rows on a matching filter.
SELECT p.Name, Color, ListPrice FROM SalesLT.Product p INNER JOIN SalesLT.ProductCategory pc ON P.ProductCategoryID = pc.ProductCategoryID INNER JOIN SalesLT.SalesOrderDetail sod ON p.ProductID = sod.ProductID WHERE p.ProductID>1
In the above query, we can see that a total of 99% of the query execution time goes in index seek operation. Therefore, it is an important part of the optimization process.
Guidelines for choosing index:
1. Indexes should be made on keys that frequently occur in WHERE clause and join statements.
2. Indexes should not be made on columns that are frequently modified i.e UPDATE command is applied on these columns frequently.
3. Indexes should be made on Foreign keys where INSERT, UPDATE, and DELETE are concurrently performed. This allows UPDATES on the master table without shared locking on the weak entity.
4. Indexes should be made on attributes that occur together commonly in WHERE clause using AND operator.
5. Indexes should be made on ordering key values.
Selection of the rows that are required instead of selecting all the rows should be followed. SELECT * is highly inefficient as it scans the entire database.
SET STATISTICS TIME ON SELECT * FROM SalesLT.Product
SET STATISTICS TIME ON SELECT ProductNumber, Name, Color,Weight FROM SalesLT.Product
As we can see from the above two outputs, the time is reduced to one-fourth when we use the SELECT statement for selecting only those columns that are required.
Avoid using SELECT DISTINCT
SELECT DISTINCT command in SQL is used for fetching unique results and remove duplicate rows in the relation. To achieve this task, it basically groups together related rows and then removes them. GROUP BY operation is a costly operation. So to fetch distinct rows and remove duplicate rows, one might use more attributes in the SELECT operation.
Let us take an example,
SET STATISTICS TIME ON SELECT DISTINCT Name, Color, StandardCost, Weight FROM SalesLT.Product
SET STATISTICS TIME ON SELECT Name, Color, StandardCost, Weight, SellEndDate, SellEndDate FROM SalesLT.Product
As we can see from the execution of the above two queries, the DISTINCT operation takes more time to fetch the unique rows. So, it is better to add more attributes in the SELECT query to improve the performance and get unique rows.
Inner joins vs WHERE clause
We should use inner join for merging two or more tables rather than using the WHERE clause. WHERE clause creates the CROSS join/ CARTESIAN product for merging tables. CARTESIAN product of two tables takes a lot of time.
SET STATISTICS IO ON SELECT p.Name, Color, ListPrice FROM SalesLT.Product p, SalesLT.ProductCategory pc WHERE P.ProductCategoryID = pc.ProductCategoryID
SET STATISTICS TIME ON SELECT p.Name, Color, ListPrice FROM SalesLT.Product p INNER JOIN SalesLT.ProductCategory pc ON P.ProductCategoryID = pc.ProductCategoryID
So, we can see from the above outputs that inner join takes almost half time as compared to join using WHERE clause.
The limit command is used to control the number of rows to be displayed from the result set. The result set needs to display only those rows that are required. Therefore, one must use limit with the production dataset and provide an on-demand computation of rows for the production purpose
SET STATISTICS IO ON SELECT Name, Color, ListPrice FROM SalesLT.Product LIMIT 10
The above query prints the top 10 rows of the resultset. This drastically improves the performance of the query.
IN versus EXISTS
IN operator is more costly than EXISTS in terms of scans especially when the result of the subquery is a large dataset. So we should try to use EXISTS rather than using IN for fetching results with a subquery.
Let us see this with an example,
SET STATISTICS TIME ON
SELECT ProductNumber,Name,Color FROM SalesLT.Product WHERE ProductID IN (SELECT ProductID FROM SalesLT.ProductDescription)
SET STATISTICS TIME ON SELECT ProductNumber,Name,Color FROM SalesLT.Product WHERE EXISTS (SELECT ProductID FROM SalesLT.ProductDescription)
We have executed the same query having a subquery with IN command and EXISTS commands and we observe that the EXISTS command takes half of the time as compared to IN command and the number of physical and logical scans is very low.
Loops versus Bulk insert/update
The loops must be avoided because it requires running the same query many times. Instead, we should opt for bulk inserts and updates.
SET STATISTICS TIME ON DECLARE @Counter INT SET @Counter=1 WHILE ( @Counter <= 10) BEGIN PRINT 'The counter value is = ' + CONVERT(VARCHAR,@Counter) INSERT INTO [SalesLT].[ProductDescription] ([Description] ,[rowguid] ,[ModifiedDate]) VALUES ('This is great' ,NEWID() ,'12/01/2010') SET @Counter = @Counter + 1 END
USE [AdventureWorksLT2019] GO SET STATISTICS TIME ON INSERT INTO [SalesLT].[ProductDescription] ([Description] ,[rowguid] ,[ModifiedDate]) VALUES ('This is great' ,NEWID() ,'12/01/2010'), ('New news' ,NEWID() ,'12/01/2010'), ('Awesome product.' ,NEWID() ,'12/01/2010'), .........., ('Awesome product.' ,NEWID() ,'12/01/2010') GO
As we have seen above bulk insert works faster than loop statements.
Frequently Asked Questions
A. Optimizing an SQL query involves various techniques to improve its performance and efficiency. Some common strategies include:
1. Indexing: Create appropriate indexes on columns used in WHERE, JOIN, and ORDER BY clauses to speed up data retrieval.
2. Filtering: Use efficient WHERE clauses to minimize the number of rows processed.
3. Avoid SELECT *: Retrieve only the necessary columns to reduce data transfer.
4. JOIN Optimization: Choose the most efficient JOIN type (INNER JOIN, LEFT JOIN, etc.) and ensure relevant columns are indexed.
5. Subqueries: Optimize subqueries to avoid excessive data processing.
6. Use EXISTS instead of IN: Replace IN clauses with EXISTS for better performance.
7. Limit Data: Use LIMIT or TOP to restrict the number of returned rows, especially for large result sets.
8. Denormalization: Consider denormalizing data for frequently accessed queries to reduce JOIN complexity.
9. Caching: Utilize query caching mechanisms provided by the database system.
10. Review Execution Plan: Analyze the query execution plan and optimize it if necessary using query hints.
11. Partitioning: If applicable, partition large tables to improve data retrieval for specific ranges.
Regularly monitoring query performance and profiling the database system can help identify bottlenecks and further optimize queries as needed.
A. The query optimization process involves analyzing SQL queries to improve their execution efficiency. It includes steps such as parsing the query, generating candidate execution plans, estimating costs for each plan, and selecting the most optimal plan based on cost estimations. Indexing, query rewriting, and other techniques are applied to reduce query execution time and improve overall database performance.
Some other things to keep in mind:
- Avoid using correlated nested queries.
- Avoid inner joins with Equality or OR conditions.
- Check whether records exist before fetching them.
- Use indexing properly, try to create more indexes for fetching composite columns.
- Use Wildcards wisely.
- Try to use WHERE rather than HAVING. Only use HAVING for aggregated values.
So we learned that how minor changes in queries can improve the performance of the query drastically. This will boost the performance of applications providing a better user experience. Keep all the guidelines in mind while writing queries.
And finally, it does not go without saying,
Thanks for reading! The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.