SQL Indexes for Developers: A Practical Guide
In this article, I use examples based on MySQL 8.0. If you’re working with relational databases that support indexing, such as PostgreSQL, MariaDB, or SQL Server, this article is for you.
Introduction - anecdote
Indexes are a fundamental feature of relational databases. They optimize data retrieval, allowing applications to access information more quickly.
The topic of indexes is extensive. I encourage you to set aside at least half an hour, brew yourself a cup of coffee, and read this article carefully. Indexes are somewhat of a cornerstone in database management. If you understand how they work and learn to leverage them in your application, you will elevate your career to a higher level.
To truly benefit from indexes, you need to understand how they function and how your application operates. This article will guide you on how to effectively use indexes.
So, shall we begin?
Let me share an anecdote that illustrates the importance of indexes for your application.
In my city, there are two dental clinics. Both offer the same services at the same price. Despite one being closer to my home, I choose to visit the one that is further away. Why? Because the service is significantly more efficient than at the other clinic.
At the first clinic, the one closer to me, they use a physical book with a list of patients. The staff adds a new client on a new page when they come in for their first visit, which means finding a patient in the register takes several minutes. The staff must flip through each page until they locate the information.
In contrast, the second clinic, which I have chosen, has a faster registration process. This efficiency is due to the sorting system used by the registration staff. They organize clients by date of birth. When a patient provides their date of birth, the staff can directly access the page containing their information.
As you can see, the solution you'll read about shortly exists in your environment. You've likely encountered it yourself more than once! With this realization, it becomes easier to grasp the technical aspects of this topic.
Introduction - what is it?
Apart from the indexes that back primary keys and unique constraints, an index is created by the developer; the database will not do this for you.
Once you've created an index, the database maintains it. Every modification of data is reflected in the index.
A quick note: This article is written by a developer for developers. To convey practical knowledge, I will avoid convoluted explanations of the B-tree structure. If you're interested in that, you can start with Wikipedia.
An index is a specialized way of storing data that accelerates its retrieval. Just like in the anecdote about patient registration at a clinic, databases operate similarly. An index helps quickly locate specific information without scanning the entire table. Without an index, the database would need to review every row, which takes time. An index reduces this time by indicating where the sought-after data is located.
The B-tree is the method by which the database organizes the index to ensure fast access to data. A B-tree is designed to always remain "balanced": it has few levels, which means reaching any element requires only a few steps, even when dealing with a large volume of data. When we add or remove data, the B-tree structure automatically updates itself to maintain this balance.
This is what a B-tree structure looks like. (Image from Wikipedia)
Enough theory. Let's move on to practical examples from real situations we encounter in our daily work.
Index structure
Let’s take a simple table of products as our example.
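A minimal sketch of such a table, assuming MySQL 8.0. The column names and sample rows are illustrative assumptions; only the titles and the idea of a category come from the text.

```sql
-- Hypothetical schema for the examples in this article.
CREATE TABLE products (
    id       INT UNSIGNED   NOT NULL AUTO_INCREMENT,
    title    VARCHAR(255)   NOT NULL,
    category VARCHAR(100)   NOT NULL,
    price    DECIMAL(10, 2) NOT NULL,
    PRIMARY KEY (id)
);

-- A few sample rows (values are made up).
INSERT INTO products (title, category, price) VALUES
    ('shoes',  'clothing',    59.99),
    ('jacket', 'clothing',    89.00),
    ('laptop', 'electronics', 999.00),
    ('watch',  'electronics', 149.50);
```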
Let’s create an index on the “title” column.
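In MySQL this could look as follows. The index name idx_title is an assumption, and the statement presumes a products table with a title column.

```sql
-- Single-column index on products.title (the name is illustrative).
CREATE INDEX idx_title ON products (title);
```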
Here’s an overview of the structure that has been created in the database.
Now, let’s summarize what has happened.
We created an index on the products table. The index is on the “title” column, and the values from this column are stored in the index structure. Now we can see how an index operates. An index can consist of a single column or multiple columns, which we will explore further in the next sections.
Principles of operation
The B-tree structure, in which the index operates, imposes certain principles of functionality. By understanding these principles, we can consciously harness the power of indexes. Here are the three main principles that will keep your queries fast.
Direct Data Retrieval
The primary benefit of indexes, which we discussed at the outset, is how data is retrieved from them. Thanks to the B-tree structure, we can quickly access the information we need.
Imagine a customer enters an online store. They are interested in laptops, so they type “laptop” into the search bar.
For simplicity, let’s say the following query is sent to the database:
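A sketch of such a query, assuming a products table with an indexed title column:

```sql
-- Exact-match lookup that can be answered through the index on title.
SELECT *
FROM products
WHERE title = 'laptop';
```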
Since we have an index on the “title” column, the database will utilize this index.
Here’s how the process of searching for data in the index looks:
Thanks to the fact that the data is sorted alphabetically, we can quickly access laptops by comparing the searched title.
If we were to log what happens during the search process, it would look like this:
laptop < shoes -> true, turn left
laptop < jacket -> false, turn right
laptop = laptop -> found
From Left to Right
An index can be created from one or more columns. The order in which we specify the columns when creating the index matters.
Take a look at the example below. It illustrates this concept.
Let’s create an index that includes both the title and the category of the product:
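A possible definition, with title as the leading column. The name idx_title_category is an assumption chosen by analogy with the second index.

```sql
-- Two-column index: entries are sorted by title first, then by category.
CREATE INDEX idx_title_category ON products (title, category);
```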
Here’s a visualization of the index structure.
Now, for comparison, let’s reverse the order of the columns:
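The reversed variant might be defined like this, assuming the same products table:

```sql
-- Same columns, reversed order: entries are sorted by category first,
-- then by title within each category.
CREATE INDEX idx_category_title ON products (category, title);
```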
And here is the second structure.
As shown in the example above, the order in which columns are defined is critically important. As a developer, you must know how to define an index based on the application's access patterns.
In this example, it is better to start with the “category” column. The database holds fewer categories than product titles, so “category” has fewer distinct values, and all products of one category sit next to each other in the index.
A smaller pool of leading values means fewer steps in the index to reach the right group, after which the title is located within it. This is exactly what the second index, “idx_category_title,” illustrates.
Scan in one direction
The last property of index operation is the scan direction. By understanding this principle, you can efficiently retrieve sorted data using only the index.
You already know that the data in the index is sorted. The default sorting order is ascending (ASC).
Let’s execute a query and see what happens in the index:
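One query of this kind, as a hedged example. The filter values are assumptions based on the titles mentioned in this section, and an index on title is presumed.

```sql
-- The index on title is already sorted, so the rows can be returned
-- in index order without a separate sort step.
SELECT title
FROM products
WHERE title IN ('laptop', 'watch')
ORDER BY title;
```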
Since the data is sorted, we can retrieve what we need from the index without burdening the table with all the products. The records for “laptop” and “watch” are already in alphabetical order, so we just need to return them. There is no need to create additional temporary tables or files to sort the data.
Summary of Operational Principles
By understanding these principles, you know how an index holds data and how to access it. You understand why searching is fast.
I’ll introduce one more concept: random I/O and sequential I/O.
Data is stored on disk, so when querying it, we need to retrieve it from the disk. However, the rows of the “products” table that a query needs may be stored in various locations on the disk. Fetching them means jumping between those locations, which constitutes random I/O.
Such operations take more time and resources. An index organizes and sorts data, so matching entries sit next to each other and can be read one after another. Reading adjacent data in order like this is sequential I/O, and it also minimizes the amount of data that must be pulled from the disk.
Now, let’s move forward and expand on these principles in practice.
Covering Index
When all the data we need to retrieve from the database is contained within the index, we refer to this as a Covering Index.
Look at the following example:
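A minimal illustration, assuming an index on products.title:

```sql
-- Only the indexed column is selected, so the index alone
-- contains everything the query needs.
SELECT title
FROM products
WHERE title = 'laptop';
```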
The above query does not need to access the main product table at all. All the necessary data is present in the index, making the query ultra-optimized and fast.
As you might have noticed, the index structure holds identifiers (specifically references) to the main product table. Notice that when we request both the title and the product ID, the query still retrieves data from the index.
When do we truly utilize the index to its fullest potential? This can be easily verified. We can use the EXPLAIN statement by placing it before the query.
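For instance, assuming an InnoDB products table with a primary key id and an index on title:

```sql
-- EXPLAIN shows the planned execution without running the query.
-- InnoDB secondary indexes also store the primary key value,
-- so selecting id and title can still be served from the index.
EXPLAIN
SELECT id, title
FROM products
WHERE title = 'laptop';
```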
Here’s what the database will return.
In the “Extra” column, you’ll see “Using index.” This means that the query is fully covered by the index.
With EXPLAIN, we can check how the database intends to execute the query. The query itself is not executed; you could say it's a simulation of its execution.
COUNT on index
Checking how many records are in a table can be a time-consuming process, especially when the table contains a lot of data.
Using COUNT(*) for a database means, “Hey database! Tell me how many records are in the table, and do it as quickly as you can.”
So, if we’re aiming for speed, why not utilize the index? If we have an index that contains unique values, such as email addresses in a user table, the database will use that index to return the total number of records in the table.
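The query in question could be as simple as this. The table name users and the unique index on email are assumptions for illustration.

```sql
-- Assumes a users table with a unique index on email.
-- The optimizer can count entries in the index instead of reading full rows.
SELECT COUNT(*)
FROM users;
```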
Let’s check what EXPLAIN shows for the above query.
Our result was retrieved from the index without accessing the main user table at all. This is yet another advantage of indexes. They are widely used in database queries to improve performance.
Reasons why it doesn’t work
Inequality operator
Queries that use the inequality operator (!=) inherently have a negative impact on query performance. To return accurate results, the database must read all records, discard those that do not match the condition, and keep only those that do.
Consider this query:
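A sketch of such a query against the assumed products table:

```sql
-- Everything except shoes: most of the table matches,
-- so the index on title gives little or no benefit.
SELECT *
FROM products
WHERE title != 'shoes';
```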
You already know how data is retrieved from the index, which is designed to quickly access the values you are looking for. However, if you request all products except shoes, the index may not be utilized, leading the database to read all records from the table instead. This means that even though you might have an index on the “title” column, it won't help optimize this particular query, resulting in a full table scan.
Pattern matching
Similarly, when we want to search for records using a pattern, we encounter limitations. Look at the following query:
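For example, searching for the fragment “ho” anywhere in the title, assuming an index on products.title:

```sql
-- Leading wildcard: the start of the value is unknown,
-- so the sorted index cannot be used to jump to matching entries.
SELECT *
FROM products
WHERE title LIKE '%ho%';
```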
In this case, the index cannot be used. The data in the index is sorted in ascending order by default, so we cannot compare the fragment “ho” with the word “shoes” because we do not know the preceding part of the text. Without knowing the initial fragment of the text, we cannot leverage the index effectively.
On the other hand, consider this query:
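A prefix search might look like this. The pattern value is an assumption.

```sql
-- Known prefix: the database can seek to the first entry starting
-- with 'sho' and read forward from there.
SELECT *
FROM products
WHERE title LIKE 'sho%';
```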
Here, we have a situation where we know the starting fragment of the text, allowing us to easily utilize the index. In this case, the database can quickly access the relevant portion of the index to retrieve the records that match the pattern.
ORDER BY DESC
By default, the index holds data sorted in ascending order. MySQL 8.0 can read such an index backward, but that is generally slower than a forward scan, so a query like the one below may not utilize the index effectively.
We want to retrieve orders from a store for a specific day and return the results sorted from the most recently added. For clarity, we have created an index on the “created_at” column.
However, executing the following query might cause the index to be ignored:
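A hedged reconstruction of such a query. The orders table layout and the date values are assumptions.

```sql
-- Orders from a single day, newest first.
SELECT *
FROM orders
WHERE created_at >= '2024-05-01 00:00:00'
  AND created_at <  '2024-05-02 00:00:00'
ORDER BY created_at DESC;
```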
Since the data in the index is sorted in ascending order by default, the database may determine that it is more efficient to skip the index altogether and directly access the main orders table.
A solution to this issue is to change the sorting of the data in the index. We can create the index as follows:
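MySQL 8.0 supports descending key parts, so the definition could be written like this. The index name is an assumption.

```sql
-- Entries are stored newest-first, matching ORDER BY created_at DESC.
CREATE INDEX idx_created_at_desc ON orders (created_at DESC);
```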
Now, by specifying descending order (DESC), the results can be efficiently retrieved using the index. This allows the database to quickly access the relevant data in the desired order, optimizing performance for this query.
Multiple indexes in one query
When a query filters on several columns and each of them has its own index, a question arises: which one will be used and why that particular one?
Let’s look at an example with a user table.
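A minimal sketch of such a table. The table name users, the index names, and all sample values are illustrative assumptions.

```sql
-- Hypothetical users table: email is unique, name is not.
CREATE TABLE users (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name  VARCHAR(100) NOT NULL,
    email VARCHAR(255) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY idx_email (email),
    KEY idx_name (name)
);

INSERT INTO users (name, email) VALUES
    ('Anna', 'anna@example.com'),
    ('Anna', 'anna.k@example.com'),
    ('John', 'john@example.com');
```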
As you can see, the “email” column contains unique values, while the “name” column may have duplicates.
Let’s check the state of indexes on the user table:
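In MySQL, the indexes of a table can be listed like this (users is the assumed table name):

```sql
-- Lists every index on the table, including the Cardinality estimate.
SHOW INDEX FROM users;
```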
Take note of the “Cardinality” column. It contains the approximate number of unique values.
Now, let’s run a query that utilizes two columns, each having its own index:
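Such a query, prefixed with EXPLAIN to reveal the chosen index, might look like this. The literal values are made up.

```sql
-- Both columns have their own index; the "key" column of the
-- EXPLAIN output shows which one the optimizer picked.
EXPLAIN
SELECT *
FROM users
WHERE email = 'anna@example.com'
  AND name = 'Anna';
```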
In this case, the index on the “email” column was utilized. This is expected because the email is unique and has a higher cardinality.
The optimizer is cost-based and will generally prefer the more selective index, which usually means the one with the higher cardinality. Higher cardinality indicates that there are more unique values, allowing the database to narrow down the search more effectively. Thus, in queries involving multiple indexes, understanding the cardinality of the indexed columns helps us predict which index will be used.
Index obfuscation
Consider the following queries, where the index will not be utilized:
In this case, an index exists on the “price” column.
An index exists on the “created_at” column.
An index exists on the “email” column.
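The three cases can be sketched as follows. Table names, expressions, and literal values are assumptions; only the indexed columns (price, created_at, email) come from the text.

```sql
-- 1. Calculation on the indexed price column.
SELECT * FROM products WHERE price * 0.9 > 100;

-- 2. Function applied to the indexed created_at column.
SELECT * FROM orders WHERE YEAR(created_at) = 2024;

-- 3. Function applied to the indexed email column.
SELECT * FROM users WHERE LOWER(email) = 'anna@example.com';
```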
When an SQL function is applied to a column with an index, or calculations are performed (as in the first example), the index cannot be used. Referring back to the beginning of the article, remember that the index structure stores values in a way that facilitates direct comparisons. Using a function or calculation on the indexed column disrupts this straightforward comparison, preventing the database from efficiently leveraging the index.
If you want to apply a specific function and still benefit from an index, you can create an index on the result of the function itself:
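In MySQL this relies on functional key parts, available from version 8.0.13. A sketch, assuming a users table with an email column; the index name is hypothetical.

```sql
-- Note the double parentheses: the inner pair marks an expression.
CREATE INDEX idx_email_lower ON users ((LOWER(email)));

-- The query must use the same expression for the index to apply.
SELECT *
FROM users
WHERE LOWER(email) = 'anna@example.com';
```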
Similarly, if you need to perform a calculation on a column with an index and want to ensure that the index is used, you can apply what’s called “Redundant conditions.” Here’s how it would look:
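A hypothetical example of the idea, assuming an index on products.price. The discount factor is made up; any calculation for which “price > 100” is always implied would do.

```sql
-- price * 0.9 > 100 can only be true when price > 100, so the second
-- condition does not change the result. It does give the optimizer a
-- plain comparison on the indexed column to work with.
SELECT *
FROM products
WHERE price * 0.9 > 100
  AND price > 100;
```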
In the example above, “price > 100” is redundant, but it enables the use of the index.
Remember to use EXPLAIN with your queries to check if the index is actually being utilized in such cases.
Type juggling
Type juggling can prevent a query from using an index, and identifying this issue might not be straightforward.
Take a look at the following entity with the index definition “idx_order_id” on the column “order_id,” which is of type STRING.
Now, let’s execute the following query:
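For illustration, assuming a hypothetical payments table whose order_id column is a string type with the idx_order_id index:

```sql
-- order_id is a string column, but the value is passed as a number.
SELECT *
FROM payments
WHERE order_id = 12345;
```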
As you can see, order_id is provided as an INT in this query. Such cases can occur, and the query will still work correctly. However, the index will not be used. Why?
The database essentially interprets the query like this:
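Conceptually, the comparison behaves roughly like the first statement below. This is a simplification of how MySQL converts the values, and the payments table is a hypothetical name. The second statement shows one way to avoid the conversion.

```sql
-- Rough equivalent of what happens: the column is converted, not the literal.
SELECT *
FROM payments
WHERE CAST(order_id AS UNSIGNED) = 12345;

-- Passing the value as a string keeps the comparison on the raw column,
-- so idx_order_id can be used.
SELECT *
FROM payments
WHERE order_id = '12345';
```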
Because order_id is of type STRING, it must be converted using the CAST() function. As we know from earlier, this obscures the index, preventing it from being used. Be mindful of situations like this.
Too few records in the table
The database optimizes the query before executing it. If there are few records in the table—say, fewer than a thousand—the index likely won't be used. Database engines are optimized, and a FULL TABLE SCAN can sometimes be more efficient in terms of I/O than using an index to fetch records stored in random disk locations.
Keep this in mind when testing indexes locally or in a development environment, where there are usually fewer records. The database may analyze and optimize queries differently there than in production.
Multiple-column index
A multi-column index can be a significant optimization. It's important to understand the principles of how such an index operates to make the most of it. As you'll see shortly, using EXPLAIN will be crucial in this case.
Familiarize yourself with the data that will be used in this example.
Here are the two key principles of how multi-column indexes work:
Order Matters: A multi-column index is read from left to right. If the query skips a column, the index cannot be used for that column or for any column after it.
What does this order mean in practice? Take a look at the following index. We’re trying to retrieve products that have photos, a description, and stock availability, so we set up an index on these columns.
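Such an index might be defined as follows. The column name has_description and the index name are assumptions; has_photos and quantity are named in the text.

```sql
-- Column order: has_photos first, quantity last.
CREATE INDEX idx_photos_description_quantity
    ON products (has_photos, has_description, quantity);
```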
Let’s run a query and see what EXPLAIN shows:
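A query that touches all three columns, assuming an index on (has_photos, has_description, quantity). The literal values are illustrative.

```sql
-- Every column of the index appears in the WHERE clause, left to right.
EXPLAIN
SELECT *
FROM products
WHERE has_photos = 1
  AND has_description = 1
  AND quantity = 10;
```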
Great, we can see that the index is fully used.
Now, let’s omit the has_photos column and see what happens:
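The same query without the leading column, under the same assumptions about the index:

```sql
-- has_photos, the first column of the index, is missing from the filter.
EXPLAIN
SELECT *
FROM products
WHERE has_description = 1
  AND quantity = 10;
```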
Because the first column of the index is skipped, the index won’t be used. So, remember that the WHERE clause must include the columns of a multi-column index starting from the left, without gaps. You can leave out columns only from the end, such as quantity.
Using a range query stops index use on subsequent columns:
In this example, we will use a range query to filter out products that do not have stock available. To illustrate this principle, let’s create the following index:
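One arrangement that demonstrates the effect places the range column in the middle. The column order and the index name here are assumptions, not necessarily the original definition.

```sql
-- quantity sits before has_description, so a range condition on
-- quantity comes ahead of the last column in the index.
CREATE INDEX idx_photos_quantity_description
    ON products (has_photos, quantity, has_description);
```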
Let’s execute a query:
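A sketch of the first query, assuming an index on (has_photos, quantity, has_description) as the only multi-column index on the table:

```sql
-- Equality on has_photos, range on quantity, equality on has_description.
EXPLAIN
SELECT *
FROM products
WHERE has_photos = 1
  AND quantity > 0
  AND has_description = 1;
```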
Now, let's run another query:
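The second query drops the condition on the column that follows the range column. The same assumed index on (has_photos, quantity, has_description) applies.

```sql
-- Only the columns up to and including the range column are filtered.
EXPLAIN
SELECT *
FROM products
WHERE has_photos = 1
  AND quantity > 0;
```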
Let’s compare the output of both EXPLAIN statements. They look similar, and the key_len column is the same in both. In the first example, when the index is fully utilized, key_len equals 9. The key_len value reports how many bytes of the index key the query actually uses, so for a multiple-column index it shows how many of the columns are in play.
The key point is that using a range operator stops further use of the index. Keep this in mind when designing your indexes.
Summary
Indexes are powerful tools for boosting database performance, but using them effectively requires understanding the nuances—such as order, range queries, and conditions that prevent their usage. Mastering these can help you optimize queries and make smarter design choices for robust, high-speed applications.