MySQL Secondary Indexes: How and why with examples
One of the most important concepts in SQL is the index. This tutorial will teach you what an index is, how to use it, and what it's good for.
What's an index?
An index is a way for a database to quickly locate a particular set of rows. Indexes use sorted storage or some other form of organization to make it possible to find the value you're looking for without examining every row. As an example, for 1 million records, an index makes it possible to find one value with only about 20 comparisons, rather than 1 million comparisons (2^20 is just over 1 million, so a binary search needs at most 20 steps). This massive increase in efficiency comes from data structures that support binary search or another efficient search algorithm.
The bottom line is that judicious use of indexes can make your queries much, much faster.
Last week we talked about primary keys in MySQL, which are a special kind of index. Read that first if you're just getting started learning about indexes in MySQL.
This tutorial will cover secondary indexes -- that is, indexes other than primary keys.
Creating an index
We'll start with the example employees
table from the previous blog
post. Here's how we create it.
CREATE TABLE employees (
employee_id bigint,
first_name varchar(100),
last_name varchar(100),
title varchar(80),
PRIMARY KEY (employee_id)
);
We can insert rows into our table like this:
INSERT INTO employees VALUES
(1, 'Tim', 'Sehn', 'CEO'),
(2, 'Brian', 'Hendricks', 'Software Engineer'),
(3, 'Brian', 'Fitzgerald', 'Salesman');
And our data looks like this:
SELECT * FROM employees;
+-------------+------------+------------+-------------------+
| employee_id | first_name | last_name | title |
+-------------+------------+------------+-------------------+
| 1 | Tim | Sehn | CEO |
| 2 | Brian | Hendricks | Software Engineer |
| 3 | Brian | Fitzgerald | Salesman |
+-------------+------------+------------+-------------------+
3 rows in set (0.01 sec)
As our company grows, we'll eventually have thousands of employees. When we build our company directory, we want to be able to search for an employee by name. The query we run will look like this:
SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';
When we run this query on the table, it has to examine every single row looking for a match. This is called a full table scan, and it destroys your query's performance.
Let's add an index to speed this query up.
CREATE INDEX emp_name ON employees (last_name, first_name);
This statement adds an index on two columns of the employees
table. Let's go over it line by line.
CREATE INDEX
emp_name -- The name of this index. Index names must be unique within a table.
ON employees -- the name of the table to add the index to
(last_name, first_name); -- the table columns to index, in order
You can also use this alternate syntax, which does the same thing.
ALTER TABLE employees ADD KEY emp_name (last_name, first_name);
Using our index
Now that we've created an index, we can use the EXPLAIN keyword to see how it affects our query performance.
EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | emp_name | emp_name | 806 | const,const | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)
This EXPLAIN
output indicates we're using the key emp_name
and matching exactly one row, which
is what we want. Compare to the output without an index defined on these columns.
EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
Without an index, we have NULL
for our key, and have to scan all 3 rows in the table to get our
result. For a large table with many rows, this will yield terrible performance.
Partial index matching
Above, we defined our index on (last_name, first_name), so there are two columns in the index. Can we use just one of them to do a lookup on our index? Well, it depends.
For the last_name
column, yes.
EXPLAIN SELECT * FROM employees WHERE last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | emp_name | emp_name | 403 | const | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)
That works fine and uses the index. But a lookup on the first_name column alone does not:
EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
Now we're back to a full table scan (NULL
key and ALL
as our type
).
Why the difference? We defined the index as (last_name, first_name)
, and the order matters. MySQL
can use an index if you specify a prefix of its columns (the first one or more columns in it). The
more columns you specify, in the order defined by the index, the better MySQL can use it to match
rows, the fewer rows you need to scan, and the faster your query gets.
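If you also need fast lookups by first name alone, one option is a second index whose leading column is first_name. Here's a sketch, with a hypothetical index name:
CREATE INDEX emp_first_name ON employees (first_name);
-- With this index in place, a first-name-only lookup can use emp_first_name
-- instead of scanning every row:
EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian';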
Unique indexes
Depending on your data, it might be a good idea to declare your index to be UNIQUE
. Uniqueness is
more commonly declared as a CONSTRAINT
on a table, like this:
CREATE TABLE vendors (
id BIGINT PRIMARY KEY,
name varchar(1024),
tax_id VARCHAR(255),
CONSTRAINT UNIQUE (tax_id)
);
This ensures that nobody accidentally creates a duplicate vendor, since we know that only one company can have a given tax identifier.
Under the covers, MySQL and other databases create an index for every UNIQUE
constraint, because
that's the only way to enforce the constraint efficiently. Accordingly, you can also create a unique
index with the CREATE INDEX
statement.
CREATE UNIQUE INDEX unique_tax_id ON vendors (tax_id);
Null values in indexes
Unlike a primary key, which can never contain NULL
values, indexes can. Using our vendors table
above:
INSERT INTO vendors VALUES (1, 'Turbo plumbing', NULL), (2, 'General contracting', NULL);
Query OK, 2 rows affected (0.01 sec)
Records: 2 Duplicates: 0 Warnings: 0
SELECT * FROM vendors;
+----+---------------------+--------+
| id | name | tax_id |
+----+---------------------+--------+
| 1 | Turbo plumbing | NULL |
| 2 | General contracting | NULL |
+----+---------------------+--------+
2 rows in set (0.00 sec)
So even though we have a UNIQUE index on the tax_id column, we can insert two NULL values into it. In MySQL and other SQL databases, NULL is a special value, not equal to any other value (including NULL), so it doesn't violate a UNIQUE constraint.
Sometimes this is what you want, but in our case it's not -- we need every vendor to have a tax
ID. To get this behavior, we need to separately declare the column NOT NULL
to prevent NULL
values from being inserted.
CREATE TABLE vendors (
id BIGINT PRIMARY KEY,
name varchar(1024),
tax_id VARCHAR(255) NOT NULL,
CONSTRAINT UNIQUE (tax_id)
);
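With tax_id declared NOT NULL, the earlier insert is rejected. A sketch of what you'd see:
INSERT INTO vendors VALUES (1, 'Turbo plumbing', NULL);
-- Fails with ERROR 1048 (23000): Column 'tax_id' cannot be null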
Examining indexes
MySQL has some convenience statements used to inspect your database schemas, but they aren't part of standard SQL, so you might not be aware of them if you're used to another dialect. There are two that are handy when it comes to understanding the indexes on a table.
First, SHOW CREATE TABLE
.
SHOW CREATE TABLE vendors;
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table
|
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vendors | CREATE TABLE `vendors` (
`id` bigint NOT NULL,
`name` varchar(1024) DEFAULT NULL,
`tax_id` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_tax_id` (`tax_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This statement outputs the entire table definition, including all constraints and indexes.
If you just want to understand the index info for a table, you can use the SHOW INDEXES
statement
instead.
SHOW INDEXES FROM vendors;
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| vendors | 0 | PRIMARY | 1 | id | A | 2 | NULL | NULL | | BTREE | | | YES | NULL |
| vendors | 0 | unique_tax_id | 1 | tax_id | A | 1 | NULL | NULL | YES | BTREE | | | YES | NULL |
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.02 sec)
Note that the PRIMARY KEY
is an index as well, so it's included in the result.
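If you prefer plain SQL over MySQL's SHOW statements, the same information is also available from the information_schema database; a sketch:
SELECT index_name, seq_in_index, column_name, non_unique
FROM information_schema.statistics
WHERE table_schema = DATABASE() AND table_name = 'vendors';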
Secondary indexes on TEXT and BLOB columns
MySQL has a couple of types that are used for storing large documents: TEXT for character data and BLOB for binary data. These are useful when you need to store data that's longer than the maximum length of a row in MySQL (roughly 64 KB).
But what if you need to create an index on one of these columns? MySQL doesn't allow that, because it would need to index an unlimited amount of data. Here's what happens when you try:
CREATE TABLE documents(
id bigint,
document text);
Query OK, 0 rows affected (0.03 sec)
CREATE INDEX doc_text ON documents(document);
ERROR 1170 (42000): BLOB/TEXT column 'document' used in key specification without a key length
There's a solution though. When you need to create an index on a TEXT
or BLOB
column, you just
need to choose how many characters or bytes you want to index, rather than the entire document. This
is called the key length.
CREATE INDEX doc_text ON documents(document(255));
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
Here we've chosen 255 characters as the length of our key, which means we can efficiently locate records based on their first 255 characters.
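Equivalently, you can declare the prefix length inline when creating the table; a sketch:
CREATE TABLE documents (
id bigint,
document text,
KEY doc_text (document(255))
);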
Indexing JSON documents
A related problem is indexing JSON documents. A popular technique is to store most of a table's actual data in a JSON document, like this.
CREATE TABLE vendors (
id bigint,
vendor_info json
);
Then we store most of the info about each vendor in the JSON payload.
{
"name": "United Plumbing",
"tax_id": "33-2343509-34"
}
But there's a problem with this technique. JSON
columns are stored the same way that TEXT
columns are in MySQL, which means you can't create an index on them directly. Creating an index on
JSON data is more complex than for other types, and involves two steps.
- Create a generated column on the part of the JSON document you want to index.
- Create an index on the generated column.
A generated column is exactly what it sounds like: a column whose value is computed automatically from an expression over other columns. The syntax to add a new generated column looks like this.
ALTER TABLE vendors ADD COLUMN
vendor_tax_id varchar(255) GENERATED ALWAYS AS (vendor_info->>'$.tax_id');
Now we can add an index on this column, no problem.
CREATE INDEX vendor_tax_id ON vendors (vendor_tax_id);
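With the index in place, queries that filter on the generated column can use it; a sketch:
EXPLAIN SELECT * FROM vendors WHERE vendor_tax_id = '33-2343509-34';
-- The key column of the EXPLAIN output should now show vendor_tax_id rather than NULL.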
What are indexes good for?
Now that we understand the basics of indexes in MySQL, let's talk about why you would want to use one.
Ensuring data uniqueness
A UNIQUE
index means the database will not allow you to have two rows with the same key. This
constraint lets you reason about your data and make guarantees about its correctness. Let's look at
an example.
CREATE TABLE vendors (
id BIGINT PRIMARY KEY,
name varchar(1024),
tax_id VARCHAR(255) NOT NULL,
CONSTRAINT UNIQUE (tax_id)
);
This table definition guarantees that we can't accidentally create a duplicate company (since every company has its own tax ID).
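Here's a sketch of the constraint doing its job, with made-up values:
INSERT INTO vendors VALUES (1, 'Turbo plumbing', '12-3456789');
INSERT INTO vendors VALUES (2, 'Copycat plumbing', '12-3456789');
-- The second insert fails with a duplicate-key error (ERROR 1062),
-- so two vendors can never end up sharing a tax_id.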
Fast lookups
Databases are very good at retrieving records using an index, which means queries that filter on a column with an index will be much faster than ones that don't. This can matter a lot when the table has many rows.
Without an index, to find the row you want, the database has to look at every row in the table to
see if it matches the WHERE
clause. This is called a full table scan. You can see whether it's happening by running an EXPLAIN on your query, like this:
EXPLAIN SELECT * FROM employees WHERE employee_id = 1;
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | const | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)
This is what you see when you're using an index (including a PRIMARY KEY
) in your WHERE
clause. Without the key, you see something like this:
mysql> explain select * from employees where employee_id = 1;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
You can see from the rows column that the query with the index examines only 1 row. Without the index, it examines every row in the table (only 3 in this case, but on a big table a full scan like this hurts performance for this query and every other one running at the same time).
Fast sorting
For analytical queries (ones where you expect to return lots of rows), you will often add an ORDER
BY
clause to make it easier to understand the results. This is especially important when you have a
lot of rows and want to process them in batches using a LIMIT
clause, like this.
SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;
In this scenario, I'm writing a batch job that processes employees 1000 at a time. As the batch job
runs, I update the OFFSET
as I go to get the first batch of 1000, the second batch, and so on.
Without an index, the database reads every row in the table and then sorts them all, spilling to a merge sort on disk if the result doesn't fit in memory.
EXPLAIN SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
| 1 | SIMPLE | EMPLOYEES | NULL | ALL | NULL | NULL | NULL | NULL | 3 | 100.00 | Using filesort |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)
The Using filesort in the Extra column is the indicator that this is happening.
But if you have an index you're sorting by (including a PRIMARY KEY
), the database can much more
efficiently fetch your rows. Here's the same query with an index.
EXPLAIN SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
| 1 | SIMPLE | EMPLOYEES | NULL | index | NULL | PRIMARY | 8 | NULL | 3 | 100.00 | NULL |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
On the first batch and on every subsequent batch, the database uses the index as a natural sorting mechanism: it walks the index in order, skips past the OFFSET rows, and reads the next 1000 directly, without doing an expensive sort operation. Much, much faster!
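The same idea applies to secondary indexes. With the emp_name index from earlier, a sort on the indexed columns can be satisfied by reading the index in order, at least when the query only selects those columns; a sketch:
EXPLAIN SELECT last_name, first_name FROM employees ORDER BY last_name, first_name;
-- Expect Using index in the Extra column instead of Using filesort, since emp_name
-- already stores (last_name, first_name) in sorted order.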
Joins
Any time you're joining two or more tables together, indexes are absolutely essential to getting good performance. All the performance problems you run into without indexes on a single table are multiplied by the number of rows in each table.
To drive this home, let's look at a join of three tables defined like this:
create table states (name varchar(100), population int unsigned);
create table cities (name varchar(100), state varchar(100) not null, population int unsigned);
create table people (name varchar(100), city varchar(100) not null);
Let's say that we want a list of people named "John Smith" along with names and populations of the cities and states they live in. We would write a query like this:
select * from people p
join cities c on p.city = c.name
join states s on s.name = c.state
where p.name = "John Smith";
Using some real numbers: let's use the US and say that there are 330,000,000 people
rows, 20,000
cities
rows, and 52 states
rows (we didn't forget you, DC and Puerto Rico). There are about
48,000 people named John
Smith
in the US. So with no primary keys, the database query planner will generate a query execution plan
that starts by finding every "John Smith" row in the people
table. Then for every one, it will
look at every row in the cities
table for a match. Then for every one of those rows, it will
look at every row in the states
table for a match. This means we end up examining 330,000,000 + (48,000 * 20,000 * 52) rows, or roughly 50 billion rows.
In case it wasn't obvious, this is bad. This query isn't returning.
Every modern database will use index information to make this join more intelligent, meaning it examines fewer rows. To be specific: if every table above has an index on its name column (for states and cities that could be a PRIMARY KEY; for people, where names repeat, it has to be a secondary index), the database will examine only the 48,000 "John Smith" rows to begin with. Then for each one, it will examine only a single row in the cities table, then only a single row in the states table. This comes out to 48,000 * 3 rows, or around 144,000 of them. That's roughly 350,000 times faster.
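Here's a sketch of the same tables with those indexes in place (in a real schema you'd more likely join on numeric id columns, but this keeps the example intact):
create table states (name varchar(100) primary key, population int unsigned);
create table cities (name varchar(100) primary key, state varchar(100) not null, population int unsigned);
create table people (name varchar(100), city varchar(100) not null, key (name));
-- people.name can't be a primary key since names repeat, so a plain secondary index does the job.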
In short: indexes are the difference between your JOIN
queries working, or not.
Foreign key constraints
Foreign key constraints require an index on the referenced columns of the parent table. If you try to create a foreign key constraint without this index, MySQL won't allow it:
CREATE TABLE parent(
id int,
data varchar(100)
);
Query OK, 0 rows affected (0.02 sec)
CREATE TABLE child (
id int,
constraint foreign key fk (id) references parent (id)
);
ERROR 1822 (HY000): Failed to add the foreign key constraint. Missing index for constraint 'child_ibfk_1' in the referenced table 'parent'
This is because enforcing the foreign key constraint efficiently requires being able to quickly determine whether a value being inserted in the child table is present in the parent table. In other words, it requires an index.
Adding an index to the parent
table fixes our problem and allows the constraint to be
created.
CREATE INDEX parent_id ON parent(id);
Query OK, 0 rows affected (0.03 sec)
CREATE TABLE child (
id int,
constraint foreign key fk (id) references parent (id)
);
Query OK, 0 rows affected (0.04 sec)
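The index is also what lets MySQL check the constraint cheaply on every write to the child table. A sketch of the enforcement:
INSERT INTO parent VALUES (1, 'some data');
INSERT INTO child VALUES (1); -- succeeds: parent row 1 exists
INSERT INTO child VALUES (2); -- fails with a foreign key error (ERROR 1452): no parent row has id 2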
Downsides to indexes
You may have read the above and concluded that indexes are great, so why not put one on every column? Why not put one on every combination of columns?
Of course there are downsides to having lots of indexes.
- Creating an index can be expensive. On a large table it can take minutes or hours, and depending on your MySQL version, storage engine, and DDL settings, it can block or slow down writes to the table while it builds.
- Indexes take space, similar to the tables themselves. Each additional index you create increases the amount of storage your database needs to store the same number of rows.
- Each index on a table makes writes to that table slower. As a rough rule of thumb, every index you have beyond the primary key costs about as much to write as the row itself. So a table with 3 secondary indexes will cost roughly 4 times as much to insert into as a table without them.
So the overall rule is: use an index only where you need one, and try to anticipate those needs ahead of time to avoid expensive downtime or migration costs.
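The good news is that an index you no longer need is easy to drop, which reclaims the storage and the write overhead; a sketch:
DROP INDEX emp_name ON employees;
-- or equivalently:
-- ALTER TABLE employees DROP INDEX emp_name;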
Conclusion
This tutorial should get you on your way using indexes in MySQL. They are essential to writing a production database schema that is easy to use and performs well.
Before you go, did we mention that we built a MySQL-compatible database with built-in version
control called Dolt? Dolt is a great way to experiment with databases, because
you have the safety of version control to fall back on -- if you mess up, you can always dolt reset
--hard
to roll back to a previous revision. Dolt also lets you diff two revisions of your database,
so you can see what you changed whenever you run a command.
Dolt is free and open source, so go check it out! All of the examples in this tutorial work in Dolt's built-in SQL shell, so you can use it to follow along at home.
Have questions about Dolt or MySQL indexes? Join us on Discord to talk to our engineering team and meet other Dolt users.