MySQL Secondary Indexes: How and why with examples

13 min read

One of the most important concepts in SQL is the index. This tutorial will teach you what an index is, how to use it, and what it's good for.

What's an index?

An index is a way for a database to quickly locate a particular set of rows. Indexes use sorted storage or some other form of organization to make it possible to find the value you're looking for without looking at every one. As an example, for 1 million records, an index makes it possible to find one value doing only about 20 comparisons, rather than 1 million comparisons. This massive increase in efficiency is due to the use of data structures that make use of binary search or another efficient search algorithm.

The bottom line is that judicious use of indexes can make your queries much, much faster.

Last week we talked about primary keys in MySQL, which are a special kind of index. Read that first if you're just getting started learning about indexes in MySQL.

This tutorial will cover secondary indexes -- that is, indexes other than primary keys.

Creating an index

We'll start with the example employees table from the previous blog post. Here's how we create it.

CREATE TABLE employees (
    employee_id bigint,
    first_name varchar(100),
    last_name varchar(100),
    title varchar(80),
    PRIMARY KEY (employee_id)
);

We can insert rows into our table like this:

INSERT INTO employees VALUES
    (1, 'Tim', 'Sehn', 'CEO'),
    (2, 'Brian', 'Hendricks', 'Software Engineer'),
    (3, 'Brian', 'Fitzgerald', 'Salesman');

And our data looks like this:

SELECT * FROM employees;
+-------------+------------+------------+-------------------+
| employee_id | first_name | last_name  | title             |
+-------------+------------+------------+-------------------+
| 1           | Tim        | Sehn       | CEO               |
| 2           | Brian      | Hendricks  | Software Engineer |
| 3           | Brian      | Fitzgerald | Salesman          |
+-------------+------------+------------+-------------------+
3 rows in set (0.01 sec)

As our company grows, we'll eventually have thousands of employees. When we build our company directory, we want to be able to search for an employee by name. The query we run will look like this:

SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';

When we run this query on the table, it has to examine every single row looking for a match. This is called a full table scan, and it destroys your query's performance.

Let's add an index to speed this query up.

CREATE INDEX emp_name ON employees (last_name, first_name);

This statement adds an index on two columns of the employees table. Let's go over it line by line.

CREATE INDEX
    emp_name -- The name of this index. Index names must be unique.
    ON employees -- the name of the table to add the index to
    (last_name, first_name); -- the table columns to index, in order

You can also use this alternate syntax, which does the same thing.

ALTER TABLE employees ADD KEY (last_name, first_name);

Using our index

Now that we created an index, we can use the EXPLAIN keyword to see how it affects our query performance.

EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
| id | select_type | table     | partitions | type | possible_keys | key      | key_len | ref         | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
|  1 | SIMPLE      | employees | NULL       | ref  | emp_name      | emp_name | 806     | const,const |    1 |   100.00 | NULL  |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

This EXPLAIN output indicates we're using the key emp_name and matching exactly one row, which is what we want. Compare to the output without an index defined on these columns.

EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian' AND last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | employees | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

Without an index, we have NULL for our key, and have to scan all 3 rows in the table to get our result. For large table with many rows, this will yield terrible performance.

Partial index matching

Above, we defined our index on (last_name, first_name), so there are two columns in the table. Can we use just one of them to do a lookup on our index? Well, it depends.

For the last_name column, yes.

EXPLAIN SELECT * FROM employees WHERE last_name = 'Fitzgerald';
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
| id | select_type | table     | partitions | type | possible_keys | key      | key_len | ref   | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | employees | NULL       | ref  | emp_name      | emp_name | 403     | const |    1 |   100.00 | NULL  |
+----+-------------+-----------+------------+------+---------------+----------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

That works fine and uses the index. But the first_name column does not:

EXPLAIN SELECT * FROM employees WHERE first_name = 'Brian';
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | employees | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.01 sec)

Now we're back to a full table scan (NULL key and ALL as our type).

Why the difference? We defined the index as (last_name, first_name), and the order matters. MySQL can use an index if you specify a prefix of its columns (the first one or more columns in it). The more columns you specify, in the order defined by the index, the better MySQL can use it to match rows, the fewer rows you need to scan, and the faster your query gets.

Unique indexes

Depending on your data, it might be a good idea to declare your index to be UNIQUE. Uniqueness is more commonly declared as a CONSTRAINT on a table, like this:

CREATE TABLE vendors (
    id BIGINT PRIMARY KEY,
    name varchar(1024),
    tax_id VARCHAR(255),
    CONSTRAINT UNIQUE (tax_id)
);

This ensures that nobody accidentally creates a duplicate vendor, since we know that only one company can have a given tax identifier.

Under the covers, MySQL and other databases create an index for every UNIQUE constraint, because that's the only way to enforce the constraint efficiently. Accordingly, you can also create a unique index with the CREATE INDEX statement.

CREATE UNIQUE INDEX unique_tax_id ON vendors (tax_id);

Null values in indexes

Unlike a primary key, which can never contain NULL values, indexes can. Using our vendors table above:

INSERT INTO vendors VALUES (1, 'Turbo plumbing', NULL), (2, 'General contracting', NULL);
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0

SELECT * FROM vendors;
+----+---------------------+--------+
| id | name                | tax_id |
+----+---------------------+--------+
|  1 | Turbo plumbing      | NULL   |
|  2 | General contracting | NULL   |
+----+---------------------+--------+
2 rows in set (0.00 sec)

So even thought we have a UNIQUE index on the tax_id column, I can insert two NULL values into it. In MySQL and other SQL databases, NULL is a special value, not equal to any other value including NULL, so it doesn't violate a UNIQUE constraint.

Sometimes this is what you want, but in our case it's not -- we need every vendor to have a tax ID. To get this behavior, we need to separately declare the column NOT NULL to prevent NULL values from being inserted.

CREATE TABLE vendors (
    id BIGINT PRIMARY KEY,
    name varchar(1024),
    tax_id VARCHAR(255) NOT NULL,
    CONSTRAINT UNIQUE (tax_id)
);

Examining indexes

MySQL has some convenience statements used to inspect your database schemas, but they aren't part of standard SQL, so you might not be aware of them if you're used to another dialect. There are two that are handy when it comes to understanding the indexes on a table.

First, SHOW CREATE TABLE.

SHOW CREATE TABLE vendors;
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table   | Create Table

     |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| vendors | CREATE TABLE `vendors` (
  `id` bigint NOT NULL,
  `name` varchar(1024) DEFAULT NULL,
  `tax_id` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_tax_id` (`tax_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This statement outputs the entire table definition, including all constraints and indexes.

If you just want to understand the index info for a table, you can use the SHOW INDEXES statement instead.

SHOW INDEXES FROM vendors;
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table   | Non_unique | Key_name      | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| vendors |          0 | PRIMARY       |            1 | id          | A         |           2 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| vendors |          0 | unique_tax_id |            1 | tax_id      | A         |           1 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+---------+------------+---------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
2 rows in set (0.02 sec)

Note that the PRIMARY KEY is an index as well, so it's included in the result.

Secondary indexes on TEXT and BLOB columns

MySQL has a couple types that are used for storing large documents: TEXT for character data and BLOB for binary data. This is useful when you have to store data that's longer than the maximum length of a row in MySQL (64 KB).

But what if you need to create an index on one of these columns? MySQL doesn't allow that, because it would need to index an unlimited amount of data. Here's what happens when you try:

CREATE TABLE documents(
    id bigint, 
    document text);
Query OK, 0 rows affected (0.03 sec)

CREATE INDEX doc_text ON documents(document);
ERROR 1170 (42000): BLOB/TEXT column 'document' used in key specification without a key length

There's a solution though. When you need to create an index on a TEXT or BLOB column, you just need to choose how many characters or bytes you want to index, rather than the entire document. This is called the key length.

CREATE INDEX doc_text on documents(document(255));
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Here we've chosen 255 characters as the length of our key, which means we can efficiently locate records based on their first 255 characters.

Indexing JSON documents

A related problem is indexing JSON documents. A popular technique is to store most of a table's actual data in a JSON document, like this.

CREATE TABLE vendors (
    id bigint, 
    vendor_info json
);

Then we store most of the info about each vendor in the JSON payload.

{
  "name": "United Plumbing",
  "tax_id": "33-2343509-34"
}

But there's a problem with this technique. JSON columns are stored the same way that TEXT columns are in MySQL, which means you can't create an index on them directly. Creating an index on JSON data is more complex than for other types, and involves two steps.

  1. Create a generated column on the part of the JSON document you want to index.
  2. Create an index on the geneerated column.

A generated column is exactly what it sounds like: it's a column whose value always comes from some other column. The syntax to add a new generated column looks like this.

ALTER TABLE vendors ADD COLUMN 
    vendor_tax_id varchar(255) GENERATED ALWAYS AS (vendor_info->>'$.tax_id');

Now we can add an index on this column no problem.

CREATE INDEX vendor_tax_id ON vendors (vendor_tax_id);

What are indexes good for?

Now that we understand the basics of indexes in MySQL, let's talk about why you would want to use one.

Ensuring data uniqueness

A UNIQUE index means the database will not allow you to have two rows with the same key. This constraint lets you reason about your data and make guarantees about its correctness. Let's look at an example.

CREATE TABLE vendors (
    id BIGINT PRIMARY KEY,
    name varchar(1024),
    tax_id VARCHAR(255) NOT NULL,
    CONSTRAINT UNIQUE (tax_id)
);

This table definition guarantees that we can't accidentally create a duplicate company (since every company has its own tax ID).

Fast lookups

Databases are very good at retrieving records using an index, which means queries that filter on a column with an index will be much faster than ones that don't. This can matter a lot when the table has many rows.

Without an index, to find the row you want, the database has to look at every row in the table to see if it matches the WHERE clause. This is called a full table scan. You can see that's what's happening if you run an EXPLAIN on your query, like this:

EXPLAIN SELECT * FROM employees WHERE employee_id = 1;
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table     | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | employees | NULL       | const | PRIMARY       | PRIMARY | 8       | const |    1 |   100.00 | NULL  |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.01 sec)

This is what you see when you're using an index (including a PRIMARY KEY) in your WHERE clause. Without the key, you see something like this:

mysql> explain select * from employees where employee_id = 1;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | employees | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |    33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

You can see from the rows column that the table with the index examines only 1 row. Without, it examines every row in the table (only 3 in this case, but for a big table this query would destroy performance for this query and every other running at the time).

Fast sorting

For analytical queries (ones where you expect to return lots of rows), you will often add an ORDER BY clause to make it easier to understand the results. This is especially important when you have a lot of rows and want to process them in batches using a LIMIT clause, like this.

SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;

In this scenario, I'm writing a batch job that processes employees 1000 at a time. As the batch job runs, I update the OFFSET as I go to get the first batch of 1000, the second batch, and so on.

Without an index, the database gets all the rows in the table, and then sorts them using a merge sort on disk.

EXPLAIN SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra          |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
|  1 | SIMPLE      | EMPLOYEES | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    3 |   100.00 | Using filesort |
+----+-------------+-----------+------------+------+---------------+------+---------+------+------+----------+----------------+
1 row in set, 1 warning (0.00 sec)

The Using filesort is the indicator that this is happening.

But if you have an index you're sorting by (including a PRIMARY KEY), the database can much more efficiently fetch your rows. Here's the same query with an index.

EXPLAIN SELECT * from EMPLOYEES ORDER BY employee_id LIMIT 1000 OFFSET 2000;
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
| id | select_type | table     | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
|  1 | SIMPLE      | EMPLOYEES | NULL       | index | NULL          | PRIMARY | 8       | NULL |    3 |   100.00 | NULL  |
+----+-------------+-----------+------------+-------+---------------+---------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)

On the first batch and on every subsequent batch, the database uses the index as a natural sorting mechanism and just fetches that many records directly without doing an expensive disk-based sort operation. It can very efficiently calculate how many records it needs to skip, then just read the next 1000 in order. Much, much faster!

Joins

Any time you're joining two or more tables together, indexes are absolutely essential to getting good performance. All the performance problems you run into without indexes on a single table are multiplied by the number of rows in each table.

To drive this home, let's look at a join of three tables defined like this:

create table states (name varchar(100), population int unsigned);
create table cities (name varchar(100), state varchar(100) not null, population int unsigned);
create table people (name varchar(100), city varchar(100) not null);

Let's say that we want a list of people named "John Smith" along with names and populations of the cities and states they live in. We would write a query like this:

select * from people p
    join cities c on p.city = c.name
    join states s on s.name = c.state
    where p.name = "John Smith";

Using some real numbers: let's use the US and say that there are 330,000,000 people rows, 20,000 cities rows, and 52 states rows (we didn't forget you, DC and Puerto Rico). There are about 48,000 people named John Smith in the US. So with no primary keys, the database query planner will generate a query execution plan that starts by finding every "John Smith" row in the people table. Then for every one, it will look at every row in the the cities table for a match. Then for every one of those rows, it will look at every row in the states table for a match. This means we end up examining 330,000,000 + (48,000 * 20,000 * 50) rows, or about 48.3 billion rows.

In case it wasn't obvious, this is bad. This query isn't returning.

Every modern database will use primary key information to make this join more intelligent, meaning it examines fewer rows. To be specific: if every table above has a PRIMARY KEY on its name column, the database will examine only the 48,000 "John Smith" rows to begin with. Then for each one, it will examine only a single row in the cities table, then only a single row in the states table. This comes out to 48,000 * 3 rows, or around 144,000 of them. This is 10 million times faster performance.

In short: indexes are the difference between your JOIN queries working, or not.

Foreign key constraints

Foreign key constraints require the use of an index on the refernced columns of the parent table. If you try to create a foreign key constraint without this index, MySQL won't allow it:

CREATE TABLE parent(
    id int, 
    data varchar(100)
);
Query OK, 0 rows affected (0.02 sec)

CREATE TABLE child (
    id int, 
    constraint foreign key fk (id) references parent (id)
);
ERROR 1822 (HY000): Failed to add the foreign key constraint. Missing index for constraint 'child_ibfk_1' in the referenced table 'parent'

This is because enforcing the foreign key constraint efficiently requires being able to quickly determine whether a value being inserted in the child table is present in the parent table. In other words, it requires an index.

Adding an index to the parent table fixes our problem and allows the constraint to be created.

CREATE INDEX parent_id ON parent(id);
Query OK, 0 rows affected (0.03 sec)

CREATE TABLE child (
    id int, 
    constraint foreign key fk (id) references parent (id)
);
Query OK, 0 rows affected (0.04 sec)

Downsides to indexes

You may have read the above and concluded that indexes are great, so why not put one on every column? Why not put one on every combination of columns?

Of course there are downsides to having lots of indexes.

  • Creating an index is a blocking operation. On a large table it can take minutes or hours, and during that time you can't write to the table affected.
  • Indexes take space, similar to the tables themselves. Each additional index you create increases the amount of storage your database takes to store the same amount of rows.
  • Each index on a table makes writes to that table slower. As a rough rule of thumb, every index you have beyond the primary key costs as much as the row itself. So a table with 3 secondary indexes will cost 4 times as much to insert into as a table without them.

So the overall rule is: use an index only where you need one, and try to anticipate those needs ahead of time to avoid expensive downtime or migration costs.

Conclusion

This tutorial should get you on your way using indexes in MySQL. They are essential to writing a production database schema that is easy to use and performs well.

Before you go, did we mention that we built a MySQL-compatible database with built-in version control called Dolt? Dolt is a great way to experiment with databases, because you have the safety of version control to fall back on -- if you mess up, you can always dolt reset --hard to roll back to a previous revision. Dolt also lets you diff two revisions of your database, so you can see what you changed whenever you run a command.

Dolt is free and open source, so go check it out! All of the examples in this tutorial work in Dolt's built-in SQL shell, so you can use it to follow along at home.

Have questions about Dolt or MySQL indexes? Join us on Discord to talk to our engineering team and meet other Dolt users.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.