How Dolt Types Work

9 min read

UPDATED FEBRUARY 10, 2021: Updated the final table with the types that have been added to Dolt since the article was first written.

When we started on Dolt, our goal was to apply Git's idea of versioning to data. Whereas Git versions files, Dolt versions tables. As the creator of a database, you have the choice of how to represent any value. We chose to adopt the idea of having a schema for each of our tables, meaning that any one column has a definition of the type of data that it represents, and all cells under that column adhere to that type.

As an example, let's take a look at the value 1, and let's store it in a column we'll refer to as val.

val
1

If someone were to retrieve the value, what would it represent? Would it represent a count? Would it instead represent a true/false value? Perhaps it represents a character encoded in the ASCII format to denote the start of a message header? It is ambiguous. This could be fixed by using a more descriptive column name.

Number Of Books
1
IV
1101

This now highlights another problem, in that we can add any values that we want, so we inserted the values IV (which is 4 in Roman numerals) and 1101 (which is 13 in binary). Each of these are considered valid values in the column by its name alone. By having a type, the database can verify that the value given conforms to the type, converting if need be, or rejecting if the value is not understood to map to the given type.

Number Of Books int32
1
4
13

This type information is even more handy in a database built for sharing like Dolt. Have you ever tried importing a CSV with N/A in a numerical column to represent NULL? We have. It's more work than it should be. Moreover, the type information of a column communicates the schema creator’s intent. It's like documentation that is built-in.

First there was Noms

We initially built Dolt on top of Noms. Although we have made significant changes along the way, their implementation of types was our starting point. In Noms, each value is stored on disk as an array of bytes, and those bytes are prefixed with an integer.

Kind                    Kind
  ↓  Data→                ↓  Data→
-------------------------------------------------
| 1 |   |   |   |   |   | 3 |   |   |   |   |   |
-------------------------------------------------

This integer represents the kind, which is Noms' way of stating the type that those bytes represent, and also how to interpret those bytes. Noms has a general-purpose type system, but for our purpose of collaborating on tabled data, we started using the three scalar types:

Type Description
bool represents either true or false
number IEEE-754 double precision number
string an array of characters

Each type in Noms has a few properties, such as defining equality, sortability, how to encode and decode from a byte array, and also how to hash the value. When we began working on Dolt, we added our own types into the mix, while also renaming number to float64.

Type Description
bool represents either true or false
float64 IEEE-754 double precision number
int64 integers from -9223372036854775808 to 9223372036854775807
string an array of characters
uint64 integers from 0 to 18446744073709551615
uuid 128-bit value as defined by RFC 4122

And then there was SQL

This was sufficient for a few months, until we began to add SQL support. SQL allows one to use the data in many exciting different ways, and also works as a standard and well-known interface to tabled data. We specifically chose the MySQL variant to target first due to its popularity, and with it came a lot more types that we did not support.

SQL Type SQL Aliases Description
BIGINT integers from -9223372036854775808 to 9223372036854775807
BIGINT UNSIGNED integers from 0 to 18446744073709551615
BINARY an array of bytes, up to a maximum of 255
BIT depending on bit depth, holds integers from 0 to 18446744073709551615
BLOB an array of bytes, up to a maximum of 65535
CHAR CHARACTER an array of character, up to a maximum of 255
DATE dates from 1000-01-01 to 9999-12-31
DATETIME dates and times from 1000-01-01 00:00:00 to 9999-12-31 23:59:59
DECIMAL FIXED, DEC, NUMERIC number with a specifiable number of digits, both total and fractional
DOUBLE DOUBLE PRECISION, REAL IEEE-754 double precision number
ENUM definable collection of allowed strings, with a maximum of 65535 entries
FLOAT IEEE-754 single precision number
GEOMETRY a supertype that can hold a point, linestring, or polygon
GEOMETRYCOLLECTION a collection of any geometry-type object
INT INTEGER integers from -2147483648 to 2147483647
INT UNSIGNED INTEGER UNSIGNED integers from 0 to 4294967295
JSON represents the JavaScript Object Notation format for a string
LINESTRING a collection of points forming a line
LONGBLOB an array of bytes, up to a maximum of 4294967295
LONGTEXT an array of characters, up to a maximum of 4294967295
MEDIUMBLOB an array of bytes, up to a maximum of 16777215
MEDIUMINT integers from -8388608 to 8388607
MEDIUMINT UNSIGNED integers from 0 to 16777215
MEDIUMTEXT LONG, LONG VARCHAR an array of characters, up to a maximum of 4194303 (for utf8 encoding)
MULTILINESTRING a collection of linestrings
MULTIPOINT a collection of points
MULTIPOLYGON a collection of polygons
NCHAR NATIONAL CHAR shortcut for appending CHARACTER SET utf8mb3 to CHAR
NVARCHAR NATIONAL CHARACTER VARYING shortcut for appending CHARACTER SET utf8mb3 to VARCHAR
POINT represents a point in space
POLYGON a collection of points forming a shape
SET definable collection of allowed strings, with a max of 64 entries, and allows combinations
SMALLINT integers from -32768 to 32767
SMALLINT UNSIGNED integers from 0 to 65535
TEXT an array of characters, up to a maximum of 16383 (for utf8 encoding)
TIME times from -838:59:59 to 838:59:59
TIMESTAMP dates and times from 1970-01-01 00:00:01 to 2038-01-19 03:14:07
TINYBLOB an array of bytes, up to a maximum of 255
TINYINT BOOLEAN, BOOL integers from -128 to 127
TINYINT UNSIGNED integers from 0 to 255
TINYTEXT an array of characters, up to a maximum of 63 (for utf8 encoding)
VARBINARY an array of bytes, up to a maximum of 65535
VARCHAR CHARACTER VARYING an array of characters, up to a maximum of 16383 (for utf8 encoding)
YEAR integers from 1901 to 2155, along with 0

Our first solution was to map any types that are more restrictive to their less restrictive variants, such as a TINYINT (int8) to a BIGINT (int64), and store them as our closest Noms-equivalent type. Although the user lost a bit of type information, it would still allow them to export their data from MySQL and import it into Dolt. This solution did not work for other types, such as DATETIME, as there was no clear analog in Dolt, and thus we had to add two more types to our Noms-based backend: timestamp and inlineblob.

Type Description
bool represents either true or false
float64 IEEE-754 double precision number
inlineblob an array of bytes, up to a maximum of 65535 bytes
int64 integers from -9223372036854775808 to 9223372036854775807
string an array of characters
timestamp dates and times from -200000000-01-01 00:00:00 UTC to 200000000-12-31 23:59:59 UTC
uint64 integers from 0 to 18446744073709551615
uuid 128-bit value as defined by RFC 4122

This allowed us to map to many of the common types seen in MySQL databases.

SQL Type Noms Type
BIGINT int64
BIGINT UNSIGNED uint64
BINARY
BIT
BLOB
CHAR string
DATE timestamp
DATETIME timestamp
DECIMAL
DOUBLE float64
ENUM
FLOAT float64
GEOMETRY
GEOMETRYCOLLECTION
INT int64
INT UNSIGNED uint64
JSON
LINESTRING
LONGBLOB
LONGTEXT string
MEDIUMBLOB
MEDIUMINT int64
MEDIUMINT UNSIGNED uint64
MEDIUMTEXT string
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHAR
NVARCHAR
POINT
POLYGON
SET
SMALLINT int64
SMALLINT UNSIGNED int64
TEXT string
TIME
TIMESTAMP timestamp
TINYBLOB
TINYINT int64
TINYINT UNSIGNED uint64
TINYTEXT string
VARBINARY
VARCHAR string
YEAR
bool
inlineblob
uuid

Of note, both bool and uuid do not have an equivalent in SQL, and thus are inaccessible as a backing Noms type when creating a table through SQL, since no types map to them. Although inlineblob would satisfy some of the binary/blob types, it would not work for LONGBLOB, so its integration with SQL types was postponed.

Enter go-mysql-server

As work on the SQL engine continued, we began running into the issue that our Noms types worked a bit differently than MySQL's types under some circumstances, and thus we worked to achieve complete parity with their types. Our goal was to be a fully drop-in replacement for MySQL, and that included all of the idiosyncrasies that may not be immediately apparent. This way, one could round-trip data between MySQL and Dolt without any surprises or loss of data. It would also allow anyone to gain the versioning capabilities of Dolt while keeping the workflow that has already been established with their MySQL integration.

We adopted a SQL engine built on top of our fork of the go-mysql-server project. In order to achieve full type compliance, most of the type code had to be rewritten. We referenced MySQL's documentation, however we found a lot of little interactions that were not documented. For example, assigning the string 1 to an integer column (BIGINT, etc.) would parse the string as though it contained an integer. However, assigning the string 1 to a BIT column would parse the string as though it were an array of utf8-encoded characters, which it would then convert to bytes, giving a value of 49 in this example. Considering how similarly BIT behaves relative to the other integer types, this difference was not assumed to have existed until it was manually found.

We have a test framework for testing correctness, however, the focus of the correctness tests is less about types and more on query execution and features. This meant that a lot of manual testing with a MySQL instance was used to better understand how each type worked, writing tests for the observed behavior to guarantee our implementation's parity.

Putting it all together in Dolt

Once the SQL engine handled types properly, it was on to the Dolt side. In Dolt, a schema’s type was just a reference to the Noms type, and the decision was made to give Dolt its own concept of types, and for those types to handle how SQL and Noms types relate to Dolt. This way, Dolt did not need to map SQL types directly to similar Noms types, and could freely convert between the two representations. This allowed for many SQL types, such as SET, to map to a completely different Noms type, such as uint64, as the Dolt type could handle the uint64 as a bit-array, and map each bit to a SET entry. This also had another benefit: any Dolt types that do not exist in SQL now have an interface through which SQL can interact with them, retaining compatibility with old Dolt databases. For example, uuid does not have a parallel in SQL, and thus appears as a CHAR(36) when accessed through SQL, however still retains all of the type checking necessary for that type. In the future, if we decide to extend our SQL types to support uuids, then it's as simple as changing the displayed SQL type, without changing anything in Noms. For the majority of SQL types, the Dolt type is simply a conversion from the SQL value to the closest Noms value that can uniquely represent that value, and vice-versa.

SQL Type Dolt Type Noms Type
BIGINT number[i64] int64
BIGINT UNSIGNED number[u64] uint64
BINARY inlineblob[bin, length] inlineblob
BIT bit[depth] uint64
BLOB varbinary[65535] blob
CHAR varstring[char] string
DATE datetime[date] timestamp
DATETIME datetime[dt] timestamp
DECIMAL decimal[prec, scale] decimal
DOUBLE number[f64] float64
ENUM enum[entries] int64
FLOAT number[f32] float64
GEOMETRY
GEOMETRYCOLLECTION
INT number[i32] int64
INT UNSIGNED number[u32] uint64
JSON
LINESTRING
LONGBLOB varbinary[4294967295] blob
LONGTEXT varstring[text, 4294967295] string
MEDIUMBLOB varbinary[16777215] blob
MEDIUMINT number[i24] int64
MEDIUMINT UNSIGNED number[u24] uint64
MEDIUMTEXT varstring[text, enc_length] string
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHAR varstring[char, length] string
NVARCHAR varstring[var, length] string
POINT
POLYGON
SET set[entries] uint64
SMALLINT number[i16] int64
SMALLINT UNSIGNED number[u16] int64
TEXT varstring[text, enc_length] string
TIME time int64
TIMESTAMP datetime[ts] timestamp
TINYBLOB varbinary[255] blob
TINYINT number[i8] int64
TINYINT UNSIGNED number[u8] uint64
TINYTEXT varstring[text, enc_length] string
VARBINARY inlineblob[var, length] inlineblob
VARCHAR varstring[var, length] string
YEAR year int64

For a SMALLINT (int16), Dolt has a number type that converts it to an int64 for Noms to handle, and back to an int16 when retrieving the value, using the SQL engine to verify the value before storage. This way, a Dolt type does not need to worry about how to hash a value or perform any of the other operations, as the underlying Noms value already does that.

What's still missing?

This has allowed us to add the majority of types from MySQL, however there are still a few missing, such as: JSON, binary types, and GEOMETRY (including all subtypes). We also save character sets and collations on all relevant types, however they are currently ignored. We are working hard to bring you these missing types, so check our release notes on GitHub to stay updated! Better yet, if you need a new type, implement it and send us a pull request.

Conclusion

Types are an integral part of a database meant for sharing data. They communicate intention, while also enforcing homogeneity to the data. It is imperative that we get types right, and our journey through building Dolt has landed us with a type implementation that we feel is stable, extensible, and ready for the challenges that our users will demand.

If you enjoyed reading this blog, why not checkout Dolt, or browse our repositories here on DoltHub?

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.