How Dolt Types Work

April 15, 2020

9 min read

UPDATED FEBRUARY 10, 2021: Updated the final table with the types that have been added to Dolt since the article was first written.

When we started on Dolt, our goal was to apply Git's idea of versioning to data. Whereas Git versions files, Dolt versions tables. As the creator of a database, you have the choice of how to represent any value. We chose to adopt the idea of having a schema for each of our tables, meaning that any one column has a definition of the type of data that it represents, and all cells under that column adhere to that type.

As an example, let's take a look at the value 1, and let's store it in a column we'll refer to as val.

val
1

If someone were to retrieve the value, what would it represent? Would it represent a count? Would it instead represent a true/false value? Perhaps it represents a character encoded in the ASCII format to denote the start of a message header? It is ambiguous. This could be fixed by using a more descriptive column name.

Number Of Books
1
IV
1101

This now highlights another problem, in that we can add any values that we want, so we inserted the values IV (which is 4 in Roman numerals) and 1101 (which is 13 in binary). Each of these are considered valid values in the column by its name alone. By having a type, the database can verify that the value given conforms to the type, converting if need be, or rejecting if the value is not understood to map to the given type.

Number Of Books `int32`
1
4
13

This type information is even more handy in a database built for sharing like Dolt. Have you ever tried importing a CSV with N/A in a numerical column to represent NULL? We have. It's more work than it should be. Moreover, the type information of a column communicates the schema creator’s intent. It's like documentation that is built-in.

First there was Noms

We initially built Dolt on top of Noms. Although we have made significant changes along the way, their implementation of types was our starting point. In Noms, each value is stored on disk as an array of bytes, and those bytes are prefixed with an integer.

Kind                    Kind
  ↓  Data→                ↓  Data→
-------------------------------------------------
| 1 |   |   |   |   |   | 3 |   |   |   |   |   |
-------------------------------------------------

This integer represents the kind, which is Noms' way of stating the type that those bytes represent, and also how to interpret those bytes. Noms has a general-purpose type system, but for our purpose of collaborating on tabled data, we started using the three scalar types:

Type	Description
bool	represents either true or false
number	IEEE-754 double precision number
string	an array of characters

Each type in Noms has a few properties, such as defining equality, sortability, how to encode and decode from a byte array, and also how to hash the value. When we began working on Dolt, we added our own types into the mix, while also renaming number to float64.

Type	Description
bool	represents either true or false
float64	IEEE-754 double precision number
int64	integers from -9223372036854775808 to 9223372036854775807
string	an array of characters
uint64	integers from 0 to 18446744073709551615
uuid	128-bit value as defined by RFC 4122

And then there was SQL

This was sufficient for a few months, until we began to add SQL support. SQL allows one to use the data in many exciting different ways, and also works as a standard and well-known interface to tabled data. We specifically chose the MySQL variant to target first due to its popularity, and with it came a lot more types that we did not support.

SQL Type	SQL Aliases	Description
BIGINT		integers from -9223372036854775808 to 9223372036854775807
BIGINT UNSIGNED		integers from 0 to 18446744073709551615
BINARY		an array of bytes, up to a maximum of 255
BIT		depending on bit depth, holds integers from 0 to 18446744073709551615
BLOB		an array of bytes, up to a maximum of 65535
CHAR	CHARACTER	an array of character, up to a maximum of 255
DATE		dates from 1000-01-01 to 9999-12-31
DATETIME		dates and times from 1000-01-01 00:00:00 to 9999-12-31 23:59:59
DECIMAL	FIXED, DEC, NUMERIC	number with a specifiable number of digits, both total and fractional
DOUBLE	DOUBLE PRECISION, REAL	IEEE-754 double precision number
ENUM		definable collection of allowed strings, with a maximum of 65535 entries
FLOAT		IEEE-754 single precision number
GEOMETRY		a supertype that can hold a point, linestring, or polygon
GEOMETRYCOLLECTION		a collection of any geometry-type object
INT	INTEGER	integers from -2147483648 to 2147483647
INT UNSIGNED	INTEGER UNSIGNED	integers from 0 to 4294967295
JSON		represents the JavaScript Object Notation format for a string
LINESTRING		a collection of points forming a line
LONGBLOB		an array of bytes, up to a maximum of 4294967295
LONGTEXT		an array of characters, up to a maximum of 4294967295
MEDIUMBLOB		an array of bytes, up to a maximum of 16777215
MEDIUMINT		integers from -8388608 to 8388607
MEDIUMINT UNSIGNED		integers from 0 to 16777215
MEDIUMTEXT	LONG, LONG VARCHAR	an array of characters, up to a maximum of 4194303 (for utf8 encoding)
MULTILINESTRING		a collection of linestrings
MULTIPOINT		a collection of points
MULTIPOLYGON		a collection of polygons
NCHAR	NATIONAL CHAR	shortcut for appending CHARACTER SET utf8mb3 to CHAR
NVARCHAR	NATIONAL CHARACTER VARYING	shortcut for appending CHARACTER SET utf8mb3 to VARCHAR
POINT		represents a point in space
POLYGON		a collection of points forming a shape
SET		definable collection of allowed strings, with a max of 64 entries, and allows combinations
SMALLINT		integers from -32768 to 32767
SMALLINT UNSIGNED		integers from 0 to 65535
TEXT		an array of characters, up to a maximum of 16383 (for utf8 encoding)
TIME		times from -838:59:59 to 838:59:59
TIMESTAMP		dates and times from 1970-01-01 00:00:01 to 2038-01-19 03:14:07
TINYBLOB		an array of bytes, up to a maximum of 255
TINYINT	BOOLEAN, BOOL	integers from -128 to 127
TINYINT UNSIGNED		integers from 0 to 255
TINYTEXT		an array of characters, up to a maximum of 63 (for utf8 encoding)
VARBINARY		an array of bytes, up to a maximum of 65535
VARCHAR	CHARACTER VARYING	an array of characters, up to a maximum of 16383 (for utf8 encoding)
YEAR		integers from 1901 to 2155, along with 0

Our first solution was to map any types that are more restrictive to their less restrictive variants, such as a TINYINT (int8) to a BIGINT (int64), and store them as our closest Noms-equivalent type. Although the user lost a bit of type information, it would still allow them to export their data from MySQL and import it into Dolt. This solution did not work for other types, such as DATETIME, as there was no clear analog in Dolt, and thus we had to add two more types to our Noms-based backend: timestamp and inlineblob.

Type	Description
bool	represents either true or false
float64	IEEE-754 double precision number
inlineblob	an array of bytes, up to a maximum of 65535 bytes
int64	integers from -9223372036854775808 to 9223372036854775807
string	an array of characters
timestamp	dates and times from -200000000-01-01 00:00:00 UTC to 200000000-12-31 23:59:59 UTC
uint64	integers from 0 to 18446744073709551615
uuid	128-bit value as defined by RFC 4122

This allowed us to map to many of the common types seen in MySQL databases.

SQL Type	Noms Type
BIGINT	int64
BIGINT UNSIGNED	uint64
BINARY
BIT
BLOB
CHAR	string
DATE	timestamp
DATETIME	timestamp
DECIMAL
DOUBLE	float64
ENUM
FLOAT	float64
GEOMETRY
GEOMETRYCOLLECTION
INT	int64
INT UNSIGNED	uint64
JSON
LINESTRING
LONGBLOB
LONGTEXT	string
MEDIUMBLOB
MEDIUMINT	int64
MEDIUMINT UNSIGNED	uint64
MEDIUMTEXT	string
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHAR
NVARCHAR
POINT
POLYGON
SET
SMALLINT	int64
SMALLINT UNSIGNED	int64
TEXT	string
TIME
TIMESTAMP	timestamp
TINYBLOB
TINYINT	int64
TINYINT UNSIGNED	uint64
TINYTEXT	string
VARBINARY
VARCHAR	string
YEAR
	bool
	inlineblob
	uuid

Of note, both bool and uuid do not have an equivalent in SQL, and thus are inaccessible as a backing Noms type when creating a table through SQL, since no types map to them. Although inlineblob would satisfy some of the binary/blob types, it would not work for LONGBLOB, so its integration with SQL types was postponed.

Enter go-mysql-server

As work on the SQL engine continued, we began running into the issue that our Noms types worked a bit differently than MySQL's types under some circumstances, and thus we worked to achieve complete parity with their types. Our goal was to be a fully drop-in replacement for MySQL, and that included all of the idiosyncrasies that may not be immediately apparent. This way, one could round-trip data between MySQL and Dolt without any surprises or loss of data. It would also allow anyone to gain the versioning capabilities of Dolt while keeping the workflow that has already been established with their MySQL integration.

We adopted a SQL engine built on top of our fork of the go-mysql-server project. In order to achieve full type compliance, most of the type code had to be rewritten. We referenced MySQL's documentation, however we found a lot of little interactions that were not documented. For example, assigning the string 1 to an integer column (BIGINT, etc.) would parse the string as though it contained an integer. However, assigning the string 1 to a BIT column would parse the string as though it were an array of utf8-encoded characters, which it would then convert to bytes, giving a value of 49 in this example. Considering how similarly BIT behaves relative to the other integer types, this difference was not assumed to have existed until it was manually found.

We have a test framework for testing correctness, however, the focus of the correctness tests is less about types and more on query execution and features. This meant that a lot of manual testing with a MySQL instance was used to better understand how each type worked, writing tests for the observed behavior to guarantee our implementation's parity.

Putting it all together in Dolt

Once the SQL engine handled types properly, it was on to the Dolt side. In Dolt, a schema’s type was just a reference to the Noms type, and the decision was made to give Dolt its own concept of types, and for those types to handle how SQL and Noms types relate to Dolt. This way, Dolt did not need to map SQL types directly to similar Noms types, and could freely convert between the two representations. This allowed for many SQL types, such as SET, to map to a completely different Noms type, such as uint64, as the Dolt type could handle the uint64 as a bit-array, and map each bit to a SET entry. This also had another benefit: any Dolt types that do not exist in SQL now have an interface through which SQL can interact with them, retaining compatibility with old Dolt databases. For example, uuid does not have a parallel in SQL, and thus appears as a CHAR(36) when accessed through SQL, however still retains all of the type checking necessary for that type. In the future, if we decide to extend our SQL types to support uuids, then it's as simple as changing the displayed SQL type, without changing anything in Noms. For the majority of SQL types, the Dolt type is simply a conversion from the SQL value to the closest Noms value that can uniquely represent that value, and vice-versa.

SQL Type	Dolt Type	Noms Type
BIGINT	number[i64]	int64
BIGINT UNSIGNED	number[u64]	uint64
BINARY	inlineblob[bin, length]	inlineblob
BIT	bit[depth]	uint64
BLOB	varbinary[65535]	blob
CHAR	varstring[char]	string
DATE	datetime[date]	timestamp
DATETIME	datetime[dt]	timestamp
DECIMAL	decimal[prec, scale]	decimal
DOUBLE	number[f64]	float64
ENUM	enum[entries]	int64
FLOAT	number[f32]	float64
GEOMETRY
GEOMETRYCOLLECTION
INT	number[i32]	int64
INT UNSIGNED	number[u32]	uint64
JSON
LINESTRING
LONGBLOB	varbinary[4294967295]	blob
LONGTEXT	varstring[text, 4294967295]	string
MEDIUMBLOB	varbinary[16777215]	blob
MEDIUMINT	number[i24]	int64
MEDIUMINT UNSIGNED	number[u24]	uint64
MEDIUMTEXT	varstring[text, enc_length]	string
MULTILINESTRING
MULTIPOINT
MULTIPOLYGON
NCHAR	varstring[char, length]	string
NVARCHAR	varstring[var, length]	string
POINT
POLYGON
SET	set[entries]	uint64
SMALLINT	number[i16]	int64
SMALLINT UNSIGNED	number[u16]	int64
TEXT	varstring[text, enc_length]	string
TIME	time	int64
TIMESTAMP	datetime[ts]	timestamp
TINYBLOB	varbinary[255]	blob
TINYINT	number[i8]	int64
TINYINT UNSIGNED	number[u8]	uint64
TINYTEXT	varstring[text, enc_length]	string
VARBINARY	inlineblob[var, length]	inlineblob
VARCHAR	varstring[var, length]	string
YEAR	year	int64

For a SMALLINT (int16), Dolt has a number type that converts it to an int64 for Noms to handle, and back to an int16 when retrieving the value, using the SQL engine to verify the value before storage. This way, a Dolt type does not need to worry about how to hash a value or perform any of the other operations, as the underlying Noms value already does that.

What's still missing?

This has allowed us to add the majority of types from MySQL, however there are still a few missing, such as: JSON, binary types, and GEOMETRY (including all subtypes). We also save character sets and collations on all relevant types, however they are currently ignored. We are working hard to bring you these missing types, so check our release notes on GitHub to stay updated! Better yet, if you need a new type, implement it and send us a pull request.

Conclusion

Types are an integral part of a database meant for sharing data. They communicate intention, while also enforcing homogeneity to the data. It is imperative that we get types right, and our journey through building Dolt has landed us with a type implementation that we feel is stable, extensible, and ready for the challenges that our users will demand.

If you enjoyed reading this blog, why not checkout Dolt, or browse our repositories here on DoltHub?

Blog