How Dolt Types Work
UPDATED FEBRUARY 10, 2021: Updated the final table with the types that have been added to Dolt since the article was first written.
When we started on Dolt, our goal was to apply Git's idea of versioning to data. Whereas Git versions files, Dolt versions tables. As the creator of a database, you have the choice of how to represent any value. We chose to adopt the idea of having a schema for each of our tables, meaning that any one column has a definition of the type of data that it represents, and all cells under that column adhere to that type.
As an example, let's take a look at the value 1
, and let's store it in a column we'll refer to as val
.
val |
---|
1 |
If someone were to retrieve the value, what would it represent? Would it represent a count? Would it instead represent a true/false value? Perhaps it represents a character encoded in the ASCII format to denote the start of a message header? It is ambiguous. This could be fixed by using a more descriptive column name.
Number Of Books |
---|
1 |
IV |
1101 |
This now highlights another problem, in that we can add any values that we want, so we inserted the values IV
(which is 4
in Roman numerals) and 1101
(which is 13
in binary). Each of these are considered valid values in the column by its name alone. By having a type, the database can verify that the value given conforms to the type, converting if need be, or rejecting if the value is not understood to map to the given type.
Number Of Books int32 |
---|
1 |
4 |
13 |
This type information is even more handy in a database built for sharing like Dolt. Have you ever tried importing a CSV with N/A in a numerical column to represent NULL? We have. It's more work than it should be. Moreover, the type information of a column communicates the schema creator’s intent. It's like documentation that is built-in.
First there was Noms
We initially built Dolt on top of Noms. Although we have made significant changes along the way, their implementation of types was our starting point. In Noms, each value is stored on disk as an array of bytes, and those bytes are prefixed with an integer.
Kind Kind
↓ Data→ ↓ Data→
-------------------------------------------------
| 1 | | | | | | 3 | | | | | |
-------------------------------------------------
This integer represents the kind
, which is Noms' way of stating the type that those bytes represent, and also how to interpret those bytes. Noms has a general-purpose type system, but for our purpose of collaborating on tabled data, we started using the three scalar types:
Type | Description |
---|---|
bool | represents either true or false |
number | IEEE-754 double precision number |
string | an array of characters |
Each type in Noms has a few properties, such as defining equality, sortability, how to encode and decode from a byte array, and also how to hash the value. When we began working on Dolt, we added our own types into the mix, while also renaming number
to float64
.
Type | Description |
---|---|
bool | represents either true or false |
float64 | IEEE-754 double precision number |
int64 | integers from -9223372036854775808 to 9223372036854775807 |
string | an array of characters |
uint64 | integers from 0 to 18446744073709551615 |
uuid | 128-bit value as defined by RFC 4122 |
And then there was SQL
This was sufficient for a few months, until we began to add SQL support. SQL allows one to use the data in many exciting different ways, and also works as a standard and well-known interface to tabled data. We specifically chose the MySQL variant to target first due to its popularity, and with it came a lot more types that we did not support.
SQL Type | SQL Aliases | Description |
---|---|---|
BIGINT | integers from -9223372036854775808 to 9223372036854775807 | |
BIGINT UNSIGNED | integers from 0 to 18446744073709551615 | |
BINARY | an array of bytes, up to a maximum of 255 | |
BIT | depending on bit depth, holds integers from 0 to 18446744073709551615 | |
BLOB | an array of bytes, up to a maximum of 65535 | |
CHAR | CHARACTER | an array of character, up to a maximum of 255 |
DATE | dates from 1000-01-01 to 9999-12-31 | |
DATETIME | dates and times from 1000-01-01 00:00:00 to 9999-12-31 23:59:59 | |
DECIMAL | FIXED, DEC, NUMERIC | number with a specifiable number of digits, both total and fractional |
DOUBLE | DOUBLE PRECISION, REAL | IEEE-754 double precision number |
ENUM | definable collection of allowed strings, with a maximum of 65535 entries | |
FLOAT | IEEE-754 single precision number | |
GEOMETRY | a supertype that can hold a point, linestring, or polygon | |
GEOMETRYCOLLECTION | a collection of any geometry-type object | |
INT | INTEGER | integers from -2147483648 to 2147483647 |
INT UNSIGNED | INTEGER UNSIGNED | integers from 0 to 4294967295 |
JSON | represents the JavaScript Object Notation format for a string | |
LINESTRING | a collection of points forming a line | |
LONGBLOB | an array of bytes, up to a maximum of 4294967295 | |
LONGTEXT | an array of characters, up to a maximum of 4294967295 | |
MEDIUMBLOB | an array of bytes, up to a maximum of 16777215 | |
MEDIUMINT | integers from -8388608 to 8388607 | |
MEDIUMINT UNSIGNED | integers from 0 to 16777215 | |
MEDIUMTEXT | LONG, LONG VARCHAR | an array of characters, up to a maximum of 4194303 (for utf8 encoding) |
MULTILINESTRING | a collection of linestrings | |
MULTIPOINT | a collection of points | |
MULTIPOLYGON | a collection of polygons | |
NCHAR | NATIONAL CHAR | shortcut for appending CHARACTER SET utf8mb3 to CHAR |
NVARCHAR | NATIONAL CHARACTER VARYING | shortcut for appending CHARACTER SET utf8mb3 to VARCHAR |
POINT | represents a point in space | |
POLYGON | a collection of points forming a shape | |
SET | definable collection of allowed strings, with a max of 64 entries, and allows combinations | |
SMALLINT | integers from -32768 to 32767 | |
SMALLINT UNSIGNED | integers from 0 to 65535 | |
TEXT | an array of characters, up to a maximum of 16383 (for utf8 encoding) | |
TIME | times from -838:59:59 to 838:59:59 | |
TIMESTAMP | dates and times from 1970-01-01 00:00:01 to 2038-01-19 03:14:07 | |
TINYBLOB | an array of bytes, up to a maximum of 255 | |
TINYINT | BOOLEAN, BOOL | integers from -128 to 127 |
TINYINT UNSIGNED | integers from 0 to 255 | |
TINYTEXT | an array of characters, up to a maximum of 63 (for utf8 encoding) | |
VARBINARY | an array of bytes, up to a maximum of 65535 | |
VARCHAR | CHARACTER VARYING | an array of characters, up to a maximum of 16383 (for utf8 encoding) |
YEAR | integers from 1901 to 2155, along with 0 |
Our first solution was to map any types that are more restrictive to their less restrictive variants, such as a TINYINT
(int8
) to a BIGINT
(int64
), and store them as our closest Noms-equivalent type. Although the user lost a bit of type information, it would still allow them to export their data from MySQL and import it into Dolt. This solution did not work for other types, such as DATETIME
, as there was no clear analog in Dolt, and thus we had to add two more types to our Noms-based backend: timestamp
and inlineblob
.
Type | Description |
---|---|
bool | represents either true or false |
float64 | IEEE-754 double precision number |
inlineblob | an array of bytes, up to a maximum of 65535 bytes |
int64 | integers from -9223372036854775808 to 9223372036854775807 |
string | an array of characters |
timestamp | dates and times from -200000000-01-01 00:00:00 UTC to 200000000-12-31 23:59:59 UTC |
uint64 | integers from 0 to 18446744073709551615 |
uuid | 128-bit value as defined by RFC 4122 |
This allowed us to map to many of the common types seen in MySQL databases.
SQL Type | Noms Type |
---|---|
BIGINT | int64 |
BIGINT UNSIGNED | uint64 |
BINARY | |
BIT | |
BLOB | |
CHAR | string |
DATE | timestamp |
DATETIME | timestamp |
DECIMAL | |
DOUBLE | float64 |
ENUM | |
FLOAT | float64 |
GEOMETRY | |
GEOMETRYCOLLECTION | |
INT | int64 |
INT UNSIGNED | uint64 |
JSON | |
LINESTRING | |
LONGBLOB | |
LONGTEXT | string |
MEDIUMBLOB | |
MEDIUMINT | int64 |
MEDIUMINT UNSIGNED | uint64 |
MEDIUMTEXT | string |
MULTILINESTRING | |
MULTIPOINT | |
MULTIPOLYGON | |
NCHAR | |
NVARCHAR | |
POINT | |
POLYGON | |
SET | |
SMALLINT | int64 |
SMALLINT UNSIGNED | int64 |
TEXT | string |
TIME | |
TIMESTAMP | timestamp |
TINYBLOB | |
TINYINT | int64 |
TINYINT UNSIGNED | uint64 |
TINYTEXT | string |
VARBINARY | |
VARCHAR | string |
YEAR | |
bool | |
inlineblob | |
uuid |
Of note, both bool
and uuid
do not have an equivalent in SQL, and thus are inaccessible as a backing Noms type when creating a table through SQL, since no types map to them. Although inlineblob
would satisfy some of the binary
/blob
types, it would not work for LONGBLOB
, so its integration with SQL types was postponed.
Enter go-mysql-server
As work on the SQL engine continued, we began running into the issue that our Noms types worked a bit differently than MySQL's types under some circumstances, and thus we worked to achieve complete parity with their types. Our goal was to be a fully drop-in replacement for MySQL, and that included all of the idiosyncrasies that may not be immediately apparent. This way, one could round-trip data between MySQL and Dolt without any surprises or loss of data. It would also allow anyone to gain the versioning capabilities of Dolt while keeping the workflow that has already been established with their MySQL integration.
We adopted a SQL engine built on top of our fork of the go-mysql-server project. In order to achieve full type compliance, most of the type code had to be rewritten. We referenced MySQL's documentation, however we found a lot of little interactions that were not documented. For example, assigning the string 1
to an integer column (BIGINT
, etc.) would parse the string as though it contained an integer. However, assigning the string 1
to a BIT
column would parse the string as though it were an array of utf8-encoded characters, which it would then convert to bytes, giving a value of 49
in this example. Considering how similarly BIT
behaves relative to the other integer types, this difference was not assumed to have existed until it was manually found.
We have a test framework for testing correctness, however, the focus of the correctness tests is less about types and more on query execution and features. This meant that a lot of manual testing with a MySQL instance was used to better understand how each type worked, writing tests for the observed behavior to guarantee our implementation's parity.
Putting it all together in Dolt
Once the SQL engine handled types properly, it was on to the Dolt side. In Dolt, a schema’s type was just a reference to the Noms type, and the decision was made to give Dolt its own concept of types, and for those types to handle how SQL and Noms types relate to Dolt. This way, Dolt did not need to map SQL types directly to similar Noms types, and could freely convert between the two representations. This allowed for many SQL types, such as SET
, to map to a completely different Noms type, such as uint64
, as the Dolt type could handle the uint64
as a bit-array, and map each bit to a SET
entry. This also had another benefit: any Dolt types that do not exist in SQL now have an interface through which SQL can interact with them, retaining compatibility with old Dolt databases. For example, uuid
does not have a parallel in SQL, and thus appears as a CHAR(36)
when accessed through SQL, however still retains all of the type checking necessary for that type. In the future, if we decide to extend our SQL types to support uuids
, then it's as simple as changing the displayed SQL type, without changing anything in Noms. For the majority of SQL types, the Dolt type is simply a conversion from the SQL value to the closest Noms value that can uniquely represent that value, and vice-versa.
SQL Type | Dolt Type | Noms Type |
---|---|---|
BIGINT | number[i64] | int64 |
BIGINT UNSIGNED | number[u64] | uint64 |
BINARY | inlineblob[bin, length] | inlineblob |
BIT | bit[depth] | uint64 |
BLOB | varbinary[65535] | blob |
CHAR | varstring[char] | string |
DATE | datetime[date] | timestamp |
DATETIME | datetime[dt] | timestamp |
DECIMAL | decimal[prec, scale] | decimal |
DOUBLE | number[f64] | float64 |
ENUM | enum[entries] | int64 |
FLOAT | number[f32] | float64 |
GEOMETRY | ||
GEOMETRYCOLLECTION | ||
INT | number[i32] | int64 |
INT UNSIGNED | number[u32] | uint64 |
JSON | ||
LINESTRING | ||
LONGBLOB | varbinary[4294967295] | blob |
LONGTEXT | varstring[text, 4294967295] | string |
MEDIUMBLOB | varbinary[16777215] | blob |
MEDIUMINT | number[i24] | int64 |
MEDIUMINT UNSIGNED | number[u24] | uint64 |
MEDIUMTEXT | varstring[text, enc_length] | string |
MULTILINESTRING | ||
MULTIPOINT | ||
MULTIPOLYGON | ||
NCHAR | varstring[char, length] | string |
NVARCHAR | varstring[var, length] | string |
POINT | ||
POLYGON | ||
SET | set[entries] | uint64 |
SMALLINT | number[i16] | int64 |
SMALLINT UNSIGNED | number[u16] | int64 |
TEXT | varstring[text, enc_length] | string |
TIME | time | int64 |
TIMESTAMP | datetime[ts] | timestamp |
TINYBLOB | varbinary[255] | blob |
TINYINT | number[i8] | int64 |
TINYINT UNSIGNED | number[u8] | uint64 |
TINYTEXT | varstring[text, enc_length] | string |
VARBINARY | inlineblob[var, length] | inlineblob |
VARCHAR | varstring[var, length] | string |
YEAR | year | int64 |
For a SMALLINT
(int16
), Dolt has a number
type that converts it to an int64
for Noms to handle, and back to an int16
when retrieving the value, using the SQL engine to verify the value before storage. This way, a Dolt type does not need to worry about how to hash a value or perform any of the other operations, as the underlying Noms value already does that.
What's still missing?
This has allowed us to add the majority of types from MySQL, however there are still a few missing, such as: JSON
, binary types, and GEOMETRY
(including all subtypes). We also save character sets and collations on all relevant types, however they are currently ignored. We are working hard to bring you these missing types, so check our release notes on GitHub to stay updated! Better yet, if you need a new type, implement it and send us a pull request.
Conclusion
Types are an integral part of a database meant for sharing data. They communicate intention, while also enforcing homogeneity to the data. It is imperative that we get types right, and our journey through building Dolt has landed us with a type implementation that we feel is stable, extensible, and ready for the challenges that our users will demand.
If you enjoyed reading this blog, why not checkout Dolt, or browse our repositories here on DoltHub?