Doltpy 1.0.0
Background
Dolt is a SQL database that stores data in a commit graph and offers a Git-like interface for managing it. Dolt provides a command-line interface (CLI) for database-level concerns, such as how and where to start a server instance, and SQL as the query interface. In general, features that do not pertain to processes and the filesystem are surfaced in SQL. Through these two interfaces users can clone, push, pull, branch, merge, and query Dolt databases.
In addition to the CLI and SQL, we wanted to provide an API in a widely adopted scripting language. Python is the most widely adopted language in technical communities centered on data; for example, it is heavily used in data science and machine learning. We decided to expose Dolt in Python via Doltpy, at first simply to test it out ourselves on production workflows. After using it for some time, we decided to take it out of the experimental phase and architect it as a third "complete interface" for interacting with Dolt, to complement the CLI and SQL.
The product of the decision to offer a Python API was Doltpy.
Doltpy
Our initial implementation of Doltpy arose from trying to get some results that a team member had assembled in a Pandas DataFrame into Dolt. It was experimental, and we more or less implemented functions wrapping the Dolt CLI as we needed them. This meant we had spotty coverage of our CLI functionality, and the names of the functions had "drifted" from the CLI, leaving us with a "similar but different" API. We were surfacing much of the same functionality, but with a new, slightly different set of names. For example, dolt add was surfaced as add_table_to_next_commit, and dolt log was get_commits. The names made sense as standalone functions, but when brought together as an API they were confusing.
Since we were implementing these functions as we went, this lack of naming discipline wasn't really an issue: the author and the user were usually the same person. But for someone who had interacted with Dolt via the CLI and decided to script some of their workflows, it necessitated digging through the code to discern both which function to use and whether it behaved as expected. That is the user we have in mind for Doltpy, so our ad-hoc "similar but different" implementation really did not meet the needs of the user we were targeting.
We decided to address this problem head-on by bringing Doltpy into "compliance" with the command line, and separating out utility functions that are not part of the CLI into their own packages grouped by concern. We hoped to achieve the best of both worlds: provide total familiarity, down to function and argument names, to the CLI user, while at the same time providing some higher level tooling for common operations such as importing data into Dolt.
doltpy.core
Most of the changes center around the core
package. In the past we covered a subset of the API with functions that were written primarily to solve a problem we faced while using Dolt. For example, getting a list of commits or the status of a repo:
def get_commits(self) -> Mapping[str, DoltCommitSummary]:
...
def get_dirty_tables(self) -> Tuple[Mapping[str, bool], Mapping[str, bool]]:
...
Making the transition from CLI to Doltpy as intuitive and seamless as possible for the user motivated us to reimplement such functions as follows:
def status(self) -> DoltStatus:
...
def log(self, number: int = None, commit: str = None) -> OrderedDict:
...
Our design goal for the Dolt CLI was to make it familiar to anyone with experience using Git. Our design goal for Doltpy is to make it familiar to anyone with experience using Dolt. We want this interface to be accessible via users' existing knowledge of the tools.
doltpy.core.write
Prior to this release Doltpy's Dolt
object had two main write paths:
def import_df(self,
table_name: str,
data: pd.DataFrame,
primary_keys: List[str],
import_mode: str = None):
...
def bulk_import(self,
table_name: str,
data: io.StringIO,
primary_keys: List[str],
import_mode: str = None):
...
We wrote these functions because there were concrete use-cases where we used Dolt. They were useful primarily in experimental scenarios, but broke down in production due to well-documented issues with Pandas. The specific issue for our use case is that Pandas has some peculiar quirks in its type system that can make inferring the schema of a table from a Pandas DataFrame a non-starter, especially if the table has NULL values.
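To make the quirk concrete, here is a small sketch (assuming nothing beyond Pandas itself) of the kind of silent type coercion that undermines schema inference: a single missing value turns an integer column into floats.

```python
import pandas as pd

# An integer column with one missing value: Pandas has no NULL for its
# default int64 dtype, so it silently promotes the whole column to float64.
df = pd.DataFrame({'id': [1, 2, None], 'name': ['a', 'b', 'c']})

print(df['id'].dtype)  # float64, even though the source data was integers
```

Inferring a schema from this DataFrame would declare id a float column, which is not what the author of the data intended.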
Since Dolt can run a MySQL-compatible server, we realized we could take advantage of the investment in MySQL connectors by building interfaces that let users write Python data structures directly into Dolt. So, as part of breaking write functionality into its own package, we added new write paths for lists and dictionaries. The list import expects a list of dictionaries, each with the same set of keys. The dictionary interface treats the keys as column names, and requires that each key map to a list of uniform length. Here is what the function signatures look like:
def import_list(repo: Dolt,
table_name: str,
data: List[Mapping[str, Any]],
primary_keys: List[str] = None,
import_mode: str = None):
...
def import_dict(repo: Dolt,
table_name: str,
data: Mapping[str, List[Any]],
primary_keys: List[str] = None,
import_mode: str = None):
...
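The shape requirements these functions impose can be sketched in plain Python. The helpers below are illustrative only, not part of Doltpy (the names are ours): one checks that a list of dictionaries shares a single key set, the other that a dictionary of lists has columns of uniform length.

```python
from typing import Any, List, Mapping

def validate_list_shape(rows: List[Mapping[str, Any]]) -> bool:
    """Every row (dictionary) must have the same set of keys."""
    if not rows:
        return False
    expected_keys = set(rows[0].keys())
    return all(set(row.keys()) == expected_keys for row in rows)

def validate_dict_shape(columns: Mapping[str, List[Any]]) -> bool:
    """Every column (list value) must have the same length."""
    if not columns:
        return False
    lengths = {len(values) for values in columns.values()}
    return len(lengths) == 1

print(validate_list_shape([{'id': 1, 'name': 'Anna'}, {'id': 2, 'name': 'Vronksy'}]))  # True
print(validate_dict_shape({'id': [1, 2], 'name': ['Anna']}))  # False: ragged columns
```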
Let's look at how we might use these in practice:
from doltpy.core import Dolt
from doltpy.core.write import import_dict, import_list
from doltpy.core.read import read_table
import pandas as pd
DICT_OF_LISTS = {
'name': ['Anna', 'Vronksy', 'Oblonksy'],
'adjective': ['tragic', 'honorable', 'buffoon'],
'id': [1, 2, 3]
}
repo = Dolt('path/to/repo')
import_dict(repo, 'characters', DICT_OF_LISTS, ['id'], 'create')
LIST_OF_DICTS = [
{'name': 'Anna', 'adjective': 'tragic', 'id': 1},
{'name': 'Vronksy', 'adjective': 'honorable', 'id': 2},
{'name': 'Oblonksy', 'adjective': 'buffoon', 'id': 3},
]
repo = Dolt('path/to/repo')
import_list(repo, 'characters', LIST_OF_DICTS, ['id'], 'create')
Note that we passed 'create' as the import_mode parameter. This means that our Python code will examine the data structures and, assuming they are shaped correctly, infer a table schema, create that table, and import the data. For now, "shaped correctly" means that dictionaries with list values have lists of identical length, and that lists of dictionaries representing rows each have the same keys. Also, a column of all NULL values will break schema inference, since there is no value from which to infer a type. Type inference is currently limited to strings, floats, integers, and datetimes. In future releases we will make type inference more robust and add an option to not require every row in a list of dictionaries to have the same set of keys.
To update an existing table using these tools, simply pass update as the import_mode.
doltpy.core.read
We offered read_table and pandas_read_sql as ways to pull data out of a Dolt repository via the Dolt class. As part of our effort to make Dolt essentially a Python representation of the CLI, we have broken those read paths out into a utility package, though they function virtually identically:
def read_table(repo: Dolt, table_name: str, delimiter: str = ',') -> pd.DataFrame:
...
def pandas_read_sql(repo: Dolt, query: str, connection: connector.connection) -> pd.DataFrame:
...
In a future release we will introduce some new interfaces, including ones that use Python data structures, rather than Pandas.
Conclusion
In designing Doltpy we set out to offer all of Dolt, both CLI and SQL, plus useful write utilities, in a popular higher-level scripting language. We think the Doltpy we have set forth in 1.0.0 makes Dolt an attractive tool for technical communities that have already adopted Python and are looking for a robust solution for data versioning, management, and distribution.