Using Dolt with Deepnote
Dolt is Git for data, a SQL database with Git-style versioning. DoltHub is a place on the internet to store and share Dolt databases. Python is the language of data science. As such, we created Doltpy, a Python interface to Dolt. We continue to iterate and improve Doltpy based on user feedback.
Recently, the folks at Deepnote reached out to us and encouraged us to try their notebook platform. Deepnote is Jupyter notebook compatible and offers hosted, real-time collaboration. The easiest way to describe it is the "Google Sheets" or "Google Docs" of Jupyter notebooks.
Given Doltpy, Dolt/Deepnote integration was fast and easy. Moreover, since Deepnote is a web hosted platform it makes for an ideal demo environment. If you like what you see, you can duplicate this notebook and use it as a template to start your very own Dolt data analysis project.
Setup
To use Dolt with Deepnote requires Dolt and Doltpy be installed. To get Dolt installed, you click the environment button on the left in Deepnote, click the Dockerfile link, and add the following text to the editable Dockerfile that appears.
RUN sudo curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash
Click the Build button in the top right corner. After it completes, click "Restart Machine" and Dolt will be installed on the Deepnote image you are using.
To install Doltpy, Deepnote recommends you add a requirements.txt to your project root. Ours contains a single doltpy
line, but you can add other Python packages if you need them.
That's it, you're now ready to access Dolt via you Python notebook.
Reading data from Dolt
You now have access to the full Dolt command line and Doltpy. This should be enough to do all manner of data analysis on your own datasets or public datasets on DoltHub.
Cloning
Usually the first step to use Dolt is to clone a copy of your data locally. This makes queries faster and it allows you to make edits without worrying about messing up other people's copies. These edits can be merged back into the master
branch later, just like in Git.
With Dolt and Doltpy installed, you have two options, you can clone a database using the Dolt command line or do the cloning from within Doltpy. For this example, we clone using the command line and then read a table into a Pandas Dataframe using Doltpy.
SQL
You have a full SQl database at your fingertips. So, if you don't want to do data manipulation in Python, you can do it in SQL. This example starts a Dolt MySQL compatible server and reads the results of a query into a Pandas Dataframe.
API
If you don't want to clone the data locally, DoltHub has a versioned SQL to JSON API for every database. The API supports branches and releases so you can pin your notebook to a specific data version. Never have data change out from under you again. This example reads from the tip of master but appending a /<hbranch>
to the end of the API reads from that branch. Cut a data release branch just like you would in code.
Writing data to Dolt
There's lots of data sources that you can read from but Dolt has the unique capability of providing safe and distributed writes. Modify the schema and the data to make your analysis easier. In this example, we add a column to a table and populate it for a couple rows. We even show off Dolt's diff functionality. The code and output is a little long for a screenshot so head to Deepnote and see for yourself.
Conclusion
As you can see, Deepnote, Dolt, and Doltpy work well together. We particularly like Deepnote for showing of Dolt's capabilities because it's a hosted service. Anyone can see and edit a notebook just like we see with products like Google Sheets. Give both a try today.