Open Crop and Soil Database

We originally built Dolt and DoltHub for data sharing. We still love that use case. Our most popular open databases are US stock market databases published by post-no-preference. Recently, a user published a new database and was wondering if we could feature it to our community. Obviously, we said yes.

OurSci just published a cool open database of crop and soil samples on DoltHub. OurSci partnered with the Bionutrient Institute to collect the data between 2020 and 2023. The database is published free and open on DoltHub. I interviewed the contributor, Octavio Duarte, to get more information about the database. This article provides more information about the database in hopes of garnering some interest in the data from the DoltHub community.

What is this database?

The database contains scientifically useful information about 3136 crop and soil samples. The README on DoltHub has a pretty good explanation of what information each sample contains.

Each entry contains information about a crop sample, the soil it grew over and the farm practices that its source farm employed. Information in each row is typically associated with a crop sample, two soil samples (0-10cm and 10-20cm depth) and metadata provided by farmers or data collectors.

The database was collected between 2020 and 2023.

Using the Database

The crop and soil database is available on DoltHub as a public database. The database is only 2.65 MB, so you can run most queries on DoltHub to explore the data. For instance, here are all the wheat samples in the database. The database is also small enough that joins do not time out on DoltHub. Here are all the wheat samples grown in Iowa, which requires joining the crop_data and metadata tables. That one is for you, Andy.

If you want a copy of the database locally, you can clone it using the following command in a terminal after you've installed Dolt:

$ dolt clone our-sci/crop_and_soil_dataset

After that you can run SQL queries on the database in a terminal using the following commands:

$ cd crop_and_soil_dataset
$ dolt sql -q "select * from crop_data where sample_type='wheat'"
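
If you would rather pull the results into a script than read them in a terminal, one option is to shell out to Dolt and parse its CSV output. The following is a minimal sketch, not an official client: it assumes Dolt is installed and on your PATH, that you cloned into the default crop_and_soil_dataset directory, and it relies on the `-r csv` result-format option of `dolt sql`. The helper names (`build_dolt_command`, `parse_csv_output`, `run_query`) are made up for this example.

```python
import csv
import io
import subprocess
from typing import Dict, List


def build_dolt_command(query: str) -> List[str]:
    """Build the argument list for a one-off Dolt query that prints CSV."""
    return ["dolt", "sql", "-r", "csv", "-q", query]


def parse_csv_output(text: str) -> List[Dict[str, str]]:
    """Turn CSV text (header row first) into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def run_query(query: str, repo_dir: str = "crop_and_soil_dataset") -> List[Dict[str, str]]:
    """Run a query inside the cloned database directory and return its rows.

    Assumes the clone lives in repo_dir; raises CalledProcessError if Dolt
    exits with an error (for example, on a bad query).
    """
    result = subprocess.run(
        build_dolt_command(query),
        cwd=repo_dir,          # dolt sql must run inside the database directory
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_csv_output(result.stdout)
```

From the directory that contains the clone, calling `run_query("select * from crop_data where sample_type='wheat'")` should return one dictionary per wheat sample, keyed by column name. All values come back as strings, so convert numeric columns yourself before doing any analysis.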

How Can You Use the Data?

Octavio published a tutorial on how to build predictive models using the data in R. This tutorial shows you how to create a reflectometer model, which I must say is out of my depth. I just build databases.

Why Was the Data Published?

Octavio was interested in learning Dolt because he thinks its capabilities will be perfect for some of his current and future projects.

I wanted to learn how to work with Dolt and decided publishing this dataset was an interesting enough first task. My main interest in Dolt is to make work around data more auditable and repeatable. We do a lot of consulting work for our clients around data and this tool will add a lot of value to that work. I believe there may even be clients interested just in the work of translating CSV based data into organic and well schematized Dolt databases.

DoltHub's user-friendly interfaces were also a draw for Octavio's team, even though he is comfortable with terminal interfaces himself.

I feel comfortable interacting with all my tools using text interfaces, but many colleagues prefer to work using more visual tools. For me, DoltHub is the tool that will allow me to share data that I've stored in Dolt with people without needing to even explain them in depth what Dolt is. I believe it has most of what's required to achieve this, between the spreadsheet views, views for commits, documentation, etc.

Moreover, Octavio had already built some primitive data versioning and testing capabilities but found Dolt's offering more mature.

While doing modelling, analysis, and summarization work it was pretty frequent for me and some other researchers to need to agree on conditions and definitions about the data. On many occasions, there were misunderstandings, errors, rules that got forgotten, mismatches, etc. I ended up creating my own rather primitive data version control, plus data tests. I learned about Dolt while looking for a good way to get this done in a consistent and full-feature manner and that's a core problem I want to avoid by leveraging Dolt, which so far seems to be the perfect tool for that, especially being able to show stored queries, tests, etc in DoltHub to all involved users, not only the coders.

Conclusion

The new crop and soil database published by OurSci is a great addition to the open data collection on DoltHub. Are you publishing open data you would like featured in a blog article? Come by our Discord and let us know. I'd be happy to feature your database in an article.
