Let's Open-Source Web3 SQL Data
Dolt is Git for data, the world's only SQL database that you can branch, merge, clone, fork, push and pull just like a git repository.
After studying the Web3 data ecosystem here, we've become fascinated with blockchain data and how to analyze it. For example, given access to a Bitcoin node, how can we easily compute which address has the most Bitcoin? Or given an Ethereum full node, how can we determine the number of ENS addresses there are? A blockchain's rigid data model of blocks and transactions makes these questions difficult to answer than other query-able databases.
A wide variety of solutions to this problem often leverage relational models. Web3 in-house engineering teams write custom ETL to get blockchain data into Snowflake, Redshift or BigQuery instances. Application developers who want to quickly query on-chain data use The Graph's GraphQL endpoint. Business analysts studying protocol economics and application ecosystems use Dune and Flipside to run SQL queries which leverage in-house data warehouses.
Clearly, the SQL model is still useful in the era of Web3. Unfortunately, this SQL data is now under the wraps of centralized data warehouse services like Snowflake and Redshift. Could there be an easier way to quickly get relational blockchain data in a decentralized way? What if anyone could clone a SQL database of Bitcoin or Ethereum? Could anyone publish or clone a database of ENS transactions according to their smart contract spec? We think Dolt can be the best tool to quickly get access to Web3 databases using SQL for free. In this blog post we'll cover the existing landscape for Web3 SQL data, our free Bitcoin database, and our future data publishing Web3 work.
How can I get Blockchain data in SQL
The most popular way to query blockchain data in SQL is with Dune. You can find a short overview video here. What's great about Dune is that it has datasets that are specific to applications built on top of Ethereum. For example, on Dune we can query how many unique ENS addresses there are by using their ENS dataset. Another way to get blockchain data in SQL is through Flipside crypto. Flipside ingests large amounts of blockchain data into Snowflake data warehouses and parses smart contracts events accordingly. Here's a starter video on how to use Flipside. Finally, BigQuery has a public dataset. that contains several blockchains.
You can also run your own custom ETL job (which is what we did) to get blockchain data into a SQL Database. Blockchains are backed by nodes that store the current blockchain state and validate new transactions. At DoltHub, we are running Bitcoin and Ethereum nodes that each come with HTTP JSON-RPC servers for querying. For example, I can run the following Bitcoin cli command to get a sense of the data this RPC exposes.
bitcoin-cli getblock 00000000000000000006bb859028579663b9e594801b17821a1a9e161bdc8c36 1
{
"hash": "00000000000000000006bb859028579663b9e594801b17821a1a9e161bdc8c36",
"confirmations": 11,
"height": 738730,
"version": 551550980,
"versionHex": "20e00004",
"merkleroot": "1aca299c0ac6407a26e84f1eae6afefef24afc1dada40a618d1072741b04928e",
"time": 1654004841,
"mediantime": 1654003278,
"nonce": 2635736616,
"bits": "17096a20",
"difficulty": 29897409688833.63,
"chainwork": "00000000000000000000000000000000000000002edb51c45d309a5e99c834ad",
"nTx": 696,
"previousblockhash": "000000000000000000028a1f980f9980b0b54a115c9eac1a618b8b8cf068dbbb",
"nextblockhash": "000000000000000000019f833bfd8ba375b9df93593d77332486ea0748004f4e",
"strippedsize": 187154,
"size": 383761,
"weight": 945223,
"tx": [
...
]
}
You can query your own node or an API provider to get access to block and transaction data. You can then write a service that sends requests to get JSON data and convert that into SQL.
How we got Bitcoin into Dolt
We wanted to either create open-source ETL work or leverage an existing open-source product for data ingestion. We heavily relied on the Blockchain ETL team's BitcoinETL project as a core part of our ETL process. Below is a brief description of our process to ingest Bitcoin into dolt.
We started by running a Bitcoin node in a 2TB EC2 instance. We then created a systemd
service which uses BitcoinETL
to stream each block of the blockchain to an S3 Bucket. Finally, we wrote a program that every 20 minutes downloads the
latest blocks stored in S3 and imports them into Dolt.
To be frank, this process was hacky and not up to data-engineering best practices. We're still in a prototyping
phase with the best way to get blockchain data into Dolt. In the future, we want to move our Bitcoin Node as well
as our ETL jobs into our Kubernetes infrastructure where there is far better reliability and reproducibility.
Bitcoin in SQL
To better understand our Bitcoin database, let's review Bitcoin's UTXO transactional model.
In the above example, Alice wants to send 1BTC to Bob. Her wallet is associated with two bitcoin UTXO units, 0.5 BTC and 0.7 BTC. These form the input of a transaction. The outputs of the transaction are the 1BTC that Bob receives and the leftover 0.2 BTC that Alice receives. Essentially, the Bitcoin blockchain represents a system of credits and debits where each transaction has a set of unspent transaction inputs and a set of unspent transaction outputs (UTXO).
Now let's go ahead and explore our actual SQL Tables and see what's in the database. Let's first start by cloning the Bitcoin database
> dolt clone web3/bitcoin
> dolt sql
bitcoin> show tables;
+--------------------------------+
| Tables_in_bitcoin |
+--------------------------------+
| blocks |
| transaction_inputs |
| transaction_outputs |
| transactions |
+--------------------------------+
bitcoin>
We have 4 tables. The blocks
table refers to each individual block on the Blockchain. The transactions
table refers
to individual transactions conducted between two parties. The transaction_inputs
table references the input UTXOs per
transaction that combines to the Bitcoin amount to be sent in a transaction. Finally, the transaction_outputs
table
represents the output UTXOs that go to a recipient or any change UTXO sent back to the user.
With the Bitcoin Blockchain in SQL, we can now do all sorts of analysis.
- Which addresses have the most Bitcoin?
Let's start by asking which Bitcoin address has the most Bitcoin. To do that we need to combine our transaction_inputs
and transaction_outputs
table to get a sense of each wallet and the remaining UTXO bitcoin it has. If we think of each
transaction's input as a debit and each output as a credit on an address we can sum all address associated inputs
and outputs to get an address' balance. Credit to Allen Day
for the perfect SQL query:
WITH double_entry_book AS (
SELECT
JSON_EXTRACT(addresses,
'$[0]') AS address,
inputs.type,
- inputs.value * 1e - 8 AS value
FROM
`transaction_inputs` AS inputs
UNION ALL
SELECT
JSON_EXTRACT(addresses,
'$[0]') AS address,
outputs.type,
outputs.value * 1e - 8 AS value
FROM
`transaction_outputs` AS outputs
)
SELECT
address,
type,
sum(value) AS balance
FROM
double_entry_book
GROUP BY
1,
2
ORDER BY
balance DESC
LIMIT 3
Against a fully updated database, we get the following result. According to bitinfocharts these wallets are likely cold wallets used by cryptocurrency exchanges.
+----------------------------------------------------------------+-----------------------+----------------+
| address | type | balance |
+----------------------------------------------------------------+-----------------------+----------------+
| 34xp4vRoCGJym3xR7yCVPFHoCNxv4Twseo | scripthash | 25259723615516 |
| bc1qgdjqv0av3q56jvd82tkdjpy7gdp9ut8tlqmgrpmv24sq90ecnvqqjwvw97 | witness_v0_scripthash | 16800998577449 |
| 1P5ZEDWTKTFGxQjZphgWPQUpe554WKDfHQ | pubkeyhash | 12787354435112 |
+----------------------------------------------------------------+-----------------------+----------------+
- What happened to the Bitcoin pizza
One of the most unfortunate incidents to happen in the Bitcoin world was in May 2010, when popular BitcoinTalk community member laszlo decided to spend 10,000 Bitcoins for some pizza. Let's see if we can track this transaction down.
SELECT
transactions.hash,
transaction_outputs.value,
from_unixtime(transactions.block_timestamp)
FROM
transactions
INNER JOIN transaction_outputs ON transactions.hash = transaction_outputs.txid
WHERE
transaction_outputs.value = 1e12
ORDER BY
transactions.block_timestamp;
+------------------------------------------------------------------+---------------+---------------------------------------------+
| hash | value | from_unixtime(transactions.block_timestamp) |
...
| a1075db55d416d3ca199f55b6084e2115b9345e16c5cf302fc80e9d5fbf5d48d | 1000000000000 | 2010-05-22 14:16:31 -0400 EDT |
If we look at the bitcoin talk post we'll see that laszlo confirmed
sent the transaction at 07:17:26 PM
. The closest transaction to that time is a1075db55d416d3ca199f55b6084e2115b9345e16c5cf302fc80e9d5fbf5d48d
which you can get a good look at here.
- Where can I find the wallet that holds the hacked Mt.Gox BTC?
For those new to Bitcoin lore, Mt.Gox was the largest Bitcoin exchange in the early 2010s. In 2011, Mt. Gox was hacked with 79956 Bitcoins sent to a mystery wallet.
We can find this transaction with the following query.
select * from transaction_outputs where value = 79956 * 1e8\G
+------------------------------------------------------------------+----------------------------------------+---------------+
| txid | addresses | value |
+------------------------------------------------------------------+----------------------------------------+---------------+
| e67a0550848b7932d7796aeea16ab0e48a5cfe81c4e8cca2c5b03e0416850114 | ["1FeexV6bAHb8ybZjqQMjJrcCrHGW9sb6uF"] | 7995600000000 |
+------------------------------------------------------------------+----------------------------------------+---------------+
This transaction output is associated with transaction e67a0550848b7932d7796aeea16ab0e48a5cfe81c4e8cca2c5b03e0416850114
and the infamous wallet 1FeexV6bAHb8ybZjqQMjJrcCrHGW9sb6uF
which has recently come into the spotlight again with massive
lawsuits.
Dolt + Web3
We don't think open-source blockchain data should sit in some company's Snowflake instance or behind a public centralized API. We think it should be in a decentralized, open database anyone can copy, modify, and query. Dolt is one of many decentralized databases. Dolt supports the MySQL dialect and is built on the Git model of decentralization. We can publish the data for anyone to consume on DoltHub, Dolt's version of GitHub. Anyone can back their application on a running Dolt MySQL server. In the future, we are eager to move into the Ethereum ecosystem and create ETL jobs that track specific smart contract events to get more granular d-application data, like what you see on Dune. Swing by our Discord if you want to chat more about Dolt or Web3.