Decentralized Wikipedia Update


Dolt is a decentralized database. In the past ten years or so, decentralization has gone through a few hype cycles, and I think we're in a pro-decentralization hype period right now. Decentralization hype tracks the price of Bitcoin, and last I checked, that's near all-time highs.

Decentralization in the Dolt context means open data that looks like open source: forks, branches, and pull requests. In service of Dolt's decentralized mission, we think Wikipedia is a database that would really benefit from this model. We got Dolt to work with MediaWiki and started importing a Wikipedia dump back in April. The Wikipedia database is on DoltHub. Is the import complete? Not even close. So, I thought it was time for an update.

Dolt + Wikipedia

How We Got Here

We have a Discord user, Adam, who really thinks Wikipedia should be published in an easily consumable form. He tried to get a dump into Dolt himself and could not. A lot of my ego is wrapped up in Dolt, so I got mad and rage-fixed it. If it works with MySQL, it has to work with Dolt!

Adam Wikipedia

A few weeks later I had a working MediaWiki with a few hundred thousand Wikipedia pages imported. Here is the first material commit of Wikipedia pages:

commit j70r7kdfr89p2c9kcpk6builn1t0m3s3 
Author: timsehn <tim@dolthub.com>
Date:  Wed Mar 13 18:48:25 -0700 2024

        Progress to about 72,500
        

Fast forward seven months of import work, and we've learned a ton about how to run Dolt at scale. I have to leave my Mac laptop at work, chugging along on the import. I have a Windows machine at home. Frankly, it has been a nice break from rucking my laptop to and from work. Yes, I'm a walking commuter. We're at about 7.2M pages imported.

commit 0rk1egtiio1536kthp12f1tv8k5l5e9l (HEAD -> main) 
Author: timsehn <tim@dolthub.com>
Date:  Thu Dec 05 09:57:18 -0800 2024

        7,227,000 pages imported

I started this task thinking there were 6.7M pages in Wikipedia. But, much to my chagrin, an article is not a page. There are 6.7M articles in Wikipedia, and I only realized this after surpassing 6.7M pages imported. So how many pages do I actually have to import? It turns out there is a --dry-run flag that processes the import but doesn't run the SQL. The dry run took about four hours to complete.
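If, like me, you're driving the import with MediaWiki's maintenance scripts (an assumption on my part; use whatever importer handled your initial load), the dry run is a single command. The dump filename below is a placeholder.

# Sketch: parse the dump and count pages without writing any SQL.
# The script and dump path are assumptions; substitute your own tooling.
$ php maintenance/importDump.php --dry-run /path/to/enwiki-pages-articles.xml.bz2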

Wikipedia Dry Run Complete

23.5M pages. It looks like I have a few more months of importing to go. I'm importing about 40,000 pages per day.

Why not parallelize the work, you say? Well, the Wikipedia database uses full-text indexes. Dolt supports full-text indexes but can't merge them, so merges would have to rebuild the whole full-text index, which would take an unknown amount of time and memory. I've considered modifying the import to remove the full-text indexes altogether, but I'm kind of pot-committed to my current approach. Give me a couple more months to get bored.
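If you want to try the no-full-text-index route yourself, the schema change is small. Here's a minimal sketch against MediaWiki's searchindex table; the index names are an assumption, so check them with SHOW CREATE TABLE first, and keep in mind MediaWiki's built-in search won't work without them.

# Check the actual index names in your schema first.
$ dolt sql -q "SHOW CREATE TABLE searchindex"
# Assumed index names; adjust to whatever the statement above reports.
$ dolt sql -q "ALTER TABLE searchindex DROP INDEX si_title"
$ dolt sql -q "ALTER TABLE searchindex DROP INDEX si_text"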

Why It Matters

Wikipedia has been in the news lately with Elon Musk tweeting (errr...Xing...errr) at Jimmy Wales.

Elon Wikipedia

I think if you could create your own fork of Wikipedia easily, sync with the main version, and resolve conflicts as necessary, a few competing Wikipedias could emerge. Or, just having the threat of that outcome would hold the editors of Wikipedia more accountable and probably reduce any political bias that exists. Decentralized data governance for something as important as Wikipedia, the open repository of the world's knowledge, would be a massive improvement.

Fork-able, Sync-able Wikipedia

Imagine Wikipedia as a GitHub repository where each page is a file. That is going to be a very big Git repository, and you're going to have trouble serving and editing it at web scale. So, Wikipedia is backed by a database, MySQL. To have a Git-style Wikipedia, you need a Git-style database that supports branch and merge. That is Dolt, and it is MySQL-compatible. Dolt enables the fork-able, sync-able Wikipedia described above.
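The example below walks through this with the Dolt CLI and MediaWiki, but the same operations are available from any MySQL client via Dolt's stored procedures. A minimal sketch, with a made-up branch name:

$ mysql -h 127.0.0.1 -u root media_wiki
mysql> CALL DOLT_CHECKOUT('-b', 'my-edits');
mysql> -- ... edit pages through MediaWiki or plain SQL here ...
mysql> CALL DOLT_COMMIT('-a', '-m', 'My edits on a branch');
mysql> CALL DOLT_CHECKOUT('main');
mysql> CALL DOLT_MERGE('my-edits');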

Example

This example is from the original MediaWiki article but still holds true. It's a great example of the decentralized open data workflow that Dolt and DoltHub enable.

Set up a Local Branch to Edit

First, you will need to clone the Wikipedia database. That will take a while. Follow the steps in the original article to get MediaWiki running against that database. Then, you'll be all set up.
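For reference, the clone itself is a single command. This assumes the database lives at timsehn/media_wiki on DoltHub, the remote you'll see in the push output later in this article.

# Sketch: clone the Wikipedia database from DoltHub into ~/dolt/media_wiki.
$ cd ~/dolt
$ dolt clone timsehn/media_wiki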

We're going to make a new branch using the Dolt CLI. Navigate to the directory you cloned the Wikipedia database to, in our case ~/dolt/, and go into the directory called media_wiki. Then use the dolt branch command to create a branch, just like you would in Git.

$ pwd
/Users/timsehn/dolt/media_wiki
$ dolt branch local
$ dolt branch
  local
* main

Now, we have a new branch called local to connect to.

Point your MediaWiki at the Branch

Navigate to the root of your MediaWiki install and edit LocalSettings.php to point at your new branch.

$ cd /opt/homebrew/var/www/w/

In the database section of LocalSettings.php, you just add the branch name to the end of the database name, separated by a slash.

## Database settings
$wgDBtype = "mysql";
$wgDBserver = "127.0.0.1";
- $wgDBname = "media_wiki";
+ $wgDBname = "media_wiki/local";
$wgDBuser = "root";
$wgDBpassword = "";

Now, you are connecting to the new branch called local.
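MediaWiki needs a running Dolt SQL server to talk to, and to see branch activity in the logs like the snippet below, start the server with debug-level logging. A minimal sketch; ports, users, and other flags depend on your setup.

# Sketch: serve the database with debug logging so per-connection branch info shows up.
$ cd ~/dolt/media_wiki
$ dolt sql-server --loglevel debug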

I can confirm this in the debug logs of the running SQL server.

DEBU[0251] Starting query connectTime="2024-04-03 11:39:25.297596 -0700 PDT m=+9.767598584" connectionDb=media_wiki/local connectionID=1 query="..."

Make a Commit and Push

Now let's make a new page on our branch and then make a Pull Request on DoltHub.

Empty page

Click Create this page.

Create page

Create your article and you'll end up with something like this:

New Page

Stop the SQL server and check out the local branch.

$ dolt checkout local

You can see what you've created by examining the diff. Take a moment to marvel at how cool Dolt is. Unlike other databases, you can see exactly what you changed!

$ dolt status
On branch local

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
	modified:         watchlist
	modified:         slots
	modified:         module_deps
	modified:         recentchanges
	modified:         log_search
	modified:         content
	modified:         user
	modified:         job
	modified:         revision
	modified:         objectcache
	modified:         page
	modified:         comment
	modified:         searchindex
	modified:         logging
	modified:         text
$ dolt diff text
diff --dolt a/text b/text
--- a/text
+++ b/text
+---+---------+----------------------------------------------------+-----------+
|   | old_id  | old_text                                           | old_flags |
+---+---------+----------------------------------------------------+-----------+
| + | 1209461 | The world's first version controlled SQL database. | utf-8     |
+---+---------+----------------------------------------------------+-----------+
$ dolt diff page
diff --dolt a/page b/page
--- a/page
+++ b/page
+---+---------+----------------+---------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|   | page_id | page_namespace | page_title    | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---+---------+----------------+---------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| + | 1209460 | 0              | Dolt_Database | 0                | 1           | 0.651856670903 | 20240403191437 | 20240403191437     | 1211506     | 50       | wikitext           | NULL      |
+---+---------+----------------+---------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+

Now we'll make a Dolt commit on our branch so we can send the changes to DoltHub.

$ dolt commit -am "Added Dolt database page"
commit n8v7boeva91c198ip4h1uichrl0thssp (HEAD -> local)
Author: timsehn <tim@dolthub.com>
Date:  Wed Apr 03 11:59:44 -0700 2024

        Added Dolt database page

Finally, we push our changes to DoltHub.

$ dolt push origin local
/ Uploading...
To https://doltremoteapi.dolthub.com/timsehn/media_wiki
 * [new branch]          local -> local

Open a PR on DoltHub

Now, we want our changes reviewed and merged into the main copy of Wikipedia. Meanwhile, our local copy still has our new article, and its users can continue to enjoy our version. This is the beauty of decentralized collaboration. There can be multiple competing Wikipedias!

So, we open a Pull Request on DoltHub. Obviously, you could create your own coordination user interface using Dolt primitives. You don't need DoltHub.
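For instance, a maintainer who has the contributed local branch in their clone could review and merge it with nothing but the CLI. A minimal sketch of those primitives; review and conflict resolution are left to you.

# Sketch: merge the contributed branch into main and publish the result.
$ dolt checkout main
$ dolt merge local
$ dolt push origin main

But DoltHub gives you the review workflow for free, so back to the Pull Request form.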

New Pull Request

After submitting the form, I am greeted by the Pull Request page. I can send reviewers to this page to review and comment on my changes.

Pull Request

The reviewers can even review a diff.

Pull Request Diff

We're biased but we think this decentralized collaboration workflow has a lot of promise for data like Wikipedia. Can we get a decentralized encyclopedia with many competing versions? Dolt is here to help make that a reality.

Conclusion

So, it's going to take a few more months to get a full Wikipedia import into Dolt. You can follow the progress by watching the commit log on DoltHub. If you're interested in discussing this project, come by our Discord and let's chat.
