# Caching Gatsby builds with Docker in GitHub Actions
At DoltHub, we write a lot of blogs. Each week our team publishes three to four blogs on various topics. Some recent blogs include our ongoing effort to publish Hospital Price Data in a single, coherent database, the release of Dolt v0.75.0 which includes support for Spatial Indexes and ACID transactions, and Type Embedding in Golang.
To create a new blog post for our site, a team member writes a markdown file and checks it in to our blog source directory. The content in this directory is served by Gatsby and we deploy our Gatsby blog app inside a Docker container running on a Kubernetes host.
Luckily for our team, the hardest part of creating a new blog post is writing it. But the most annoying part of publishing one is waiting FOREVER for the thing to deploy.
Our team members can trigger a blog deployment to our development and production sites by commenting on the pull request containing their new post, like so:

```
#deploy [blog]
```
This comment triggers a GitHub Actions workflow that builds and deploys their changes. Now, although it's fast to trigger a deployment, it used to take up to 30 minutes to actually see the deployment live. Most of this time was spent running `gatsby build`.
This was a real pain point. Anytime team members had typos or suggested changes they needed to incorporate into their drafts, they'd have to wait 30 minutes for each iteration of changes to go out.
You can imagine how frustrating it would be to experience this three to four times a week, so we finally set out to do something about it. To speed up our blog deployments, we devised a way to take advantage of Gatsby's Incremental Builds using Docker in GitHub Actions.
## Gatsby's Incremental Builds
Gatsby released Incremental Builds in Gatsby V3 and it's great for decreasing build times. As stated on their site, "an incremental build is a build of Gatsby that generates only the subset of HTML files that needed to be updated based on the changes made to the site." This means that after an initial build, which will be the slowest, subsequent builds will be much faster because Gatsby will only need to build what's changed since the previous build.
This worked great for us locally. If we ran `gatsby build` on our local computer, Gatsby would generate a `.cache` directory containing its build cache, as well as the `public` directory of static files it serves when running. Then, if we made a change to the blog and ran `gatsby build` again, the command finished super fast, since the `.cache` and `public` directories had already been created and populated with most of the output the first time we ran `build`.
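To make this concrete, here's roughly what that local loop looks like (the post filename is made up for illustration):

```sh
# First build is cold and slow: Gatsby generates .cache/ and public/ from scratch.
gatsby build

# Tweak a post (hypothetical file), then rebuild.
echo "One more sentence." >> content/2023-02-17-caching-gatsby-builds.md

# Second build is warm and fast: Gatsby reuses .cache/ and public/ and only
# regenerates the pages affected by the change.
gatsby build
```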
But, whenever we ran `gatsby build` in the context of GitHub Actions during a deployment, we never benefited from incremental building. In that environment, Gatsby performed the super slow initial build every time.
The reason we never got incremental builds within GitHub Actions is because we use GitHub Hosted Runners for all of our GitHub Actions workflows. These runners are essentially fresh hosts and environments that come with an admittedly solid number of dependencies pre-installed for convenience, but no persistent storage that lives beyond a job run, let alone a workflow run.
To help with problems like this, GitHub Actions provides the `actions/cache` action for caching dependencies and build outputs—kinda exactly what we wanted for this project (or so we thought). Turns out our embedding of Gatsby within Docker made it a bit tricky to take advantage of this caching action, but we came up with a couple of bad ideas to try and do so anyway.
## How NOT to cache Gatsby builds
In the GitHub Actions context, the `gatsby build` command we run isn't actually executed in the runner's shell. Instead, it's defined in our blog's Dockerfile and executed during the `docker build` step. So, right off the bat, we needed to come up with a way for `actions/cache` to cache directories that only exist in the Docker context.
```dockerfile
FROM node:18.13.0 as builder
RUN mkdir build
WORKDIR build
COPY . .
WORKDIR packages/blog
RUN yarn
RUN yarn compile
# `yarn build` runs `gatsby build`
RUN NODE_OPTIONS="--max-old-space-size=6000" yarn build

FROM gatsbyjs/gatsby:d5fdc9e9
ARG BUILD_SCM_REVISION
RUN mkdir -p /pub/blog/
COPY --from=builder /build/packages/blog/public/ /pub/blog/
```
As you can see in the above definition, we `COPY` the blog source into the Docker build context and run `yarn build`, which just runs `gatsby build`. After building, Gatsby's `public` directory is copied to a generic Gatsby base image, which, when run, serves the `/pub/blog/` static directory.
This `COPY` command gave us our first idea for trying to use incremental building during deploys. The idea was to run `gatsby build` in a runner's shell before building the Dockerfile. Since the GitHub runner and the base Docker `builder` image are both Linux based, we thought (I thought) there might be a chance we could build the blog outside the Docker context, then just `COPY` the `.cache` and `public` directories over to the Docker context, and it might "just work." If this worked, we could then use `actions/cache` to cache `.cache` and `public` for our deployment workflow and voilà! We'd have a faster deployment time.
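In workflow terms, the plan looked roughly like this (a sketch, not our actual workflow steps):

```sh
# Bad idea #1: run the Gatsby build on the runner itself...
cd packages/blog
yarn && yarn compile
yarn build            # writes .cache/ and public/ to the runner's disk
cd ../..

# ...then docker build, with the Dockerfile COPYing the prebuilt .cache/
# and public/ into the image, hoping Gatsby would accept them as-is.
docker build -t dolthub-blog .
```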
But this did not work.
This first bad idea got us thinking a bit more. It seemed like what we really wanted was to create some empty directories on the runner, then mount those into our Docker context. Doing this would ensure `gatsby build` generated the correct output for the Docker environment and container file system while also persisting the outputs to the runner's disk. These mounted directories would then be accessible to `actions/cache`, and we'd be in business.
For those familiar with Docker, mounting directories into a container is pretty easy. You just do `docker run -v /local/path:/path/on/container ...`. The problem, though, if you haven't noticed already, is that volume mounting is actually a `docker run` option, not a `docker build` option.
As it turns out, there is no way to mount volumes at build time with Docker. Well, at least not with a first-class `-v` option like the one `docker run` supports.
Docker has a documentation page for optimizing builds, and it describes an option to use a specialized cache, called a "dedicated RUN cache," that provides caching between builds. From the documentation:

```dockerfile
RUN \
    --mount=type=cache,target=/var/cache/apt \
    apt-get update && apt-get install -y git
```
"Using the explicit cache with the --mount flag keeps the contents of the target directory preserved between builds. When this layer needs to be rebuilt, then it’ll use the apt cache in /var/cache/apt."
Reading the above information, we came up with our next bad idea. Largely based on the use of `--mount` in the above command, we assumed that with this syntax Docker was creating a volume backed by a host directory and attaching it to the container. This must be, we reasoned (I did), how Docker does the caching between builds. To use `actions/cache`, then, we'd just need to point the action at the directory Docker uses as the volume.
As it turns out, though, and as `docker volume inspect` revealed during experimentation, Docker doesn't create a volume when the dedicated RUN cache `--mount` syntax is used. Likely, Docker is just caching `/var/cache/apt` in another layer in its build cache.
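You can verify this yourself (the image tag here is illustrative):

```sh
# Build an image whose Dockerfile uses a dedicated RUN cache mount.
docker build -t run-cache-test .

# No named volume ever appears for the cache mount, so there's no host
# directory we could hand to actions/cache.
docker volume ls
```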
"What about just caching Docker's build cache with the actions/cache
then?" you might be wondering. Great question, but it actually doesn't get us any closer to taking advantage of Gatsby's Incremental Builds. Even if we cache the blog's Docker cache in GitHub Actions, every modification to the blog source would trigger a rebuild of the Docker layer that runs the gatsby build
command. So we'd still be generating Gatsby outputs for the first time, every time.
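The layer invalidation is easy to see (the post path is made up):

```sh
# Even with every layer cached from a previous build, touching any file
# invalidates the `COPY . .` layer...
echo "typo fix" >> packages/blog/content/2023-02-17-some-post.md

# ...so every layer after it reruns, including the slow `RUN yarn build`,
# which performs a cold gatsby build with no .cache/ to lean on.
docker build -t dolthub-blog .
```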
After experimenting with these bad ideas, and after lots of cursing, we (Aaron) had a pretty good idea. Looking at our blog Dockerfile again, we thought, "why don't we make the base Docker image of the blog contain the latest `gatsby build` outputs? Then, after a deployment, we can update the base image with the newer outputs."
If you're unfamiliar, we have a term for the infinite nesting of Docker containers here at DoltHub—it's called dockerception. And the cool part about our latest dockerception strategy is that even though it doesn't use the `actions/cache` action at all, it lets us achieve our goal and benefit from incremental builds during our blog deployments. Here's how we did it:
## What works
First, we created a Dockerfile for our base Gatsby cache image:

```dockerfile
FROM node:18.13.0 as builder
RUN mkdir build
WORKDIR build
COPY . .
WORKDIR packages/blog
RUN yarn
RUN yarn compile
# `yarn build` runs `gatsby build`
RUN NODE_OPTIONS="--max-old-space-size=6000" yarn build

FROM scratch
COPY --from=builder /build/packages/blog/.cache /cache
COPY --from=builder /build/packages/blog/public /public
```
Notice it looks pretty similar to our original blog Dockerfile. This image still runs `yarn build` (which is `gatsby build`), but instead of copying the output `public` directory to a base Gatsby Docker image, we copy both `public` and `.cache` to a `scratch` Docker image. Assuming we build this image from the latest blog content, which we do, the final image here will have all the Gatsby build outputs at `/cache` and `/public`.
Next, we push this image to an ECR repository as `dolthub-blog-cache:latest` so that we can refer to it in our original blog Dockerfile as the base image. Here is our updated blog Dockerfile:
```dockerfile
FROM dolthub-blog-cache:latest as cache

FROM node:18.13.0 as builder
RUN mkdir build
WORKDIR build
COPY . .
WORKDIR packages/blog
COPY --from=cache /cache /build/packages/blog/.cache
COPY --from=cache /public /build/packages/blog/public
RUN yarn
RUN yarn compile
# `yarn build` runs `gatsby build`
RUN NODE_OPTIONS="--max-old-space-size=6000" yarn build

FROM scratch as new_blog_cache
COPY --from=builder /build/packages/blog/.cache /cache
COPY --from=builder /build/packages/blog/public /public

FROM gatsbyjs/gatsby:d5fdc9e9
ARG BUILD_SCM_REVISION
RUN mkdir -p /pub/blog/
COPY --from=builder /build/packages/blog/public/ /pub/blog/
```
In this new version, before we run `yarn build` in the `builder` image, we `COPY` the two output folders from the base cache image to the place where Gatsby expects them. Then we run `yarn build`, which sees both `.cache` and `public`, and now we get Gatsby's Incremental Builds working. As a result, this step runs very quickly.
Just like in the original blog Dockerfile, we copy the `public` directory to the generic Gatsby image to be served at runtime, but we've added one additional step before that. In this updated Dockerfile, we define a new image called `new_blog_cache` that uses a `scratch` base. Into this image, we copy the now-updated outputs from the `builder` image that ran `gatsby build`.
This `new_blog_cache` image is then tagged and pushed to ECR as the new `dolthub-blog-cache:latest`, ensuring we always start our blog Docker builds with the most up-to-date Gatsby build outputs. It also ensures that subsequent blog builds are incremental 🤠.
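Here's a rough sketch of that deploy-time sequence (the registry URL is a placeholder, and the exact commands in our workflow may differ):

```sh
ECR=123456789012.dkr.ecr.us-west-2.amazonaws.com

# Build the final stage: the image that actually serves the blog.
docker build -t "$ECR/dolthub-blog:latest" .

# Build just the cache stage, which shares layers with the build above,
# and push it as the new cache base for the next deploy.
docker build --target new_blog_cache -t "$ECR/dolthub-blog-cache:latest" .

docker push "$ECR/dolthub-blog:latest"
docker push "$ECR/dolthub-blog-cache:latest"
```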
## Conclusion
After these changes, our blog deployment times dropped from 30 minutes to under 10 minutes! Doing this work has definitely been worth the effort and has been a huge improvement for our blog development cycle. In combination with ChatGPT, writing our three weekly blogs is practically effortless 😉. (Big shout out to Midjourney for the images used above.)
We love getting feedback, questions, and feature requests from our community so if there's anything you'd like to see added in DoltHub, DoltLab or one of our other products, please don't hesitate to reach out.
You can check out each of our different product offerings below to find which ones are right for you:
- Dolt—it's Git for data.
- DoltHub—it's GitHub for data.
- DoltLab—it's GitLab for data.
- Hosted Dolt—it's RDS for Dolt databases.