At DigitalOcean, we’ve used a “mono repo” called cthulhu to organize our Go code for nearly three years. A mono repo is a monolithic code repository which contains many different projects and libraries. Bryan Liles first wrote about cthulhu in early 2015, and I authored a follow-up post in late 2015.
A lot has changed over the past two years. As our organization has scaled, we have faced a variety of challenges while scaling cthulhu, including troubles with vendoring, CI build times, and code ownership. This post will cover the state of cthulhu as it is today, and dive into some of the benefits and challenges of using a mono repo for all of our Go code at DigitalOcean.
Our journey using Go with a mono repo began in late 2014. Since then, the repository has grown dramatically in many ways. As of October 6th, 2017, cthulhu has:
As the scale of the repository has grown over the past three years, it has introduced some significant tooling and organizational challenges.
Before we dive into some of these challenges, let’s discuss how cthulhu is structured today (some files and directories have been omitted for brevity):

    cthulhu
    ├── docode
    │   └── src
    │       └── do
    │           ├── doge
    │           ├── exp
    │           ├── services
    │           ├── teams
    │           ├── tools
    │           └── vendor
    └── script

`docode/` is the root of our `GOPATH`. Readers of our previous posts may notice that `third_party` no longer exists, and `do/` is now the prefix for all internal code.
## Code Structure
All Go code lives within our `GOPATH`, which starts at `cthulhu/docode`. Each directory within the `do/` folder has a unique purpose, although we have deprecated the use of `services/` and `tools/` for the majority of new work.
`doge/` stands for “DigitalOcean Go Environment”, our internal “standard library”. A fair amount of code has been added and removed from `doge/` over time, but it still remains home to a great deal of code shared across most DigitalOcean services. Some examples include our internal logging, metrics, and gRPC interaction packages.
`exp/` is used to store experimental code: projects which are in a work-in-progress state and may never reach production. Use of `exp/` has declined over time, but it remains a useful place to check in prototype code which may be useful in the future.
`services/` was once used as a root for all long-running services at DO. Over time, it became difficult to keep track of ownership of code within this directory, and it was replaced by the `teams/` directory.
`teams/` stores code owned by specific teams. As an example, a project called “hypervisor” owned by team “compute” would reside in `do/teams/compute/hypervisor`. This is currently the preferred method for organizing new projects, but it has its drawbacks as well. More on this later on.
`tools/` was once used to store short-lived programs used for various purposes. These days, it is mostly unused except for CI build tooling, internal static analysis tools, etc. The majority of team-specific code that once resided in `tools/` has been moved to `teams/`.
Finally, `vendor/` is used to store third-party code which is vendored into cthulhu and shared across all projects. We recently added the prefix `do/` to all of our Go code because existing Go vendoring solutions did not work well when `vendor/` lived at the root of the `GOPATH` (as was the case with our old `third_party/` approach).
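To make the layout concrete, here is a minimal sketch of how an import path under `do/teams/` maps to a directory inside the `GOPATH`. The helper names `teamImportPath` and `packageDir` are hypothetical and exist only for illustration; the paths are the ones used as examples above.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// teamImportPath builds the import path for a project owned by a team.
// Import paths always use forward slashes, so this uses package path.
// (Hypothetical helper, for illustration only.)
func teamImportPath(team, project string) string {
	return path.Join("do", "teams", team, project)
}

// packageDir maps an import path to its directory under a GOPATH root,
// using the operating system's path separator.
// (Hypothetical helper, for illustration only.)
func packageDir(gopath, importPath string) string {
	return filepath.Join(gopath, "src", filepath.FromSlash(importPath))
}

func main() {
	ip := teamImportPath("compute", "hypervisor")
	fmt.Println(ip)
	fmt.Println(packageDir("cthulhu/docode", ip))
}
```

With `docode/` as the `GOPATH` root, the “hypervisor” project is imported as `do/teams/compute/hypervisor` and lives under `docode/src/`.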
`script/` contains shell scripts which assist with our CI build process. These scripts perform tasks such as static analysis, code compilation and testing, and publishing newly built binaries.
One of the biggest advantages of using a mono repo is being able to make large, cross-cutting changes to the entire repository without fear of breaking any “downstream” repositories. However, as the amount of code within cthulhu has grown, our CI build times have grown dramatically as well.
Even though Go code builds rather quickly, in early 2016, CI builds took an average of 20 minutes to complete. This resulted in extremely slow development cycles. If a poorly written test caused a spurious failure elsewhere in the repo, the entire build could fail, causing a great deal of frustration for our developers.
After experiencing a great deal of pain because of slow and unreliable builds, one of our engineers, Justin Hines, set out to solve the problem once and for all. After a few hours of work, he authored a build tool called `gta`, which stands for “Go Test Auto”. `gta` inspects the git history to determine which files changed between master and a feature branch, and uses this information to determine which packages must be tested for a given build (including packages that import the changed package).
As an example, suppose a change is committed which modifies a package, `do/teams/example/droplet`. Suppose this package is imported by another package, `do/teams/example/hypervisor`. `gta` inspects the git history and determines that both of these packages must be tested, even though only the first package was changed.
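The selection step in this example can be sketched roughly as follows. This is a minimal illustration and not `gta`'s actual implementation: the function name `packagesToTest` and the `importers` map (a reverse import graph) are hypothetical, and a real tool would derive the changed packages from git and the import graph from the Go toolchain. The sketch also follows importers transitively, which is an assumption on our part.

```go
package main

import (
	"fmt"
	"sort"
)

// packagesToTest returns the changed packages plus every package that
// imports them, directly or transitively. importers maps a package's
// import path to the packages that import it (a reverse import graph).
func packagesToTest(changed []string, importers map[string][]string) []string {
	seen := make(map[string]bool)
	queue := append([]string(nil), changed...)

	// Breadth-first walk over the reverse import graph.
	for len(queue) > 0 {
		pkg := queue[0]
		queue = queue[1:]
		if seen[pkg] {
			continue
		}
		seen[pkg] = true
		queue = append(queue, importers[pkg]...)
	}

	// Sort the result so the output is stable between runs.
	out := make([]string, 0, len(seen))
	for pkg := range seen {
		out = append(out, pkg)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical reverse import graph matching the example above.
	importers := map[string][]string{
		"do/teams/example/droplet": {"do/teams/example/hypervisor"},
	}
	changed := []string{"do/teams/example/droplet"}

	for _, pkg := range packagesToTest(changed, importers) {
		fmt.Println(pkg)
	}
}
```

Running this lists both `do/teams/example/droplet` and `do/teams/example/hypervisor`, even though only the first package changed.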
For very large changes, it can occasionally be useful to test the entire repository, regardless of which files have actually changed. Adding “force-test” anywhere in a branch name disables the use of `gta` in CI builds, restoring the old default behavior of “build everything for every change”.
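A check like this could be as simple as a substring match on the branch name. The sketch below is an assumption about how such a switch might look, not our actual CI code; the function name `testEverything` and the branch names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// testEverything reports whether a CI build should skip change-based
// package selection and test the whole repository instead.
// (Hypothetical helper, for illustration only.)
func testEverything(branch string) bool {
	return strings.Contains(branch, "force-test")
}

func main() {
	// Hypothetical branch names.
	branches := []string{"mdl/force-test-vendor-bump", "mdl/fix-typo"}
	for _, branch := range branches {
		fmt.Printf("%s: test everything = %t\n", branch, testEverything(branch))
	}
}
```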
The introduction of `gta` into our CI build process dramatically reduced the amount of time taken by builds. An average build now takes approximately 2-3 minutes, down from the 20 minute builds of early 2016. This tool is used almost everywhere in our build pipeline, including static analysis checks, code compilation and testing, and artifact builds and deployment.
Every change committed to cthulhu is run through a bevy of static analysis checks, including tools such as `gofmt`, `go vet`, `golint`, and others. This ensures a high level of quality and consistency between all of our Go code. Some teams have even introduced additional tools such as `staticcheck` for code that resides within their `teams/` folder.
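As a rough idea of what one of these checks does, the sketch below tests whether a source file is already in canonical `gofmt` style using the standard library's `go/format` package. It is only an illustration: our CI runs the real tools, and the function name `isFormatted` is hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isFormatted reports whether src is already in canonical gofmt style.
// It returns an error if src is not syntactically valid Go.
// (Hypothetical helper, for illustration only.)
func isFormatted(src []byte) (bool, error) {
	out, err := format.Source(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, out), nil
}

func main() {
	sources := []string{
		"package demo\n\nvar X = 1\n", // canonical spacing
		"package demo\nvar   X=1\n",   // irregular spacing
	}
	for _, src := range sources {
		ok, err := isFormatted([]byte(src))
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%q formatted: %t\n", src, ok)
	}
}
```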
We have also experimented with the creation of custom linting tools that resolve common problems found in our Go code. One example is a tool called `buildlint` that checks for a blessed set of build tags, ensuring that tags such as `!race` (exclude this file from race detector builds) cannot be used.
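A check in the spirit of `buildlint` might look something like the sketch below. This is not the tool's actual code: the blessed set and the function name `unblessedTags` are hypothetical, and the sketch relies on the standard library's `go/build/constraint` package (Go 1.16 and later), which may differ from how `buildlint` itself parses build tags.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// blessed is a hypothetical allowlist of build tags.
var blessed = map[string]bool{
	"linux":       true,
	"darwin":      true,
	"integration": true,
}

// unblessedTags parses a single build constraint line and returns any
// tags it mentions that are not in the blessed set. Negated tags such
// as !race are reported by their tag name ("race").
func unblessedTags(line string) ([]string, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return nil, err
	}

	var bad []string
	var walk func(x constraint.Expr)
	walk = func(x constraint.Expr) {
		switch x := x.(type) {
		case *constraint.AndExpr:
			walk(x.X)
			walk(x.Y)
		case *constraint.OrExpr:
			walk(x.X)
			walk(x.Y)
		case *constraint.NotExpr:
			walk(x.X)
		case *constraint.TagExpr:
			if !blessed[x.Tag] {
				bad = append(bad, x.Tag)
			}
		}
	}
	walk(expr)
	return bad, nil
}

func main() {
	lines := []string{"//go:build linux && !race", "// +build integration"}
	for _, line := range lines {
		bad, err := unblessedTags(line)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%q: unblessed tags %v\n", line, bad)
	}
}
```

Walking the parsed expression, rather than matching strings, means a tag is caught whether it appears on its own, negated, or combined with other tags.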
Static analysis tools are incredibly valuable, but it can be tricky to introduce a new tool into the repository. Before we decided to run `golint` in CI, there were nearly 1,500 errors generated by the tool for the entirety of cthulhu. It took a concerted effort and several Friday afternoons to fix all of these errors, but it was well worth the effort. Our internal `godoc` instance now provides a vast amount of high quality documentation for every package that resides within cthulhu.
While there are many advantages to the mono repo approach, it can be challenging to maintain as well.
Though many different teams contribute to the repository, it can be difficult to establish overall ownership of the repository, its tooling, and its build pipelines. In the past, we tried several different approaches, but most were unsuccessful because customer-facing project work typically takes priority over internal tooling improvements. However, this has recently changed, and we now have engineers working specifically to improve cthulhu and our build pipelines, alongside regular project work. Time will tell if this approach suits our needs.
The issue of code vendoring remains unsolved, though we have made efforts to improve the situation. As of now, we use the tool “govendor” to manage our third-party dependencies. The tool works well on Linux, but many of our engineers who run macOS have reported daunting issues while running the tool locally. In some cases, the tool will run for a very long time before completion. In others, the tool will eventually fail and require deleting and re-importing a dependency to succeed. In the future, we’d also like to try out “dep”, the “official experiment” vendoring tool for the Go project. However, making use of dep would require Go vanity imports, which GitHub Enterprise does not support at this time.
As with most companies, our organizational structure has also evolved over time. Because most of our work happens in the `teams/` directory in cthulhu, our code structure is somewhat reliant on our organizational structure. When teams change, code in `teams/` can fall out of sync with the organization, leaving orphaned code or stale references to teams that no longer exist. We don’t have a concrete solution to this problem yet, but we are considering creating a discoverable “project directory service” of sorts so that our code structure need not be tied to our organizational structure.
Finally, as mentioned previously, scaling our CI build process has been a challenge over time. One problem in particular is that non-deterministic or “flappy” tests can cause spurious failures in unrelated areas of the repository. A test typically flaps when it relies on some assumption which cannot be guaranteed, such as timing or ordering of concurrent operations. This problem is compounded when interacting with a service such as MySQL in an integration test. For this reason, we encourage our engineers to do everything in their power to make their tests as deterministic as possible.
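As a small illustration of removing timing and ordering assumptions, the sketch below waits for concurrent work with a `sync.WaitGroup` instead of sleeping, and sorts the results so that assertions do not depend on goroutine scheduling. The function `squares` is a hypothetical example, not code from cthulhu.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// squares computes n*n for each input concurrently. Goroutines may
// finish in any order, so the results are sorted before returning.
// This gives callers and tests a stable result without sleeping or
// assuming a particular scheduling order.
func squares(inputs []int) []int {
	var (
		wg  sync.WaitGroup
		mu  sync.Mutex
		out []int
	)
	for _, n := range inputs {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			mu.Lock()
			out = append(out, n*n)
			mu.Unlock()
		}(n)
	}

	// Wait for completion explicitly instead of using time.Sleep.
	wg.Wait()
	sort.Ints(out)
	return out
}

func main() {
	fmt.Println(squares([]int{3, 1, 2}))
}
```

A test that compares this output against `[1 4 9]` passes regardless of which goroutine finishes first, whereas a test that slept for a fixed duration or compared unsorted results could flap under load.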
We’ve been using cthulhu for three years at DigitalOcean, and while we’ve faced some significant hurdles along the way, the mono repo approach has been a huge benefit to our organization as a whole. Over time, we’d like to continue sharing our knowledge and tools, so that others can reap the benefits of a mono repo just as we have.
*Matt Layher is a software engineer on the Network Services team, and a regular contributor to a wide variety of open source networking applications and libraries written in Go. You can find Matt on Twitter and GitHub.*