Pondering a Monorepo Version Control System

Monorepos have desirable features but git is the wrong version control system (VCS) to realize one. Here I document how the right one would look like.

Why Monorepo?

The idea of a monorepo is to put "everything" into a single version control system. In contrast, to multirepo where developers regularly use multiple repositories. I don't have a more precise definition.

Google uses a monorepo. In 2016 Google had a 86TB repository and eventually developed a custom version control system for that. They report 30,000 commits per day and up to 800,000 file read queries per seconds. Microsofts has a 300GB git repository which requires significant changes and extensions to normal git. Facebook is doing something similar with Mercurial.

An implication of this monorepo approach is that it usually contains packages for multiple programming languages. So you cannot rely on language-specific tools like pip or maven. Bazel seems to be a fitting and advanced build system for this which comes from the fact that it stems from Googles internal "Blaze". For more tooling considerations, there is an Awesome Monorepo list.

Often, people try to store all dependencies in the monorepo as well. This might include tools like compilers and IDEs but probably not the operating system. The goal is reproducability and external dependencies should vanish.

If you want to read more discussions on monorepos, read Advantages of monorepos by Dan Luu and browse All You Always Wanted to Know About Monorepo But Were Afraid to Ask.

Most of the arguments for and against monorepos are strawman rants in my opinion. A monorepo does not guarantee a "single lint, build, test and release process" for example. You can have chaos in a monorepo and you can have order with multirepos. This is a question of process and not of repository structure.

There is only one advantage: In a monorepo you can do an atomic change across everything. This is what enables you to change the API and update all users in a single commit. With multirepo you have an inherent race condition and eventually there will be special tooling around this fact. However, this "atomic change across everything" also requires special tooling eventually. Google invests heavily in into the clang ecosystem for this reason. Nothing is for free.

That said, let's assume for now you we want to go for a monorepo.

Why not git?

If you talk about version control systems these days, people usually think about git. Everybody wants to use it despite gits well-known flaws and its UI which does not even try to hide implementation details. In the context of monorepos, the relevant flaw is that git scales poorly beyond some gigabytes of data.

Git needs LFS or Annex to deal with large binary files.

Plenty of git operations (e.g. git status) check every file in the checkout. Walking the whole checkout is not feasible for big monorepos.

To lessen some pain, git can download only a limited part of the history (shallow clone) and show only parts of the repository (sparse checkout). Still, there is no way around the fact that you need a full copy of the current HEAD on your disk. This is not feasible for big monorepos either.

As an alternative, I would suggest Subversion. The Apache Foundation has a big subversion repository which contains among others OpenOffice, Hadoop, Maven, httpd, Couchdb, Zookeeper, Tomcat, Xerces. Of course, Subversion has flaws itself and its development is very slow these days. For example, merging is painful. However, Google does not have branches in its monorepo. Instead, they use feature toggles. Maybe branches are conceptually wrong for big monorepos and we should not even try?

I considered creating a new VCS because I see an open niche there which none of the Open Source VCSs can fill. One central insight is that a client will probably never look at every single file, so we must avoid any need to download a full copy. Working on the repository will be more like working on a network file system. Thus we lose the ability to work offline as a tradeoff.

Integrated build system

I already wrote about merging version control and build system. In the discussions on that, I learned about proprietary solutions like Rational ClearCase and Vesta SCM. Here is the summary of these thoughts:

We already committed to store the repo on the network. The build system also wants to store artifacts on the network, so why not store them in the repo as well? It implies that the VCS knows which files are generated.

We also committed to put all tooling in the repo for the sake of reproducability. Thus, the build system can be simple. There is no need for a configure step because there is no relevant environment outside of the repo.

Now consider the fact that no single machine might be able to generate all the artifacts. Some might require a Windows or Linux machine. Some might require special hardware. However, the artifact generation is reproducible so it does not matter which client does it. We might as well integrate continuous integration (CI) jobs into the system.

Imagine this: You edit the code and commit a change. You already ran unit tests locally, so these artifacts are already there. During the next minutes you see more artifacts pop up which depended on your changes. These artifacts might be a Windows and a Linux and an OS X build. All these artifacts are naturally part of the version you committed, so if you switch to a different branch, the artifacts change automatically. There is no explicit back and forth with a CI system. Instead, the version-control-and-build-system just fills in the missing generated files for a new committed version.

To implement this, we need a notification mechanism in our version control system. We still need special clients which are continously waiting for new commits and generate artifacts. The VCS must manage these clients and propagate artifacts as needed. Certainly, this is very much beyond the usual job of a VCS.

More Design Aspects

Since we want to store "everything" in the repo, we also want non-technical people to use it. It already resembles a network file system, so it should provide an interface nearly as easy to use. We want to enable designers to store Photoshop files in there. Managers should store Excel and Powerpoint files in there. I say "nearly" because we need additional concepts like versions and a commit mechanism which cannot be hidden from users.

The wide range of users and security concerns require an access control mechanism with fine granularity (per file/directory). Since big monorepos exist in big organizations it naturally must be integrated into the surrounding system (LDAP, Exchange, etc).

Monorepo is not for Open Source

The VCS described above sounds great for big companies and many big projects. However, the Open Source world consists of small independent projects which are more loosely coupled. This loose coupling provides a certain resilience and diversity. While a monorepo allows you atomically change something everywhere, it also forces you to do it to some degree. Looser coupling means more flexibility on when you update dependencies, for example. The tradeoff is the inevitable chaos and you wonder if we really need so many build systems and scripting languages.

Open Source projects usually start with a single developer and no long-term goals. For that use case git is perfect. Maybe certain big projects may benefit. For example, would Gnome adopt this? Even there, it seems the partition with multiple git repos works well enough.

Why not build a Monorepo VCS?

OSS projects have no need for it. Small companies are fine with git, because their repositories are small enough. Essentially, this is a product for enterprises.

It makes no sense to build it as Open Source software in my spare time. It would be a worthwhile startup idea, but I have a family and no need to take the risk right now, and my current job is interesting enough. It is not a small project that can be done on the side at work. This is why I will not build it in the foreseeable feature although it is a fascinating technical challenge. Maybe someone else can, so I documented my thoughts here and can hopefully focus on other stuff now. Maybe Google is already building it.

Some discussion on lobste.rs.

The alternative for big companies is package management in my opinion. I looked around and ZeroInstall seems to be the only useable cross-platform language-independent package manager. Of course, it cares only about distributing artifacts. Its distributed approach provides a lot of flexibility on how you generate the artifacts which can be an advantage.

Also, Amazon's build system provides valuable insights for manyrepo environments.