Static Analysis at GitHub - ACM Queue

An experience report

Timothy Clem and Patrick Thomson

GitHub, a code-hosting website built atop the Git version-control system, hosts hundreds of millions of repositories of code uploaded by more than 65 million developers. The Semantic Code team at GitHub builds and operates a suite of technologies that power symbolic code navigation on github.com. Symbolic code navigation lets developers click on a named identifier in source code to navigate to the definition of that entity, as well as the reverse: given an identifier, they can list all the uses of that identifier within the project.

This system is backed by a cloud object-storage service, having migrated from a multi-terabyte sharded relational database, and serves more than 40,000 requests per minute, across both read and write operations. The static-analysis stage itself is built on an open-source parsing toolkit called Tree-sitter, implements some well-known computer science research, and integrates with the github.com infrastructure in order to extract name-binding information from source code.

The system supports nine popular programming languages across six million repositories. Scaling even the most trivial of program analyses to this level entailed significant engineering effort, which is recounted here in the hope that it will serve as a useful guide for those scaling static analysis to large and rapidly changing codebases.

Motivation: Seeing the Forest for the (Parse) Trees

Navigating code is a fundamental part of reading, writing, and understanding programs. Unix tools such as grep(1) allow developers to search for patterns of text, but programmers' needs are larger in scope. What they're most interested in is how the pieces of a program stitch together: given a function, where is it invoked, and where is it defined? Quick, high-quality answers to these queries allow a programmer to build up a mental model of a program's structure; that, in turn, allows effective modification or troubleshooting. Tools such as grep that are restricted to text matching and have no knowledge of program structure often provide too little or too much information.
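The gap between text matching and structural knowledge can be illustrated with a small sketch. It uses Python's standard ast module purely as a stand-in for a real parser (the system described in this article is built on Tree-sitter, not on ast), and the sample program and the helper names text_matches and symbol_occurrences are hypothetical, chosen only for illustration.

```python
import ast
import re

# A tiny sample program; the name "parse" appears as a definition,
# a call, a comment, and part of a string literal.
SOURCE = '''\
def parse(text):
    return text.split()

def main():
    # parse the arguments
    words = parse("a b c")
    print("parse complete", words)
'''


def text_matches(source, name):
    """Return 1-based line numbers whose text contains `name` (grep-style)."""
    pattern = re.compile(re.escape(name))
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def symbol_occurrences(source, name):
    """Return (definition_lines, call_lines) for `name` using the syntax tree."""
    definitions, calls = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            definitions.append(node.lineno)
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == name
        ):
            calls.append(node.lineno)
    return sorted(definitions), sorted(calls)


if __name__ == "__main__":
    print("text matches:", text_matches(SOURCE, "parse"))
    print("definitions, calls:", symbol_occurrences(SOURCE, "parse"))
```

The text search reports four lines, including the comment and the string literal, and cannot say which is the definition. The tree-based search reports one definition and one call, which is the kind of answer a developer navigating code is actually after.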

Fluent code navigation is also an invaluable tool for researching bugs. The stack trace in an error-reporting system starts a journey of trying to understand the state of the program that caused that error; navigating code symbolically eases the burden of understanding code in context. As such, most IDEs (integrated development environments) have extensive support for code navigation and other such static analyses that ease the user's burden.

The Semantic Code team wanted to bring this IDE-style symbolic code navigation to the web on github.com. The team was inspired by single-purpose sites such as source.dot.net, Mozilla, and Chromium Code Search that provide comprehensive in-browser code navigation. The question was how to do that at scale: GitHub serves more than 65 million developers contributing to over 200 million repositories across 370 programming languages. In the last 12 months alone, there were 2.2 billion contributions on github.com. That's a lot of code and a lot of changes.

Philosophy: To Tree or Not to Tree

The Semantic Code team's approach to implementing code navigation centers around the following core ideas.

1. Zero configuration

The end user doesn't have to do any setup or configuration to take advantage of code navigation, beyond pushing code to GitHub. There are no settings, customizations, or opt-in features: if a repository's language is supported, code navigation should just work. This is critical for this particular use case, since anyone viewing a source-code file on GitHub in a supported language expects navigation to be available. If every open-source project had to do even a little extra work to configure its repo or set up a build to publish this information, the experience of browsing code on GitHub would vary dramatically from project to project, and the time between push and being able to use code navigation might depend on slow and complex build processes.

It's not sufficient to require that developers clone and spin up their own IDEs (or wait for an in-browser IDE such as GitHub Codespaces to load); developers should be able to read and browse code quickly without having to download that code and its associated tooling. For this feature to scale and serve all of GitHub, it has to be available everywhere and in every project. The goal is for developers to focus on their programs and the problems they are trying to solve, not on configuring GitHub to work properly with their projects or convincing another project owner to get the settings right.

2. Incrementality

For each change pushed to a repository, the back-end processing should have to do work only on the files that changed. This is different from instituting a continuous-integration workflow, in which a user might specifically want a fresh environment for repeatable builds. It also means that results can be available more quickly after a push: on the order of seconds, not minutes. Waiting an entire build cycle for code-navigation data to show up isn't tenable for the desired user experience; developers expect the navigation feature to keep pace with their changes.
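As a minimal sketch of the incremental idea, assume each pushed revision can be summarized as a mapping from file path to a hash of that file's contents. Comparing two such mappings yields exactly the files that need fresh analysis and the files whose stale results should be dropped. The function names and the snapshot representation here are hypothetical; this is not a description of how GitHub's back end actually tracks changes.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies a file's contents."""
    return hashlib.sha256(data).hexdigest()


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Summarize a revision as a mapping of path -> content hash."""
    return {path: content_hash(data) for path, data in files.items()}


def plan_incremental_work(
    previous: dict[str, str], current: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare two snapshots and return (paths_to_analyze, paths_to_remove).

    A path is analyzed if it is new or its contents changed; a path is
    removed if it no longer exists. Unchanged files require no work.
    """
    to_analyze = sorted(
        path for path, digest in current.items() if previous.get(path) != digest
    )
    to_remove = sorted(path for path in previous if path not in current)
    return to_analyze, to_remove


if __name__ == "__main__":
    before = snapshot({"a.py": b"x = 1", "b.py": b"y = 2", "c.py": b"z = 3"})
    after = snapshot({"a.py": b"x = 1", "b.py": b"y = 20", "d.py": b"w = 4"})
    print(plan_incremental_work(before, after))
```

In this example only b.py (modified) and d.py (added) would be re-analyzed, and results for c.py (deleted) would be discarded; a.py is untouched, so the cost of a push scales with the size of the change rather than the size of the repository.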

3. Language agnosticism

The same back-end processing code should be run and operated regardless of the language under analysis. Consequently, the team decided not to run language-specific tooling, such as the Roslyn project for C# or Jedi for Python, as that would require operating a different technology stack for each language (and sometimes for each version of a language).

Though this means that the language grammars may accept a superset of a given language, this approach makes it possible to scale and to deliver results much faster. The infrastructure runs a single code stack; there's no management of multiple containers and associated resource costs, no cold-start time for bringing up tooling, and no attempt to detect a project's structure, configuration, or target language version.