AlphaStar[0] and OpenAI Five[1] have demonstrated that deep learning models can be trained to infer strategies capable of defeating humans, exhibiting long-term planning in very complex action spaces: two crucial elements required for automated reasoning. These techniques have been enabled by large-scale reinforcement learning training of models with millions of parameters. Such training scale was only recently made possible by a sustained decrease in the cost of compute. Concurrently, over the last decade, theorem provers have accumulated libraries large enough to serve as rich learning environments for deep learning models (HOL Light with the Kepler conjecture; Coq with the Feit-Thompson theorem). Recent advances in formal program verification also enable the generation of very large datasets of proof obligations that can serve the same purpose.

Based on the above observations, deep learning and formal methods are likely on a collision course whose potential outcome is a substantial increase in our automated reasoning capabilities. This realization has motivated us to explore these two domains together for the past six months, and we continue to do so to enable powerful new applications.

There are two levels of abstraction at which one can plug in a deep learning model to automate reasoning in formal systems: the tactic level (the user-facing interface of theorem provers) and the kernel level (the low-level logic rewrites generated by theorem provers). The former is closer to human intuition and smaller in size (as a dataset) but harder to model (some tactics are quite complex). The latter is larger, easier to model (elementary logical rewrites are simple), but also more "mechanical".
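To make the contrast concrete, here is a minimal Python sketch of two of HOL Light's primitive inference rules, REFL and TRANS. The plain-string term representation is a deliberate oversimplification; real HOL Light terms are typed lambda-calculus expressions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thm:
    """Stands for the equational theorem |- lhs = rhs."""
    lhs: str
    rhs: str

def REFL(t: str) -> Thm:
    # REFL: for any term t, derive |- t = t
    return Thm(t, t)

def TRANS(ab: Thm, bc: Thm) -> Thm:
    # TRANS: from |- a = b and |- b = c, derive |- a = c
    if ab.rhs != bc.lhs:
        raise ValueError("middle terms do not match")
    return Thm(ab.lhs, bc.rhs)

# A kernel-level proof is just a chain of such elementary, checkable rewrites.
already_proven = Thm("x + 0", "x")           # stands in for a demonstrated theorem
step = TRANS(REFL("x + 0"), already_proven)  # |- x + 0 = x
print(step)
```

At the tactic level, by contrast, a single call such as HOL Light's ARITH_TAC can stand for a long sequence of these primitive steps.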

While both are potentially interesting, we decided to focus first on the kernel approach, seeing tactics as the Haar features of automated reasoning: designed by and for humans on top of a more fundamental underlying logical fabric. Our first task was to build a dataset by "tracing" the HOL Light theorem prover (which has one of the most extensive libraries of mathematical theorems available as well as a very simple underlying logic system). The dataset itself is succinctly described in [2], and the resulting code was merged[3] into HOL Light by its creator, John Harrison.

This raw dataset requires some post-processing to make it suitable for deep learning. In the raw trace, everything is a theorem, since everything is a valid logical rewrite from the axioms. By introducing "interesting" theorems (the ones humans set out to demonstrate in HOL Light) as well as the concept of premises (so that each theorem is not re-demonstrated from scratch), we were able to generate a post-processed dataset in which each theorem's proof trace has a size amenable to machine learning. The resulting dataset (or a subset of it) can be visualized using a web interface[4] we designed to help us build intuition about the structure of these prooftraces.
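To give a feel for the structure, a post-processed entry can be pictured as follows. The field names and layout are our illustration here, not the actual serialization described in [2]:

```python
# Illustrative shape of one post-processed prooftrace entry (hypothetical
# field names; the released dataset's actual format differs in its details).
prooftrace = {
    "theorem": "|- !x. x + 0 = x",            # an "interesting" human-stated theorem
    "premises": [1492, 2188],                 # indices of previously proven theorems
    "steps": [                                # kernel-level rewrites, in order
        {"rule": "REFL",  "args": ["x + 0"]},
        {"rule": "TRANS", "args": ["step_0", "premise_1492"]},
        # ... kept to a size amenable to machine learning
    ],
}
```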

The next steps of our kernel plan of attack consist of the following:

(i) Train an agent to reconstruct proof steps from premises. The agent should generally be capable of reconstructing ~1k proof steps (there's a good chance we won't need more than that; see (iv)). To train such agents we've reimplemented the HOL Light kernel in Python to provide an efficient reinforcement learning environment (see the sketch below). We've poured ~0.5 PFLOPS-days into a medium-sized LSTM (512 x 2), with TreeLSTMs to embed mathematical terms, and reached ~16 proof steps. This is a very encouraging milestone, as the curves don't seem to be leveling off: more compute, bigger models, more steps.
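Below is a toy sketch of what that environment looks like from the agent's perspective. The API and the single TRANS-style action are illustrative stand-ins; the real environment exposes the full set of kernel rules:

```python
class Thm:
    """Toy theorem |- lhs = rhs, with plain strings standing in for terms."""
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
    def __eq__(self, other):
        return (self.lhs, self.rhs) == (other.lhs, other.rhs)

class ProofEnv:
    """Illustrative RL environment wrapping a (toy) kernel."""
    def __init__(self, premises, target):
        self.premises, self.target = premises, target
    def reset(self):
        self.proven = list(self.premises)
        return self.proven
    def step(self, action):
        i, j = action                        # indices of two proven theorems
        a, b = self.proven[i], self.proven[j]
        if a.rhs != b.lhs:                   # the kernel rejects invalid rewrites
            return self.proven, -0.1, False
        thm = Thm(a.lhs, b.rhs)              # TRANS: |- a.lhs = b.rhs
        self.proven.append(thm)
        done = thm == self.target
        return self.proven, float(done), done

env = ProofEnv([Thm("a", "b"), Thm("b", "c")], Thm("a", "c"))
env.reset()
_, reward, done = env.step((0, 1))           # a single valid proof step
print(reward, done)                          # 1.0 True
```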

(ii) Premise selection. Since we introduced the concept of premises, when aiming to demonstrate a new theorem we need to select the premises to start from. This problem has been extensively researched in the past with fairly good results, so we're confident it's within reach (a toy baseline is sketched below).
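As a rough illustration, a classic baseline ranks the library by symbol overlap with the target statement, as in the sketch below; actual selectors from the literature (k-NN, naive Bayes, neural rankers) work over much richer features:

```python
import math
from collections import Counter

def features(statement):
    """Bag of symbols; a deliberately crude feature set."""
    return Counter(statement.split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def select_premises(target, library, k=2):
    """Return the k library statements most similar to the target."""
    return sorted(library, key=lambda s: -cosine(features(s), features(target)))[:k]

library = ["!x. x + 0 = x", "!x y. x + y = y + x", "!p q. p /\\ q ==> p"]
print(select_premises("!n. 0 + n = n", library))
```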

(iii) Exogenous terms and substitutions. Some proof steps take as arguments not only previously demonstrated theorems but also exogenous terms. These can be thought of as the analogue of a change of variables in integral calculus. Most of them are "simple parts" of the premises or the target theorem and should be easy to turn into standardized "actions" in our reinforcement learning environment (see the sketch below). But a few of them seem to appear out of nowhere and sometimes "magically" unblock the situation. These are the biggest unknown unknowns of the project. We're hopeful generative models can help here.
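For the "simple parts", one plausible mechanization is to expose the subterms of the premises and the target as a finite set of copy actions, leaving only the rare out-of-nowhere terms to a generative model. The sketch below uses nested tuples as a stand-in term representation:

```python
def subterms(t):
    """Yield t and all of its subterms; ('+', 'x', '0') yields itself, 'x', '0'."""
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:                  # t[0] is the operator symbol
            yield from subterms(child)

def candidate_substitutions(premises, target):
    """Finite "copy" actions for instantiation rules such as INST; anything
    not in this set would have to come from a generative model."""
    seen = []
    for term in list(premises) + [target]:
        for s in subterms(term):
            if s not in seen:
                seen.append(s)
    return seen

print(candidate_substitutions([("+", "x", "0")], ("=", ("+", "x", "0"), "x")))
# [('+', 'x', '0'), 'x', '0', ('=', ('+', 'x', '0'), 'x')]
```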

(iv) Multi-agent search. Armed with (i), (ii) and (iii), target N agents at a new theorem T. They will fail, because T may be ~1M proof steps away (our agents being capable of producing only ~1k steps, as per (i)), but they will potentially cross paths on some theorems as they work toward T. Take these canonically interesting theorems and insert them into the dataset. Retrain the agents. Retarget T. Wash, dry, fold, repeat. A sketch of this outer loop follows below.
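In code, the outer loop might look like the following, where agent.attempt and agent.retrain are hypothetical stand-ins for the machinery of (i) to (iii):

```python
from collections import Counter

def crossings(traces, min_agents=2):
    """Theorems independently reached by at least `min_agents` agents."""
    counts = Counter(thm for trace in traces for thm in set(trace))
    return [thm for thm, n in counts.items() if n >= min_agents]

def expand_and_retarget(agents, dataset, target, rounds=10):
    for _ in range(rounds):
        # each agent searches toward the target for up to ~1k proof steps and
        # returns the set of intermediate theorems it managed to demonstrate
        traces = [a.attempt(target, dataset, max_steps=1000) for a in agents]
        if any(target in trace for trace in traces):
            return dataset                   # target demonstrated
        dataset.extend(crossings(traces))    # insert waypoints as new premises
        for a in agents:
            a.retrain(dataset)               # wash, dry, fold, repeat
    return dataset
```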

A domain where these techniques can be applied is (quite obviously) mathematical theorem proving. The grail in that space would be the automated demonstration of a conjecture that has resisted the efforts of human mathematicians alone. Such a milestone would require the operationalization of all the techniques cited above ((i) to (iv)). A simpler milestone, which does not require (iii), consists in automating the formalization of written mathematical proofs, as the exogenous terms could be taken directly from the existing proof as premises instead of being generated.

Adjacent to that domain lies another with broader industrial implications: automated program verification. Since proving the validity of a program is strictly equivalent to demonstrating a mathematical theorem, this is also the only verifiable way to ensure that a program does not contain bugs or that it complies with a given specification[5]. Interestingly, we also have strong reasons to believe that this domain is substantially less exposed to the complexities associated with (iii).

Our goal is to explore and pursue the application of advanced automated reasoning to both domains, aiming to assist and accelerate the formalization of mathematical proofs and to extend the reach of program verification through greater automation.

[0] https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

[1] https://openai.com/blog/openai-five-benchmark-results/