Bug-free programs are the holy grail of software engineering, and formal verification offers exactly that promise. By representing a program and its specification in formal logic, a theorem prover can certify the program's full correctness, something no amount of testing can ever achieve. The primary obstacle has always been cost: the process requires elite specialists (think PhD-level math) to manually write intricate proofs, an effort so laborious that it is reserved for only the most critical systems, like microkernels or flight control software. This has never scaled. Until now.

We’re focused on breaking this bottleneck. The goal is to make formal verification a standard, scalable part of the software development lifecycle — what testing is today — rather than a prohibitively expensive luxury.

This post focuses on the core mechanism we use to teach our AI to write proofs. To explain this, we must first abstract away some of the complexity.

How Verifiers Work

Our verifier system is built on Isabelle. It can be treated as an oracle, a black box. It takes a triplet as input: a Specification (a formal statement of properties the code must satisfy; more on this below), the Code itself, and a Proof written in Isar, the proof language of the Isabelle theorem prover.

The Specification can range from a single property, e.g. “no buffer overflows” or “all functions and loops always terminate”, to a complete formal definition of what the code should do. Some properties sound deceptively simple, but don’t be fooled: they are extremely powerful and notoriously hard to check with conventional testing. The 2024 CrowdStrike-related Windows outage (>$5B in damages) was caused by an out-of-bounds memory read, exactly the kind of defect a formal verification oracle promises to prevent.
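To make the triplet concrete, here is a minimal sketch of what the oracle's input might look like, with the C code and a candidate Isar proof embedded as strings. The field names, the toy clamp function, and the exact Isar phrasing are all illustrative assumptions for this post, not our actual interchange format:

```python
# A hedged sketch of the (Specification, Code, Proof) triplet the oracle
# consumes. Everything here (names, syntax, the clamp example) is illustrative.
triplet = {
    # The C code under verification.
    "code": """
        unsigned int clamp(unsigned int x, unsigned int hi) {
            return x > hi ? hi : x;
        }
    """,
    # One property the code must satisfy. Stated informally here; in practice
    # it is a formal proposition over the Isabelle translation of the code.
    "specification": "for all x and hi, clamp(x, hi) <= hi",
    # A candidate Isar proof of that property. One wrong step and the
    # oracle rejects the whole argument.
    "proof": r"""
        lemma clamp_le: "clamp x hi \<le> hi"
          unfolding clamp_def by simp
    """,
}
```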

The oracle outputs whether the Proof is correct for the given Code and Specification. For incorrect proofs, the oracle can often provide useful feedback on the point of failure.
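Viewed from the outside, the whole verifier collapses into a single call. The sketch below is how we picture it for exposition; check and OracleResult are hypothetical names, and the real system also manages Isabelle sessions, timeouts, and the C-to-Isabelle translation state:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OracleResult:
    valid: bool              # does the Proof establish the Specification for the Code?
    feedback: Optional[str]  # on failure: the first broken proof step, when recoverable

def check(specification: str, code: str, proof: str) -> OracleResult:
    """Hypothetical wrapper around an Isabelle session: translate the C code,
    state the specification, replay the Isar proof, and report the outcome."""
    raise NotImplementedError  # stands in for the actual Isabelle integration
```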

The internal mechanics of the C-to-Isabelle translation, the formal semantics of C, and the theorem-proving strategies are complex topics we will detail in future posts. (For further reading on a landmark project in this field, see the seL4 microkernel page; we used the seL4 codebase in this project to refine our AI.)

Generating Proofs with LLMs

A naive LLM, even one fine-tuned on programming languages, usually cannot generate a valid formal proof on the first attempt. There simply aren't enough Isar proofs for C programs in any modern LLM's training set. And the Isar proof language is as unforgiving as you would expect: a single incorrect step invalidates the entire argument. The key, therefore, is not single-shot generation but rapid, oracle-guided iteration, in which our LLM refines not just the proof but also the code.

[Figure: the main prover loop]

We let the model generate an initial proof attempt, and the oracle immediately checks it. If the attempt fails, the failure feedback informs the next one, which may change either the proof or the code. Allowed code changes are strictly limited to a few predefined operations, e.g. replacing a for loop with a while loop; the model doesn't generate code directly, it only chooses which change to apply, if any. This guarantees that the semantics of the code are unchanged. The cycle of generation and verification repeats, typically for a handful of iterations, allowing the AI to rapidly refine its own work and converge on a correct proof.
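In pseudocode, the loop looks roughly like the following. The iteration budget, the rewrite catalogue, and every function name here are assumptions made for illustration:

```python
MAX_ITERATIONS = 8  # "a handful"; the real budget is an assumption here
# Predefined, semantics-preserving rewrites the model may pick from,
# e.g. "replace this for loop with an equivalent while loop".
ALLOWED_REWRITES: list = []

def prove(spec: str, code: str, llm, oracle):
    """Oracle-guided refinement: returns a verified triplet or None."""
    proof = llm.propose_proof(spec, code)
    for _ in range(MAX_ITERATIONS):
        result = oracle.check(spec, code, proof)
        if result.valid:
            return spec, code, proof  # a machine-verified triplet
        # The model sees the failure feedback and decides what to try next:
        # a revised proof, and optionally one predefined code rewrite.
        action = llm.refine(spec, code, proof, result.feedback)
        if action.rewrite is not None and action.rewrite in ALLOWED_REWRITES:
            code = action.rewrite.apply(code)  # semantics preserved by construction
        proof = action.proof
    return None  # budget exhausted; recorded as a failure
```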

This iterative refinement helps a non-finetuned LLM improve upon its proofs. But our larger goal is to create an AI that gets systematically better over time. This is where our self-supervised learning architecture comes into play. The final output of the process described above—whether a success or a failure—becomes training data for future models.

The Refinement Flywheel

When the AI successfully produces a valid proof, the entire (Specification, Code, Proof) triplet is stored in our "Golden Dataset." This is a growing library of perfect, machine-verified examples of correctness. When the AI fails to produce a valid proof after all its iterations, the final failed attempt and a summary of the approaches tried are stored in our "Failure Analysis Dataset."
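The routing itself is simple; here is a minimal sketch, with hypothetical record shapes and field names:

```python
def record(spec: str, code: str, outcome, golden: list, failures: list) -> None:
    """Route one finished prover run into the appropriate dataset."""
    if outcome.verified:
        # A machine-checked (Specification, Code, Proof) triplet.
        golden.append((spec, code, outcome.proof))
    else:
        # The final failed attempt plus a summary of the approaches tried.
        failures.append((spec, code, outcome.last_attempt, outcome.summary))
```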

[Figure: the training loop]

These two datasets form a perpetual, self-improving feedback loop. They are used to continually refine the base LLM. Unlike typical datasets scraped from the web, ours are perfectly labeled by a logical oracle. The Golden Dataset provides high-signal examples of success, while the Failure Analysis Dataset teaches the model the common failure modes.

This process of Continual Learning is how we bootstrap correctness. AI_n generates new verified training data along with new failures. That data is used to produce AI_n+1, which is more capable and can solve more complex verification tasks. This, in turn, produces even more valuable data, accelerating the flywheel.
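Put together, the flywheel is a short outer loop. Every name below (run_prover_campaign, finetune, harder_tasks) is a hypothetical stand-in for the pipeline we'll describe in future posts:

```python
def flywheel(base_model, task_pool):
    """One hypothetical rendering of the refinement flywheel."""
    model = base_model  # AI_0
    while True:
        # Run the prover loop over the (spec, code) task pool, collecting
        # oracle-labeled successes and failures.
        golden, failures = run_prover_campaign(model, task_pool)
        # Fine-tune on data labeled by the logical oracle, not by humans:
        # golden triplets as positives, failure analyses as negative signal.
        model = finetune(model, successes=golden, failure_modes=failures)  # AI_n -> AI_n+1
        # The stronger model unlocks harder tasks, which feed the next round.
        task_pool = harder_tasks(task_pool)
```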

In future posts, we'll explore exactly how this continual learning pipeline is set up. Stay tuned!