By Le Zhuo and Sayak Paul
A few months before the release of Nano Banana 2, we introduced StructBench — a benchmark for evaluating models on non-natural images like diagrams, math figures, charts, and documents.
Our motivation was simple: today’s image models are heavily optimized for aesthetics but struggle with factuality and structural reasoning. If we want truly unified multimodal models, the training mix needs non-natural data too.
We tested the state of the art at the time—including Nano Banana 1.0 and gpt-image—and all performed surprisingly poorly on StructBench.
Nano Banana 2 (NB2) just dropped, and its improvements strongly validate this direction 🤯
It achieves 90+ on our image generation tasks—by far the best we’ve seen.
But NB2 isn’t perfect: we still find failure cases where it misinterprets instructions or misses structural details.
Excited to see the field moving toward models that reason as well as they render.

We also zoomed in category-wise. NB2 still does quite poorly on charts, while excelling on math figures and tables!
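If you want to slice results the same way, here is a minimal sketch of a per-category breakdown. The `results.jsonl` filename and the `category`/`score` field names are assumptions for illustration, not the actual StructBench output format.

```python
import json
from collections import defaultdict

# Hypothetical per-sample results file; field names are assumptions,
# not the actual StructBench tooling.
scores = defaultdict(list)
with open("results.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        scores[sample["category"]].append(sample["score"])

# Average score per category (charts, math figures, tables, ...)
for category, values in sorted(scores.items()):
    print(f"{category:>12}: {sum(values) / len(values):.1f}")
```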

Below, we provide some examples of both failure and success.
Instruction: Move the dashed line so it starts at vertex B instead of vertex A, keeping its other endpoint at F.
NB2 makes the right move but connects C and F incorrectly.


Instruction: Add a new category 'Returns' with a value of 2100 to the data and labels.
NB2 adds the new category, but with the wrong proportion and size.