By Le Zhuo and Sayak Paul

A few months before the release of Nano Banana 2, we introduced StructBench — a benchmark for evaluating models on non-natural images like diagrams, math figures, charts, and documents.

Our motivation was simple: today’s image models are overly optimized for aesthetics, but struggle with factuality + structural reasoning. If we want truly unified multimodal models, the training mix needs non-natural data too.

We tested the state of the art at the time—including Nano Banana 1.0 and gpt-image—and all performed surprisingly poorly on StructBench.

Nano Banana 2 (NB2) just dropped, and its improvements strongly validate this direction 🤯

It achieves 90+ on our image generation tasks—by far the best we’ve seen.

But NB2 still isn't perfect: we find failure cases where it misinterprets instructions or misses structural details.

Excited to see the field moving toward models that reason as well as they render.

[Image: overall StructBench results]

We also zoomed in category-wise. NB2 still does quite poorly on charts, while excelling on math figures and tables!

[Image: category-wise StructBench results]

Below, we provide some examples of both failures and successes.

Instruction: Move the dashed line so it starts at vertex B instead of vertex A, keeping its other endpoint at F.

NB2 makes the right move but connects vertices C and F incorrectly.

[Image: input figure]

[Image: NB2 output]

Instruction: Add a new category 'Returns' with a value of 2100 to the data and labels.

NB2 adds the new category, but with the wrong ratio and size.
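For reference, the chart edit this instruction describes is a small, exact data change: append the new category to both the data and the labels, then re-derive each slice's share from the new total. A minimal Python sketch of the ground truth (the existing category names and values here are hypothetical, since the benchmark's underlying data isn't shown in this post):

```python
# Hypothetical data behind the chart; only 'Returns' = 2100 comes
# from the actual benchmark instruction.
data = [5400, 3200, 1800]
labels = ["Sales", "Costs", "Profit"]

# The requested edit: add 'Returns' (2100) to the data and labels,
# keeping the two lists aligned.
data.append(2100)
labels.append("Returns")

# The correct size of the new slice is its value over the new total.
# NB2's rendered slice did not match this proportion.
returns_share = data[labels.index("Returns")] / sum(data)
```

Checking the new slice against `returns_share` is also how one would verify the rendered chart: the edit is fully specified by the instruction, so any deviation in proportion is a structural error, not a stylistic choice.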