Summary

I analyzed a popular LLM’s ability to understand residential 2D floorplan images and to reason about common tasks in that domain. The analysis identified common failure modes. To address them, I propose that we generate a synthetic dataset with validated ground truth for training LLMs, and that we apply manual QA supervision to ensure its correctness.

LLM tasks

The LLM will be asked to perform the tasks described below and will be evaluated based on the results.
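As an illustration of how results could be evaluated against validated ground truth, here is a minimal Python sketch. It is not the evaluation actually used: the `TaskResult` record, its field names, the task labels, and the normalized exact-match rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # All field names are hypothetical; they are not taken from the proposal.
    task: str          # label for the task, e.g. "png_to_json" or "qa"
    sample_id: str     # identifier of the floorplan image
    expected: object   # validated ground truth
    predicted: object  # parsed LLM output


def normalize(value):
    """Strip and lower-case strings so trivial formatting differences are not counted as errors."""
    if isinstance(value, str):
        return value.strip().lower()
    return value


def score(results):
    """Return (accuracy per task, list of failing results) using normalized exact match."""
    totals, correct, failures = {}, {}, []
    for r in results:
        totals[r.task] = totals.get(r.task, 0) + 1
        if normalize(r.expected) == normalize(r.predicted):
            correct[r.task] = correct.get(r.task, 0) + 1
        else:
            failures.append(r)
    accuracy = {task: correct.get(task, 0) / n for task, n in totals.items()}
    return accuracy, failures
```

The returned list of failing results is what a reviewer would inspect when grouping errors into failure modes. Exact match is a deliberately strict choice; a real evaluation might need tolerances for numeric values or partial credit for nested JSON.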

PNG ⇒ JSON

Generate a JSON object that contains:

PNG ⇒ Question answers

Answer specific text questions about the floorplan:

Manual experiment

The LLM tasks were run manually against a small dataset, and the results were analyzed to identify common failure modes.