I analyzed a popular LLM’s ability to understand residential 2D floorplan images and to reason about common tasks in that domain. The analysis identified common failure modes, and I propose that we generate a synthetic dataset with validated ground truth for training LLMs, with manual QA supervision applied to ensure correctness.
The LLM will be asked to perform the tasks described below and will be evaluated based on the results.
Generate a JSON object that contains:
Answer specific text questions about the floorplan:
I ran the LLM tasks manually against a small dataset and analyzed the results to identify common failure modes.
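The manual analysis described above could later be automated once ground truth exists. The following is only a minimal sketch, assuming each model response and each ground-truth record can be flattened into a dictionary. The function names and the keys `room_count` and `has_garage` are hypothetical placeholders, not fields defined in this proposal.

```python
from collections import Counter


def score_answers(predicted: dict, truth: dict) -> dict:
    """Compare one model response with its ground truth, key by key."""
    outcome = {}
    for key, expected in truth.items():
        if key not in predicted:
            outcome[key] = "missing"   # the model omitted this field
        elif predicted[key] == expected:
            outcome[key] = "correct"
        else:
            outcome[key] = "wrong"     # the model gave a different value
    return outcome


def tally_failures(samples) -> Counter:
    """Count non-correct outcomes per (key, status) across all samples."""
    failures = Counter()
    for predicted, truth in samples:
        for key, status in score_answers(predicted, truth).items():
            if status != "correct":
                failures[(key, status)] += 1
    return failures


# Hypothetical (predicted, truth) pairs; the keys are illustrative only.
samples = [
    ({"room_count": 5, "has_garage": True},
     {"room_count": 6, "has_garage": True}),
    ({"room_count": 4},
     {"room_count": 4, "has_garage": False}),
]

failure_counts = tally_failures(samples)
print(failure_counts.most_common())
```

Sorting the tally by frequency gives a rough ranking of which fields fail most often. Exact-match comparison is the simplest possible choice here; numeric fields such as areas would probably need a tolerance instead.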