To ensure high-quality data for model training, I have defined four core evaluation pillars. Together, they balance technical accuracy against user experience.
| Criterion | What it Measures | Why it Matters (PM Perspective) |
|---|---|---|
| Task Completion | Success of the primary business intent (e.g., booking confirmed). | The "North Star" metric. If the AI fails the task, the interaction has zero utility regardless of other factors. |
| Speech Naturalness | Prosody, pacing, and human-like inflection/tone. | Essential for user trust. Robotic or awkward speech leads to immediate user drop-offs in real-world scenarios. |
| Comprehension | Accurate interpretation of slang, regional accents, and interruptions. | Helps identify whether a failure stems from "Hearing" (ASR) or "Thinking" (LLM logic). Critical for diverse user bases. |
| Adaptability | Ability to handle non-linear flows or unexpected user responses. | Measures system robustness. A high-quality AI shouldn't "loop" when a user goes off-script. |
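
To make the rubric concrete, here is a minimal sketch of how a single rating could be recorded and rolled up into one number. The `CallRating` schema, the 1 to 5 scale, and the `overall_score` aggregation are all assumptions for illustration; the table above does not specify a scale or a formula. The only rule taken from the table is that a failed task yields zero utility regardless of the other scores.

```python
from dataclasses import dataclass

# Assumed rating scale; the rubric itself does not define one.
SCALE_MIN, SCALE_MAX = 1, 5


@dataclass(frozen=True)
class CallRating:
    """One rater's judgment of one call (hypothetical schema)."""
    task_completed: bool       # Task Completion (treated as pass/fail here)
    speech_naturalness: int    # Speech Naturalness
    comprehension: int         # Comprehension
    adaptability: int          # Adaptability

    def __post_init__(self):
        # Reject scores outside the assumed scale.
        for name in ("speech_naturalness", "comprehension", "adaptability"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(
                    f"{name} must be in {SCALE_MIN}..{SCALE_MAX}, got {value}"
                )


def overall_score(rating: CallRating) -> float:
    """Return 0.0 if the task failed, else the mean of the other three scores."""
    if not rating.task_completed:
        return 0.0
    total = rating.speech_naturalness + rating.comprehension + rating.adaptability
    return total / 3
```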
https://www.figma.com/design/ZZq7SxRHYpoFiyfS0vcsJP/joshtalk-assignment?node-id=41-2&t=qDbyLuiLpVPcZ5rv-1
The "Architect Evaluation Suite" is designed to minimize cognitive load by separating the "Listening" and "Rating" phases into a clean, two-stage workspace.
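
The two-stage split can be sketched as a small state machine. This is an illustrative model only: the `EvaluationSession` class, its method names, and the rule that rating stays locked until listening is finished are assumptions, not a description of the actual Figma prototype.

```python
from enum import Enum


class Stage(Enum):
    LISTENING = "listening"
    RATING = "rating"


class EvaluationSession:
    """Hypothetical two-stage session: listen first, then rate."""

    def __init__(self, call_id: str):
        self.call_id = call_id
        self.stage = Stage.LISTENING
        self.ratings = {}

    def finish_listening(self) -> None:
        # Assumed rule: the rating stage opens only after listening is done.
        self.stage = Stage.RATING

    def rate(self, criterion: str, score: int) -> None:
        # Refuse ratings while the evaluator is still in the listening stage.
        if self.stage is not Stage.RATING:
            raise RuntimeError("Finish listening before rating.")
        self.ratings[criterion] = score
```

With this model, calling `rate` before `finish_listening` raises an error, so the evaluator only ever deals with one phase at a time.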