1. Evaluation Criteria

To ensure high-quality data for model training, I have defined four core pillars for evaluation. Together they balance technical accuracy against user experience.

| Criterion | What it Measures | Why it Matters (PM Perspective) |
|---|---|---|
| Task Completion | Success of the primary business intent (e.g., booking confirmed). | The "North Star" metric. If the AI fails the task, the interaction has zero utility regardless of other factors. |
| Speech Naturalness | Prosody, pacing, and human-like inflection/tone. | Essential for user trust. Robotic or awkward speech leads to immediate user drop-offs in real-world scenarios. |
| Comprehension | Accurate interpretation of slang, regional accents, and interruptions. | Helps identify if failures are due to "Hearing" (ASR) or "Thinking" (LLM logic). Critical for diverse user bases. |
| Adaptability | Ability to handle non-linear flows or unexpected user responses. | Measures system robustness. A high-quality AI shouldn't "loop" when a user goes off-script. |
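
To make the four pillars concrete, here is a minimal sketch of how a single rating could be recorded and validated. The class name, the field names, the 1-5 scale, and the rule that a lowest-possible Task Completion score zeroes the overall result are all my assumptions for illustration; this document does not fix a scale or an aggregation formula.

```python
from dataclasses import dataclass

# Assumed rating scale; the document does not specify one.
MIN_SCORE, MAX_SCORE = 1, 5

# The four pillars from the criteria table (field names are hypothetical).
PILLARS = ("task_completion", "speech_naturalness", "comprehension", "adaptability")


@dataclass(frozen=True)
class PillarScores:
    task_completion: int
    speech_naturalness: int
    comprehension: int
    adaptability: int

    def __post_init__(self) -> None:
        # Reject any score outside the assumed scale.
        for name in PILLARS:
            value = getattr(self, name)
            if not MIN_SCORE <= value <= MAX_SCORE:
                raise ValueError(
                    f"{name} must be between {MIN_SCORE} and {MAX_SCORE}, got {value}"
                )

    def overall(self) -> float:
        """Mean of the four pillars, gated on Task Completion.

        Assumption: a Task Completion score of MIN_SCORE means the task
        failed, so the interaction is treated as having zero utility.
        """
        if self.task_completion == MIN_SCORE:
            return 0.0
        return sum(getattr(self, name) for name in PILLARS) / len(PILLARS)


if __name__ == "__main__":
    print(PillarScores(4, 5, 3, 4).overall())
    print(PillarScores(1, 5, 5, 5).overall())
```

The gating rule mirrors the "North Star" reasoning in the table: strong naturalness or comprehension cannot compensate for a failed task. A weighted average would be a reasonable alternative if partial task success needs to count.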

2. Evaluator Interface & Workflow

https://www.figma.com/design/ZZq7SxRHYpoFiyfS0vcsJP/joshtalk-assignment?node-id=41-2&t=qDbyLuiLpVPcZ5rv-1

The "Architect Evaluation Suite" is designed to minimize cognitive load by separating the "Listening" and "Rating" phases into a clean, two-stage workspace.

Interface Components:

Evaluator Workflow:

  1. Objective Review: Read the "Core Objective" and Blueprint Checklist to understand success criteria.
  2. Synced Auditing: Play audio (using 1.5x/2x speed if needed) while following the auto-highlighted transcript.
  3. Identify Anomalies: Note specific timestamps where the AI struggled or succeeded.
  4. Structured Scoring: Transition to the rating screen to grade the 4 pillars.
  5. Confidence Check: Set your confidence level for the submission.
  6. Submit & Batch: Submit the review; the system automatically loads the next prioritized call.
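
The workflow above can be sketched as a small data model plus a priority queue. Everything named here (Anomaly, Review, ReviewQueue, the 0.0-1.0 confidence scale, and the numeric priority where higher means "review sooner") is a hypothetical illustration of steps 3 to 6, not the actual design of the Architect Evaluation Suite.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Anomaly:
    timestamp_s: float  # position in the call audio, in seconds (step 3)
    note: str           # what the AI struggled with or did well


@dataclass
class Review:
    call_id: str
    scores: dict[str, int]  # one score per pillar (step 4)
    confidence: float       # evaluator confidence, assumed 0.0-1.0 (step 5)
    anomalies: list[Anomaly] = field(default_factory=list)


class ReviewQueue:
    """Stores submitted reviews and hands out the next prioritized call (step 6)."""

    def __init__(self) -> None:
        self._heap: list[tuple[float, int, str]] = []
        self._counter = 0  # tie-breaker: equal priorities come out in insertion order
        self.submitted: list[Review] = []

    def add_call(self, call_id: str, priority: float) -> None:
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, call_id))
        self._counter += 1

    def next_call(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def submit(self, review: Review) -> Optional[str]:
        # Record the review, then immediately load the next call.
        self.submitted.append(review)
        return self.next_call()


if __name__ == "__main__":
    queue = ReviewQueue()
    queue.add_call("call-a", priority=1)
    queue.add_call("call-b", priority=5)
    current = queue.next_call()
    review = Review(
        call_id=current,
        scores={"task_completion": 4, "speech_naturalness": 3,
                "comprehension": 4, "adaptability": 5},
        confidence=0.8,
        anomalies=[Anomaly(42.5, "AI repeated the confirmation prompt")],
    )
    print(queue.submit(review))
```

Returning the next call from submit keeps the evaluator in a continuous loop without a separate "fetch" action, which matches the batch behavior described in step 6.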