2. Quality Assurance Framework for Multi-Dialect Voice AI

1. Evaluation Criteria

To ensure high-quality data for model training, I have defined four core evaluation pillars. Together they balance technical accuracy with user experience.

Criterion	What it Measures	Why it Matters (PM Perspective)
Task Completion	Success of the primary business intent (e.g., booking confirmed).	The "North Star" metric. If the AI fails the task, the interaction has zero utility regardless of other factors.
Speech Naturalness	Prosody, pacing, and human-like inflection/tone.	Essential for user trust. Robotic or awkward speech leads to immediate user drop-offs in real-world scenarios.
Comprehension	Accurate interpretation of slang, regional accents, and interruptions.	Helps identify whether a failure stems from "Hearing" (ASR) or "Thinking" (LLM logic). Critical for diverse user bases.
Adaptability	Ability to handle non-linear flows or unexpected user responses.	Measures system robustness. A high-quality AI shouldn't "loop" when a user goes off-script.
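As an illustration, the four pillars could be captured as one validated record per reviewed call. This is a minimal sketch, not a finalized schema: the class name `PillarScores`, the field names, and the `weakest` helper are hypothetical, and the 1-5 range is taken from the rating scale described in the interface section.

```python
from dataclasses import dataclass

# Hypothetical field names, one per pillar in the table above.
CRITERIA = ("task_completion", "speech_naturalness", "comprehension", "adaptability")


@dataclass(frozen=True)
class PillarScores:
    task_completion: int
    speech_naturalness: int
    comprehension: int
    adaptability: int

    def __post_init__(self):
        # Reject anything outside the 1-5 scale before it reaches the training data.
        for name in CRITERIA:
            value = getattr(self, name)
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def weakest(self):
        """Return the name(s) of the lowest-scoring pillar(s), in table order."""
        low = min(getattr(self, name) for name in CRITERIA)
        return [name for name in CRITERIA if getattr(self, name) == low]


scores = PillarScores(task_completion=5, speech_naturalness=3,
                      comprehension=4, adaptability=3)
weak = scores.weakest()
```

Keeping the pillars as separate fields, rather than one averaged score, preserves the distinction the table draws between task failure and, for example, comprehension failure.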

2. Evaluator Interface & Workflow

Figma design: https://www.figma.com/design/ZZq7SxRHYpoFiyfS0vcsJP/joshtalk-assignment?node-id=41-2&t=qDbyLuiLpVPcZ5rv-1

The "Architect Evaluation Suite" is designed to minimize cognitive load by separating the "Listening" and "Rating" phases into a clean, two-stage workspace.

Interface Components:

Synced Transcript & Player: High-fidelity waveform with real-time text highlighting. Evaluators can click any sentence to jump the audio to that timestamp.
Blueprint Checklist: A dedicated panel showing the "Core Objective" and required logic steps to keep the evaluator grounded in the call's goal.
Semantic Rating Sliders: 1-5 star scales that show a text label for each value (e.g., "4 = Proficient") so that evaluators read each score the same way.
Reviewer Confidence Toggle: A Low/Medium/High selector to flag reviews that might need a secondary audit.
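The synced transcript behavior can be sketched as a lookup over sentence timestamps: given the current playback time, find the sentence to highlight, and given a clicked sentence, return the time to seek to. This is only a sketch under assumptions: the `Segment` type and both function names are hypothetical, and it assumes segments are sorted by start time and do not overlap.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the recording
    end: float    # seconds; the segment covers start <= t < end
    text: str


def active_segment_index(segments, playback_time):
    """Return the index of the sentence to highlight, or None during a gap."""
    starts = [segment.start for segment in segments]
    # Last segment whose start is at or before the playback position.
    i = bisect_right(starts, playback_time) - 1
    if i >= 0 and playback_time < segments[i].end:
        return i
    return None


def seek_time_for_click(segments, index):
    """Return the audio position to jump to when a sentence is clicked."""
    return segments[index].start


transcript = [
    Segment(0.0, 2.5, "Hello, how can I help you?"),
    Segment(3.0, 6.0, "I want to book a table for two."),
]
highlighted = active_segment_index(transcript, 3.4)
jump_to = seek_time_for_click(transcript, 1)
```

A binary search keeps the highlight lookup cheap enough to run on every playback tick, including at 1.5x or 2x speed.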

Evaluator Workflow:

Objective Review: Read the "Core Objective" and Blueprint Checklist to understand success criteria.
Synced Auditing: Play audio (using 1.5x/2x speed if needed) while following the auto-highlighted transcript.
Identify Anomalies: Note specific timestamps where the AI struggled or succeeded.
Structured Scoring: Transition to the rating screen to grade the four pillars.
Confidence Check: Set your confidence level (Low/Medium/High) for the submission.
Submit & Batch: Submit the review; the system automatically loads the next prioritized call.
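The last two steps of the workflow can be sketched as a small priority queue. Everything named here is hypothetical: the `ReviewQueue` class, the numeric priority (lower means sooner), and the rule that only Low-confidence reviews go to the secondary audit list are assumptions made for illustration, since the framework does not specify how calls are prioritized or which confidence level triggers an audit.

```python
import heapq
from enum import Enum
from itertools import count


class Confidence(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class ReviewQueue:
    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: equal priorities keep insertion order
        self.reviews = {}      # call_id -> (scores, confidence)
        self.audit_list = []   # calls flagged for a secondary audit

    def add_call(self, call_id, priority):
        # Assumption: a lower number means the call should be reviewed sooner.
        heapq.heappush(self._heap, (priority, next(self._order), call_id))

    def next_call(self):
        """Pop the next prioritized call, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def submit(self, call_id, scores, confidence):
        """Store a review, flag it if confidence is low, and load the next call."""
        self.reviews[call_id] = (scores, confidence)
        if confidence is Confidence.LOW:  # assumed audit rule
            self.audit_list.append(call_id)
        return self.next_call()


queue = ReviewQueue()
queue.add_call("call-b", priority=2)
queue.add_call("call-a", priority=1)
queue.add_call("call-c", priority=2)

current = queue.next_call()
following = queue.submit(current, {"task_completion": 2}, Confidence.LOW)
```

Returning the next call from `submit` mirrors the "Submit & Batch" step, where the evaluator never has to pick the next item manually.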