Core message
The minimum takeaway from this talk:
- We have solved voice demos, but we are yet to solve production voice AIs
- There are plenty of challenges in building voice AI at the frontier, and the approach to tackling them remains the same as for anything else: engineering maturity and good product sense
A visual analogy of scaling a summit will be in the background of the talk.
- The implication is that the summit looks tempting and easy to reach, but a lot of challenges lurk when you try to scale it.
- I talk about how impressive voice AI has become, and show how easy it is to make a solid demo
- Then I show the demo’s various failure modes, and how easy it is to end up with a bad, frustrating product without enough care
Traps
- “We just need to improve and find the right prompt”
(an explanation of why playing with the prompts in isolation won’t help. will have fabricated examples)
- listen to real conversations, simulate scenarios, analyse and understand the use case
- annotate the good vs bad in a group
- use that to build an eval system (a minimal harness is sketched after this list)
- only then work on prompts
- otherwise you are playing whac-a-mole: every change will improve some things and break others, and a lot of flailing will happen
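To make the whac-a-mole point concrete, here is a minimal sketch (in Python) of the kind of eval harness that lets a prompt change be scored against the whole set of annotated conversations rather than eyeballed on one example. The scenario, its check, and `run_agent` are fabricated stand-ins, not a real implementation:

```python
# A sketch of a tiny eval harness. `run_agent` is a stand-in for however you
# invoke your voice agent with a candidate prompt; the scenario and its check
# are fabricated examples of what annotated real conversations turn into.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    user_turns: list[str]          # lifted from real, annotated conversations
    check: Callable[[str], bool]   # encodes what the group annotated as "good"

def run_agent(prompt: str, user_turns: list[str]) -> str:
    """Stub: call your actual agent here and return its final reply."""
    return "stubbed reply"

SCENARIOS = [
    Scenario(
        name="caller asks to reschedule",
        user_turns=["Hi, I need to move my appointment to Friday."],
        check=lambda reply: "friday" in reply.lower(),
    ),
    # ...more scenarios harvested from annotated transcripts
]

def evaluate(prompt: str) -> float:
    """Score a candidate prompt against the whole scenario set, not one example."""
    passed = sum(s.check(run_agent(prompt, s.user_turns)) for s in SCENARIOS)
    return passed / len(SCENARIOS)

if __name__ == "__main__":
    print(f"pass rate: {evaluate('candidate prompt v2'):.0%}")
```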
- Prioritising features over the actual conversation experience
- The most important part of making a good voice AI product is the conversational experience: focus most of your time and energy there, and prioritise it over adding features and UI affordances. No one wants to talk to a bot that neither understands them nor gets their job done, no matter how good the rest of the experience is.
- types of errors that kill any desire to talk to your bot
- slow to respond
- misunderstands user intent
- doesn’t get the user’s job done correctly
- “Let’s build this complex orchestration/agent handoffs”
(an explanation of why complex solutioning will kill the product. will have fabricated examples)
- tempting to try and engineer complex and clever solutions in an attempt to make reality match the demo
- this makes the system even more inscrutable and harder to improve upon
- it also slows down the feedback loop that improves the system, and often doesn’t even bring much improvement, because of the non-determinism driving these applications
What we found helpful:
- observability
- Custom metrics for latencies are fine, but also ensure structured logs for more fine-grained observability
- Some important metrics: time to first byte, initial setup time, tool call durations, etc.
- (these will be explained with examples; see the logging sketch below)
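A minimal sketch of what the structured logs could look like, assuming a JSON-lines sink; the field names (ttfb_ms, setup, tool names) are illustrative rather than any standard schema:

```python
# Minimal structured logging for per-turn latency metrics: setup time,
# tool call durations, and time to first audio byte.
import json, sys, time, uuid
from contextlib import contextmanager

def log_event(event: str, **fields) -> None:
    """Emit one structured log line that downstream tooling can aggregate."""
    print(json.dumps({"ts": time.time(), "event": event, **fields}), file=sys.stderr)

@contextmanager
def timed(event: str, **fields):
    """Time a span (e.g. a tool call or pipeline setup) and log its duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log_event(event, duration_ms=round((time.perf_counter() - start) * 1000, 1), **fields)

# Example usage inside one conversation turn:
turn_id = str(uuid.uuid4())
with timed("pipeline_setup", turn_id=turn_id):
    time.sleep(0.05)                      # stand-in for connecting STT/LLM/TTS
turn_start = time.perf_counter()
with timed("tool_call", turn_id=turn_id, tool="lookup_booking"):
    time.sleep(0.02)                      # stand-in for the real tool call
log_event("first_audio_byte", turn_id=turn_id,
          ttfb_ms=round((time.perf_counter() - turn_start) * 1000, 1))
```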
- eval strategy
- Start with the unscalable manual work of recording and collecting conversations, listening to them, and taking notes on what’s good and what’s not.
- Avoid off-the-shelf, generic LLM-as-a-judge metrics, especially ones that output numeric scores. Instead, use the learnings from the manual evaluations to create a judge that fits the intended behaviour in your domain (see the sketch below).
- Create a data flywheel, and regularly get the team together to annotate transcripts with comments. Do not silo those writing prompts from the engineering team; developers must be involved in creating the prompts.
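A minimal sketch of a domain-specific judge, assuming a `call_llm` helper that wraps whichever model provider you use; the rubric questions are examples of criteria that would come out of your own annotation sessions, and the judge returns per-criterion pass/fail rather than an opaque numeric score:

```python
# A domain-specific LLM-as-a-judge sketch: explicit PASS/FAIL criteria derived
# from annotation sessions, instead of a generic 1-10 quality score.
JUDGE_RUBRIC = """You are reviewing one transcript of our appointment-booking voice agent.
Answer each question with exactly PASS or FAIL, one per line:
1. Did the agent confirm the caller's name and requested date before booking?
2. Did the agent avoid reading out internal IDs or tool output verbatim?
3. Did the caller get their job done without repeating themselves?
"""

def call_llm(prompt: str) -> str:
    """Stub: replace with a call to your model provider."""
    return "PASS\nPASS\nFAIL"

def judge(transcript: str) -> dict[int, bool]:
    """Return per-criterion pass/fail results for one transcript."""
    raw = call_llm(JUDGE_RUBRIC + "\nTranscript:\n" + transcript)
    lines = [l.strip().upper() for l in raw.splitlines() if l.strip()]
    return {i + 1: line.startswith("PASS") for i, line in enumerate(lines)}

print(judge("Agent: Hi! ... Caller: I'd like to book for Tuesday ..."))
```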
- engineering maturity
- don’t jump to:
- deep workflows, lots of orchestration
- agent handoffs
- RAG + vector indexes/graphs etc
- start with the simplest possible thing and only build up to one of these based on experiments checked against the eval (see the sketch below)
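A minimal sketch of that gating step, assuming an `evaluate()` function along the lines of the harness sketched earlier; the variant names and pass rates are fabricated:

```python
# Gate an architecture change (e.g. agent handoffs, RAG) on the eval results
# rather than on how clever the design looks.
def evaluate(variant: str) -> float:
    """Stub: run the eval suite against a named pipeline variant."""
    return {"single_prompt_baseline": 0.82, "with_agent_handoffs": 0.79}[variant]

baseline = evaluate("single_prompt_baseline")
candidate = evaluate("with_agent_handoffs")

# Only take on the extra complexity if the eval shows a real win.
if candidate > baseline + 0.02:
    print("adopt the more complex variant")
else:
    print(f"keep the simple baseline ({baseline:.0%} vs {candidate:.0%})")
```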