A no-BS guide based on actually building real-time two-way voice into a production health tracking app. Every mistake, dead end, and "aha" moment included.
Google's real-time, bidirectional audio AI model. You stream audio in, it streams audio back — like a phone call with AI. It does speech recognition, understanding, reasoning, and voice response ALL in one model. No separate STT → LLM → TTS pipeline needed.
Key specs:
- API version: v1alpha (NOT v1beta — critical!)
- Model: gemini-3.1-flash-live-preview
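To make those specs concrete, here is a minimal sketch of a Live session using the Python google-genai SDK. The SDK choice is ours, not from the original app, and get_mic_chunk() / play() are hypothetical stand-ins for real microphone and speaker I/O:

```python
# Sketch of one Live session: mic audio in, spoken audio out, one model.
# Assumes the google-genai Python SDK; get_mic_chunk() and play() are
# hypothetical stand-ins for real audio capture and playback.
import asyncio

from google import genai
from google.genai import types

client = genai.Client(
    api_key="YOUR_API_KEY",
    http_options={"api_version": "v1alpha"},  # NOT v1beta (see specs above)
)

MODEL = "gemini-3.1-flash-live-preview"


def get_mic_chunk() -> bytes:
    # Hypothetical: returns 100 ms of 16 kHz 16-bit mono PCM from the mic.
    return b"\x00" * 3200


def play(pcm: bytes) -> None:
    # Hypothetical: sends the model's PCM audio to the speakers.
    pass


async def talk() -> None:
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:

        async def send_mic() -> None:
            while True:
                await session.send_realtime_input(
                    audio=types.Blob(
                        data=get_mic_chunk(),
                        mime_type="audio/pcm;rate=16000",
                    )
                )
                await asyncio.sleep(0.1)  # pace uploads to real time

        async def play_replies() -> None:
            while True:
                async for msg in session.receive():
                    if msg.data:  # inline audio bytes from the model
                        play(msg.data)

        # Upstream and downstream run concurrently: that's the "phone call".
        await asyncio.gather(send_mic(), play_replies())


asyncio.run(talk())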
How is it different from regular Gemini?
Regular Gemini models (gemini-2.5-flash, etc.) support generateContent: you send text or images and get text back over a plain REST API. The Live model ONLY supports bidiGenerateContent, which is bidirectional streaming over a WebSocket. You can't use curl or a regular API call. This tripped us up early.
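For contrast, a sketch of the two call styles side by side. The REST path is the documented generateContent endpoint; the WebSocket URL follows Google's published BidiGenerateContent pattern, but treat the exact v1alpha path as an assumption to verify against the current docs:

```python
# Sketch: why curl works for regular models but not for the Live model.
import asyncio
import json

import requests    # pip install requests
import websockets  # pip install websockets

API_KEY = "YOUR_API_KEY"

# Regular model: one-shot request/response over REST. curl-able.
resp = requests.post(
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"gemini-2.5-flash:generateContent?key={API_KEY}",
    json={"contents": [{"parts": [{"text": "Hello"}]}]},
)
print(resp.json())

# Live model: a persistent WebSocket speaking bidiGenerateContent.
# The v1alpha path segment is assumed from the published v1beta pattern.
LIVE_URL = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha."
    f"GenerativeService.BidiGenerateContent?key={API_KEY}"
)


async def open_live_session() -> None:
    async with websockets.connect(LIVE_URL) as ws:
        # The first frame must be a setup message naming the model.
        await ws.send(json.dumps(
            {"setup": {"model": "models/gemini-3.1-flash-live-preview"}}
        ))
        print(await ws.recv())  # expect a setupComplete ack, then stream


asyncio.run(open_live_session())
```

In practice an SDK hides this handshake; the raw frames only matter when you're debugging the connection itself.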
This is where most people get confused. Here's the exact flow: