A no-BS guide based on actually building real-time two-way voice into a production health tracking app. Every mistake, dead end, and "aha" moment included.


What is Gemini 3.1 Flash Live?

Google's real-time, bidirectional audio AI model. You stream audio in, it streams audio back — like a phone call with AI. It does speech recognition, understanding, reasoning, and voice response ALL in one model. No separate STT → LLM → TTS pipeline needed.

Key specs:

  - Bidirectional audio streaming: you send and receive audio simultaneously, like a phone call
  - One model handles speech recognition, understanding, reasoning, and voice output (no STT → LLM → TTS chain)
  - WebSocket-only: exposed through bidiGenerateContent, not the REST generateContent endpoint
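
Here's roughly what that stream-in, stream-out loop looks like. This is a minimal sketch assuming the google-genai Python SDK's Live surface (client.aio.live.connect, send_realtime_input); the model ID below is a placeholder, so grab the exact Live model string from the docs:

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

# Placeholder: substitute the exact Live model ID from the current docs.
MODEL = "models/gemini-3.1-flash-live"

async def talk(pcm_chunks: list[bytes]) -> bytes:
    """Stream 16 kHz PCM audio in; return the model's spoken reply as raw bytes."""
    config = {"response_modalities": ["AUDIO"]}
    reply = bytearray()
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        for chunk in pcm_chunks:
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        async for message in session.receive():
            if message.data:  # raw audio bytes streamed back from the model
                reply.extend(message.data)
    return bytes(reply)

# Usage: asyncio.run(talk(chunks)) where chunks is your mic capture, split into frames.
```
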
How is it different from regular Gemini?

Regular Gemini models (gemini-2.5-flash, etc.) support generateContent — you send text/images, get text back via REST API. The Live model ONLY supports bidiGenerateContent — bidirectional streaming over WebSocket. You can't use curl or a regular API call. This tripped us up early.
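
To see the difference concretely, here's the smallest possible bidiGenerateContent handshake: open the WebSocket, send a setup frame, read the ack. This assumes the v1beta WebSocket endpoint path and proto-JSON field casing, and the model string is again a placeholder:

```python
import asyncio
import json

import websockets  # pip install websockets

API_KEY = "YOUR_API_KEY"

# bidiGenerateContent is only reachable over WebSocket; there is no REST route.
URI = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent"
    f"?key={API_KEY}"
)

async def main():
    async with websockets.connect(URI) as ws:
        # The first frame must be a setup message naming the Live model.
        await ws.send(json.dumps({
            "setup": {
                "model": "models/gemini-3.1-flash-live",  # placeholder ID
                "generationConfig": {"responseModalities": ["AUDIO"]},
            }
        }))
        print(await ws.recv())  # server replies with a setupComplete frame

asyncio.run(main())
```

Point curl at this and you'll get nothing useful: the endpoint speaks WebSocket frames, not HTTP request/response.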


Step 0: Getting Your API Key (Google Cloud Console)

This is where most people get confused. Here's the exact flow:

  1. Go to console.cloud.google.com
  2. Create a project (or select existing one)
  3. Go to APIs & Services → Library → search "Generative Language API" → Enable it
  4. Go to APIs & Services → Credentials → Create Credentials → API Key (you'll use this key in the snippet below)
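
Once you have the key, keep it out of source control. The snippets above hard-code YOUR_API_KEY for brevity; in a real app, load it from an environment variable instead (GEMINI_API_KEY here is just a conventional name):

```python
import os

from google import genai

# Read the key from the environment rather than hard-coding it.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
```

For the raw WebSocket route, the same key goes in the ?key= query parameter, as shown in the handshake example earlier.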