How to build a real-time voice agent with Gemini and Google ADK
Ashwini Kumar
Solution Acceleration Architect
Neeraj Agrawal
Solution Acceleration Architect
Conversational AI has moved well beyond text.
We can now use AI to create real-time, voice-driven agents. These systems, however, need low-latency, two-way communication, real-time information retrieval, and the ability to handle complex tasks. This guide shows you how to build an intelligent, responsive voice agent using Gemini and the Google Agent Development Kit (ADK).
The foundational agent
First, we create an agent with a persona but no access to external tools. This is the simplest agent, relying only on its pre-trained knowledge. It's a great starting point.
This agent can chat, but it lacks access to external information.
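A minimal sketch of such a foundational agent, assuming the google-adk package is installed. The agent name, model ID, and instruction text below are illustrative assumptions, not values from this article; swap in a live-capable Gemini model available to your project.

```python
# A foundational ADK agent: a persona, but no tools.
# Model name and instruction are illustrative assumptions.
from google.adk.agents import Agent

root_agent = Agent(
    name="voice_assistant",                 # hypothetical agent name
    model="gemini-2.0-flash-live-001",      # assumed live-capable model ID
    instruction=(
        "You are a friendly, concise voice assistant. "
        "Answer from your own knowledge; you have no external tools."
    ),
)
```

Because no `tools` argument is passed, the agent can only draw on the model's pre-trained knowledge.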
The advanced agent
To make the agent useful, we add tools. This lets the agent access live data and services. In streaming_service.py, we give the agent access to Google Search and Google Maps.
A closer look at the tools
- Google Search: This pre-built ADK tool lets your agent perform Google searches to answer questions about current events and real-time information.
- MCP Toolset for Google Maps: This uses the Model Context Protocol (MCP) to connect your agent to a specialized server (in this case, one that understands the Google Maps API). The main agent acts as an orchestrator, delegating tasks it can't handle itself to specialist tools.
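A hedged sketch of how these two tools might be wired into the agent. The import paths, the Maps MCP server command, and the environment variable name are assumptions based on the public ADK and MCP documentation, and they vary between ADK versions; verify them against your installed release.

```python
# Sketch of the tool-equipped agent described for streaming_service.py.
# Import paths and the MCP server command are assumptions; check your
# ADK version, as the MCP connection classes have been renamed over time.
import os

from google.adk.agents import Agent
from google.adk.tools import google_search
from google.adk.tools.mcp_tool.mcp_toolset import MCPToolset, StdioServerParameters

# Launch the Google Maps MCP server as a subprocess and expose its tools.
maps_toolset = MCPToolset(
    connection_params=StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-google-maps"],
        env={"GOOGLE_MAPS_API_KEY": os.environ["GOOGLE_MAPS_API_KEY"]},
    )
)

root_agent = Agent(
    name="voice_assistant",                 # hypothetical agent name
    model="gemini-2.0-flash-live-001",      # assumed live-capable model ID
    instruction=(
        "Use Google Search for current events and the Maps tools "
        "for places and directions."
    ),
    tools=[google_search, maps_toolset],
)
```

Note that ADK places restrictions on combining the built-in Google Search tool with other tools on some models, so confirm the combination is supported by the model you choose.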
Engineering a natural conversation
The RunConfig object defines how the agent communicates. It controls aspects like voice selection and streaming mode.
StreamingMode.BIDI (bi-directional) enables users to interrupt the agent, creating a more natural conversation.
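A configuration sketch of such a RunConfig. The field names follow the public ADK and google-genai types, but the voice name "Aoede" and the exact combination of fields are assumptions for illustration.

```python
# Hedged sketch of the RunConfig controlling voice and streaming mode.
# Field names follow public ADK types; the voice name is an assumption.
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,   # two-way audio; users can interrupt
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Aoede")
        )
    ),
    response_modalities=["AUDIO"],       # have the model reply with speech
    # Also stream a text transcript of the spoken reply.
    output_audio_transcription=types.AudioTranscriptionConfig(),
)
```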
The asynchronous core
Real-time voice chat requires handling several tasks concurrently: listening to the user, generating a response, and speaking it back. Python's asyncio, and in particular asyncio.TaskGroup (Python 3.11+), manages these concurrent tasks and cancels the rest if any one of them fails.
Translating the agent's voice
The receive_service_responses task processes the agent's output before sending it to the user. This output includes both the agent's audio and a text transcription of what it says.
Handling audio
The agent's audio output is raw binary data, so it is Base64-encoded to convert it into a text string that can be transmitted safely, for example inside a JSON websocket message.
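A minimal, self-contained demonstration of that step, using only the standard library. The JSON message shape is illustrative, not the article's wire format.

```python
# Base64 turns raw PCM bytes into an ASCII string that can ride inside a
# JSON text frame, and the round trip is lossless. Message shape is assumed.
import base64
import json

def encode_audio_message(pcm_chunk: bytes) -> str:
    """Wrap a raw audio chunk in a JSON text frame (shape is illustrative)."""
    return json.dumps({
        "mime_type": "audio/pcm",
        "data": base64.b64encode(pcm_chunk).decode("ascii"),
    })

def decode_audio_message(frame: str) -> bytes:
    """Recover the original bytes from the text frame."""
    return base64.b64decode(json.loads(frame)["data"])

chunk = bytes(range(16))                      # stand-in for 16 bytes of PCM
frame = encode_audio_message(chunk)
assert decode_audio_message(frame) == chunk   # round trip is lossless
```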
Handling text
Text transcription is streamed as it is generated, giving the user real-time visual feedback while the agent speaks.