How to build a real-time voice agent with Gemini and Google ADK

August 21, 2025
Ashwini Kumar

Solution Acceleration Architect

Neeraj Agrawal

Solution Acceleration Architect

Building advanced conversational AI has moved well beyond text.

We can now build real-time, voice-driven agents, but these systems need low-latency, two-way communication, real-time information retrieval, and the ability to handle complex tasks. This guide shows you how to build an intelligent, responsive voice agent using Gemini and the Google Agent Development Kit (ADK).

The foundational agent

First, we create an agent with a persona but no access to external tools. This is the simplest agent, relying only on its pre-trained knowledge. It's a great starting point.
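As a rough sketch, a tool-free agent built with the ADK Python package (google-adk) might look like the following. The agent name, model ID, and instruction text are illustrative assumptions rather than values from the original sample, and the import is deferred so the settings can be inspected without the package installed.

```python
# Settings for a tool-free agent. Every value here is a placeholder chosen
# for illustration, not taken from the original sample.
AGENT_SETTINGS = {
    "name": "voice_assistant",  # hypothetical name; must be a valid identifier
    "model": "gemini-2.0-flash-live-001",  # assumed ID; pick a model that supports live audio
    "instruction": (
        "You are a friendly voice assistant. "
        "Answer briefly and conversationally."
    ),
}


def build_basic_agent():
    """Create the agent. Requires `pip install google-adk`."""
    from google.adk.agents import Agent  # imported lazily on purpose

    # No `tools` argument: the agent relies only on its pre-trained knowledge.
    return Agent(**AGENT_SETTINGS)
```

Keeping the persona in a plain dictionary makes it easy to reuse the same settings once tools are added.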


This agent can chat, but it lacks access to external information.

The advanced agent

To make the agent useful, we add tools. This lets the agent access live data and services. In streaming_service.py, we give the agent access to Google Search and Google Maps.


A closer look at the tools

  • Google Search: This pre-built ADK tool lets your agent perform Google searches to answer questions about current events and real-time information.

  • MCP Toolset for Google Maps: This uses the Model Context Protocol (MCP) to connect your agent to a specialized server (in this case, one that understands the Google Maps API). The main agent acts as an orchestrator, delegating tasks it can't handle to specialist tools.
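The two tools described above could be wired up roughly as follows. This is a sketch under several assumptions: the Maps server is launched through npx using the community @modelcontextprotocol/server-google-maps package, the key is passed in the GOOGLE_MAPS_API_KEY environment variable, and the helper names are hypothetical. The exact MCPToolset connection parameters differ between ADK releases, so check the version you have installed.

```python
# Assumed launch command for an MCP server that wraps the Google Maps API.
MAPS_SERVER_COMMAND = "npx"
MAPS_SERVER_ARGS = ["-y", "@modelcontextprotocol/server-google-maps"]


def maps_server_env(api_key: str) -> dict[str, str]:
    """Build the environment passed to the Maps MCP server process."""
    if not api_key:
        raise ValueError("A Google Maps API key is required")
    return {"GOOGLE_MAPS_API_KEY": api_key}


def build_tools(api_key: str) -> list:
    """Return the tool list for the agent. Requires `pip install google-adk`."""
    from google.adk.tools import google_search
    from google.adk.tools.mcp_tool.mcp_toolset import (
        MCPToolset,
        StdioServerParameters,
    )

    # The toolset starts the MCP server as a subprocess and exposes its tools.
    maps_toolset = MCPToolset(
        connection_params=StdioServerParameters(
            command=MAPS_SERVER_COMMAND,
            args=MAPS_SERVER_ARGS,
            env=maps_server_env(api_key),
        )
    )
    return [google_search, maps_toolset]
```

The returned list would be passed as the `tools` argument when constructing the agent.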

Engineering a natural conversation

The RunConfig object defines how the agent communicates. It controls aspects like voice selection and streaming mode.


StreamingMode.BIDI (bi-directional) enables users to interrupt the agent, creating a more natural conversation.
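A minimal sketch of such a configuration is shown here. The voice name and the decision to request audio output with transcription are assumptions for illustration, and field availability may vary with the installed ADK version.

```python
VOICE_NAME = "Puck"  # assumed prebuilt voice; substitute any supported voice


def build_run_config():
    """Build a bi-directional streaming config. Requires `pip install google-adk`."""
    from google.adk.agents.run_config import RunConfig, StreamingMode
    from google.genai import types

    return RunConfig(
        # BIDI keeps both directions open so the user can interrupt the agent.
        streaming_mode=StreamingMode.BIDI,
        # Ask the model to answer with speech.
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name=VOICE_NAME
                )
            )
        ),
        # Also request a text transcript of what the agent says.
        output_audio_transcription=types.AudioTranscriptionConfig(),
    )
```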

The asynchronous core

Real-time voice chats require handling multiple tasks concurrently: listening, thinking, and speaking. Python's asyncio library and its TaskGroup manage these concurrent tasks.


Translating the agent's voice

The receive_service_responses task processes the agent's output before sending it to the user. This output includes audio and text transcription.

Handling audio

Audio is binary data, so it is Base64-encoded into a text string for transmission.
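The encoding step can be sketched with the standard library alone. The JSON message shape (the "type" and "data" keys) and the function names are assumptions for illustration, not the format used by the original sample.

```python
import base64
import json


def encode_audio_message(audio_bytes: bytes) -> str:
    """Wrap raw audio bytes in a JSON text message."""
    # b64encode returns bytes; decode to str so it can go inside JSON.
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"type": "audio", "data": payload})


def decode_audio_message(message: str) -> bytes:
    """Recover the raw audio bytes on the receiving side."""
    return base64.b64decode(json.loads(message)["data"])
```

Base64 inflates the payload by roughly a third, which is the price of sending binary audio over a text-only channel.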


Handling text

Text transcription is streamed for real-time feedback.
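One way to picture this is a generator that forwards each transcript fragment as soon as it arrives. The (text, is_final) input shape and the JSON message keys are hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable, Iterator


def transcript_messages(chunks: Iterable[tuple[str, bool]]) -> Iterator[str]:
    """Yield one JSON text message per non-empty transcript fragment."""
    for text, is_final in chunks:
        if not text:
            continue  # skip empty fragments instead of sending blank updates
        # "final" lets the client tell a partial update from a completed line.
        yield json.dumps({"type": "text", "data": text, "final": is_final})
```

Because it is a generator, each message can be sent as soon as it is produced instead of waiting for the full transcript.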

