
Introduction

Voice AI has made incredible progress in recent years. Large Language Models can reason, summarize, and solve complex problems with impressive accuracy. However, when it comes to real-time voice conversations, most enterprise voice bots still feel robotic.

One of the biggest problems is latency.

In natural human conversations, pauses longer than 300 milliseconds start to feel awkward. Unfortunately, many modern voice assistants still take 1–2 seconds to respond after a user finishes speaking.

The MNK Voice Agent Suite was designed to solve this problem.

Our goal was to build a full-duplex conversational AI system capable of holding natural phone conversations with sub-300ms response latency, while also providing enterprise-grade transparency and cost monitoring.

The Problem with Traditional Voice Bots

Despite improvements in AI reasoning, most enterprise voice systems suffer from three major issues:

1. Conversational Latency

Most systems wait for a full speech transcript before sending it to an LLM. This introduces delays that break conversational flow.

2. Interrupt Handling

Traditional voice assistants cannot handle barge-ins well. If a user interrupts the AI, the system often continues speaking or crashes.

3. Cost Transparency

Many companies using AI voice systems struggle with “black box billing.”
Costs from Speech-to-Text, LLM inference, and Text-to-Speech providers are difficult to track in real time.

The MNK Voice Agent Suite addresses these issues through a streaming-first architecture.

What is the MNK Voice Agent Suite?

The MNK Voice Agent Suite is a real-time conversational AI platform designed for enterprise telephony environments.

It enables AI agents to conduct fluid, human-like conversations over phone calls using a combination of advanced speech models and large language models.

The platform integrates three main components:

  • Deepgram for real-time Speech-to-Text

  • Gemini 2.0 Flash for reasoning and response generation

  • ElevenLabs for natural voice synthesis

Together, these components create a conversational loop that feels fast, responsive, and human-like.

Core Features

Hyper-Realistic Voice Conversations

The system uses Gemini 2.0 Flash as the reasoning engine. Gemini processes the conversation context and generates natural responses while maintaining awareness of the ongoing dialogue.

Responses are converted to speech using ElevenLabs' expressive voice synthesis, allowing the AI to speak with natural pauses and emotional tone.

Sub-300ms Response Latency

One of the key goals of the project was to break the 300-millisecond latency barrier.

By optimizing streaming pipelines and WebSocket buffers, the system achieves a Time To First Byte (TTFB) of around 280 milliseconds.

This creates the illusion of instant conversational response, similar to human interactions.

Smart Barge-In Handling

A natural conversation requires the ability to interrupt and redirect dialogue.

The MNK Voice Agent uses Voice Activity Detection (VAD) to monitor incoming audio streams. When the user begins speaking while the AI is talking, the system immediately stops its speech output and shifts to listening mode.

This prevents awkward overlaps and keeps the conversation natural.
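The barge-in logic described above can be sketched as a small state machine: while the agent is speaking, a VAD callback cancels the in-flight TTS playback task and flips the agent back to listening. This is an illustrative sketch, not the project's actual implementation; the class and method names are hypothetical.

```python
import asyncio
from enum import Enum, auto


class AgentState(Enum):
    LISTENING = auto()
    SPEAKING = auto()


class BargeInController:
    """Stops TTS playback as soon as the caller starts speaking (sketch)."""

    def __init__(self):
        self.state = AgentState.LISTENING
        self._tts_task: asyncio.Task | None = None

    def start_speaking(self, tts_coro):
        """Begin playing a TTS stream as a cancellable background task."""
        self.state = AgentState.SPEAKING
        self._tts_task = asyncio.ensure_future(tts_coro)

    def on_vad_speech_start(self):
        """Called by the VAD when inbound user speech is detected."""
        if self.state is AgentState.SPEAKING and self._tts_task:
            self._tts_task.cancel()  # cut the AI off mid-sentence
            self._tts_task = None
        self.state = AgentState.LISTENING
```

The key design point is that TTS playback runs as a task the VAD handler can cancel at any moment, rather than as a blocking call.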

Human-in-the-Loop Support

Even advanced AI systems sometimes struggle with complex customer situations.

To address this, the platform includes a live monitoring dashboard that shows real-time call transcripts.

If the AI encounters difficulty, a human operator can click “Take Over”, instantly replacing the AI voice with their own without disrupting the call.

The transition is designed to be seamless and invisible to the caller.

Real-Time Cost Tracking

AI voice systems rely on multiple external APIs, each with different billing models.

The MNK Voice Agent Suite includes a granular cost auditing system that calculates the exact cost of each conversation using the formula:

C_total = C_STT(t) + C_LLM(tokens_in + tokens_out) + C_TTS(characters)

This approach allows businesses to see precisely how much each call costs per minute, improving financial transparency and system optimization.
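The cost formula above translates directly into code. The per-unit rates below are placeholders for illustration only, not the providers' actual pricing:

```python
# Illustrative per-provider rates (hypothetical, NOT actual pricing).
STT_PER_MIN = 0.0059        # $/audio minute (Deepgram-style)
LLM_PER_1K_TOKENS = 0.0003  # $/1K tokens, input + output (Gemini-style)
TTS_PER_1K_CHARS = 0.030    # $/1K characters (ElevenLabs-style)


def call_cost(audio_minutes: float, tokens_in: int, tokens_out: int,
              tts_characters: int) -> float:
    """C_total = C_STT(t) + C_LLM(tokens_in + tokens_out) + C_TTS(characters)."""
    c_stt = audio_minutes * STT_PER_MIN
    c_llm = (tokens_in + tokens_out) / 1000 * LLM_PER_1K_TOKENS
    c_tts = tts_characters / 1000 * TTS_PER_1K_CHARS
    return round(c_stt + c_llm + c_tts, 6)
```

Summing the three terms per call (and dividing by call duration) yields the per-minute figure the dashboard surfaces.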

System Architecture

Instead of relying on traditional REST APIs, the MNK Voice Agent Suite uses a fully streaming architecture.

The architecture consists of several key components.

Orchestrator Layer

A FastAPI-based middleware acts as the central orchestrator. Running on Google Cloud Run, it manages WebSocket connections from telephony systems such as Twilio Media Streams.

This layer coordinates communication between speech recognition, the LLM, and speech synthesis.

Speech Recognition (Deepgram)

Incoming audio is streamed directly to Deepgram Nova-2, which performs real-time speech recognition.

The system uses 200ms endpoint detection, allowing the AI to quickly detect when the user has finished speaking.
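The endpointing behavior is configured when opening the Deepgram live-streaming connection. The parameter names below follow Deepgram's documented live API; the specific values are assumptions matching the settings described above:

```python
from urllib.parse import urlencode

DEEPGRAM_WS = "wss://api.deepgram.com/v1/listen"

params = {
    "model": "nova-2",
    "encoding": "mulaw",       # Twilio Media Streams sends 8 kHz mu-law audio
    "sample_rate": 8000,
    "endpointing": 200,        # flag end-of-speech after 200 ms of silence
    "interim_results": "true", # stream partial transcripts as they form
}

url = f"{DEEPGRAM_WS}?{urlencode(params)}"
```

The orchestrator would open a WebSocket to this URL (with an Authorization header carrying the API key) and relay raw audio chunks into it.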

AI Reasoning (Gemini 2.0 Flash)

Transcribed text is sent to Gemini 2.0 Flash using the Vertex AI streaming SDK.

Gemini generates responses token-by-token, allowing downstream systems to begin generating speech before the full response is complete.

Speech Generation (ElevenLabs)

The generated text is streamed to ElevenLabs Turbo v2.5, which converts text into expressive speech.

Instead of waiting for complete sentences, the system streams small text chunks, allowing speech to begin almost immediately.
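The chunking step between the LLM and TTS can be sketched as a generator that buffers streamed tokens and flushes at sentence-ish boundaries or a size cap. This is an illustrative sketch of the technique, with hypothetical names and thresholds:

```python
import re


def chunk_for_tts(token_stream, max_chars: int = 60):
    """Group streamed LLM tokens into small, speakable chunks (sketch).

    Flushes a chunk at punctuation boundaries (. ! ? ,) or once the
    buffer exceeds max_chars, so TTS can start speaking long before
    the full LLM response is complete.
    """
    buf = ""
    for token in token_stream:
        buf += token
        if re.search(r"[.!?,]\s*$", buf) or len(buf) >= max_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Each yielded chunk would be sent straight to the ElevenLabs streaming endpoint, which is what lets speech begin almost immediately.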

Live Dashboard

A monitoring dashboard built with Next.js 15 provides real-time insights into active calls.

The system uses Server Actions to fetch call states from AlloyDB, enabling near-instant UI updates without heavy client-side polling.

Key Engineering Challenges

Managing the Latency Budget

To maintain realistic conversations, the entire system had to remain within a 300ms latency budget.

Typical breakdown:

  • Network round trip: ~50 ms

  • Speech recognition: ~15 ms

  • LLM first token: ~150 ms

  • Speech generation: ~80 ms

Total: ~295 ms

Any network jitter could break the illusion of real-time interaction.

To mitigate this, we implemented speculative filler phrases such as “Hmm…” or “Let me check that,” buying time during occasional latency spikes.
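The speculative-filler idea boils down to racing the LLM's first token against the latency budget: if the token misses the deadline, play a filler and keep waiting. A minimal asyncio sketch of that pattern (names and thresholds are illustrative, not the production code):

```python
import asyncio

FILLERS = ["Hmm…", "Let me check that."]


async def respond_with_filler(first_token: asyncio.Future,
                              play, threshold_s: float = 0.3):
    """Play a filler phrase if the LLM's first token misses the budget."""
    try:
        # shield() keeps the underlying future alive if the timeout fires.
        return await asyncio.wait_for(asyncio.shield(first_token), threshold_s)
    except asyncio.TimeoutError:
        play(FILLERS[0])          # buy time during the latency spike
        return await first_token  # then continue with the real response
```

On the happy path the filler never plays; only a spike past the threshold triggers it, which is why the technique is effectively free.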

Echo Cancellation

One challenge was preventing the system from transcribing its own voice output.

Without proper suppression, the AI would hear itself speaking and generate responses to its own words.

The solution was implementing strict echo suppression logic in the orchestrator layer, temporarily ignoring incoming audio while the TTS stream is active.
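One way to implement that suppression is a time-based gate: each outbound TTS chunk extends a "mute window" by its playback duration (plus a small tail for trailing echo), and inbound audio is only forwarded to STT outside that window. A hedged sketch of the idea, with hypothetical names:

```python
import time


class EchoGate:
    """Drops inbound audio while (and shortly after) the agent speaks."""

    def __init__(self, tail_s: float = 0.25):
        self.tail_s = tail_s        # grace period for trailing echo
        self._speaking_until = 0.0  # monotonic timestamp, 0 = not speaking

    def tts_chunk_sent(self, duration_s: float):
        """Extend the mute window by the playback length of the chunk."""
        base = max(time.monotonic(), self._speaking_until)
        self._speaking_until = base + duration_s

    def should_transcribe(self) -> bool:
        """True only when the agent's own audio cannot leak into STT."""
        return time.monotonic() > self._speaking_until + self.tail_s
```

Note this gate must cooperate with barge-in: a strong VAD signal can still override the mute window, otherwise the user could never interrupt.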

Conversation State Management

Because Cloud Run is stateless, maintaining conversation history across WebSocket reconnections was challenging.

We solved this by storing conversation context in Redis, ensuring the AI retains memory of the conversation even if connections are interrupted.
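The store itself only needs get/set semantics keyed by call ID. The sketch below keeps the client pluggable: in production it would be a `redis.Redis` instance, while the dict-backed stub here stands in for it; all names are illustrative.

```python
import json


class ConversationStore:
    """Keeps conversation history outside the stateless Cloud Run process.

    `client` needs only get/set; production would pass a redis.Redis
    instance, the DictClient below is a stand-in for this sketch.
    """

    def __init__(self, client):
        self.client = client

    def append_turn(self, call_id: str, role: str, text: str):
        history = self.load(call_id)
        history.append({"role": role, "text": text})
        self.client.set(f"conv:{call_id}", json.dumps(history))

    def load(self, call_id: str) -> list:
        raw = self.client.get(f"conv:{call_id}")
        return json.loads(raw) if raw else []


class DictClient:
    """Minimal in-memory stand-in for redis.Redis (get/set only)."""

    def __init__(self):
        self._d = {}

    def get(self, key):
        return self._d.get(key)

    def set(self, key, value):
        self._d[key] = value
```

Because history lives behind the key rather than in process memory, a new WebSocket connection (or a fresh Cloud Run instance) can reload the full context and continue the conversation.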

Achievements

The MNK Voice Agent Suite achieved several important milestones:

  • Stable ~280ms response latency

  • Seamless AI-to-human call handoff

  • Detailed real-time cost tracking

  • Full streaming conversational pipeline

These improvements significantly enhance the usability of AI voice agents in enterprise environments.

Key Lessons

Developing a real-time voice agent revealed several important insights.

Streaming is essential. Real-time voice AI cannot function effectively using traditional request-response APIs.

Voice Activity Detection is critical. The ability to know when to stop speaking matters more than voice quality itself.

Prompt design must be voice-friendly. LLMs tend to generate structured text like bullet points, which sounds unnatural when spoken aloud. Prompts must encourage concise, conversational language.

Future Work

The MNK Voice Agent Suite will continue evolving with several planned features.

Multimodal Vision Support

Future versions will allow users to show the AI a video feed during support calls using Gemini’s vision capabilities.

Emotion-Adaptive Voice

The system will analyze user sentiment and dynamically adjust the voice tone to match the emotional context of the conversation.

Large-Scale Outbound Campaigns

The platform will be expanded to support thousands of simultaneous outbound calls for automated appointment scheduling and customer outreach.

Conclusion

Real-time conversational AI represents the next major step in human-computer interaction.

By combining streaming speech recognition, fast LLM reasoning, and expressive voice synthesis, the MNK Voice Agent Suite demonstrates how AI systems can move beyond scripted responses and begin to participate in natural, fluid conversations.

Breaking the 300ms latency barrier is not just a technical milestone—it is a crucial step toward making AI communication feel truly human.