Intro
In today’s hyper-paced tech landscape, new frameworks drop almost daily. The real challenge isn’t just keeping up—it’s deciding which tools actually deserve your "deep dive" time.
If you’ve been hearing buzz about Strands Agents and their new bidirectional streaming (the BidiAgent), this guide is for you. I’ll break down what this feature is in simple terms, explore real-world examples like real-time voice assistants, and honestly weigh the disadvantages. By the end, you’ll know exactly if this is the right fit for your next high-concurrency project.
Strands Agents
Strands Agents is an open-source SDK developed by AWS that simplifies building scalable AI agents. While it’s born from the AWS ecosystem, it isn’t a "walled garden." You aren’t restricted to Amazon Bedrock; the framework is fully model-agnostic, meaning you can integrate it with other cloud providers or even run it locally using Ollama.
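As a quick illustration of that flexibility, here is a minimal sketch of a plain, text-based Strands agent pointed at a local Ollama model instead of Bedrock. Treat the exact import path and parameter names (OllamaModel, host, model_id) as assumptions to verify against the current SDK docs.

from strands import Agent
from strands.models.ollama import OllamaModel  # assumed import path for the Ollama provider

# Point the agent at a locally running Ollama server instead of Amazon Bedrock
local_model = OllamaModel(
    host="http://localhost:11434",  # default Ollama endpoint
    model_id="llama3",              # any model you have pulled locally
)

agent = Agent(model=local_model)
agent("In one sentence, what can you help me with?")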
Strands Agents Bidirectional Streaming (Experimental)
Typically, we interact with AI through a "ping-pong" text exchange: you send a message, wait, and the agent replies. The new Bidirectional Streaming feature (currently in experimental mode) completely flips this script.
Imagine a conversation that feels... well, human. By leveraging full-duplex communication, you can now interact with agents via Voice in real-time.
Unlike traditional setups that chain together separate Speech-to-Text and Text-to-Speech models (an approach that often feels laggy), Strands uses native speech-to-speech models like Amazon Nova Sonic. This significantly reduces latency and cost, allowing the agent to "listen" and "speak" simultaneously. The result? You can finally interrupt your AI assistant (a feature called "barge-in") just like you would a friend in a natural conversation.
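To make the contrast concrete, here is a rough sketch of the two interaction styles side by side. The BidiAgent pieces mirror the working example later in this post; anything beyond that is an assumption about the experimental API.

import asyncio
from strands import Agent
from strands.experimental.bidi import BidiAgent, BidiAudioIO
from strands.experimental.bidi.models import BidiNovaSonicModel

# Classic "ping-pong": send a full message, wait, get a full reply
text_agent = Agent()
text_agent("Where is the food court?")

# Bidirectional: microphone audio streams in while speech streams out,
# so the user can barge in mid-response
async def talk():
    audio_io = BidiAudioIO()
    bidi_agent = BidiAgent(model=BidiNovaSonicModel())
    await bidi_agent.run(inputs=[audio_io.input()], outputs=[audio_io.output()])

asyncio.run(talk())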
Use Cases
Where does bidirectional streaming move from "cool tech" to "essential tool"? Here are two high-impact use cases where the Strands BidiAgent transforms a frustrating task into a seamless conversation.
Use Case 1: The Parking Location Assistant
The Problem: The "Machine Interface" Barrier We’ve all been there: wandering through a massive, multi-level parking garage, completely forgetting where we left the car. While some high-end malls have digital kiosks, the experience is often frustrating. You have to find the machine, navigate a clunky touchscreen UI, and manually type in your license plate. It’s a cold, mechanical interaction that requires you to stop what you’re doing and "talk" to a computer on its terms.
The Solution: A Conversational Environment
Imagine instead a world where you don’t have to look for a screen or navigate a menu. Because of bidirectional streaming, you can simply speak to the system as if a helpful concierge were standing right next to you.
The interaction is fluid, real-time, and—most importantly—doesn’t feel like a transaction with a machine.
The Conversation:
User: "I’m completely lost—can you help me find my car?" System: "I can certainly try! What is your license plate number?" User: "It’s EHU 62E." System: "Got it. You’re actually on the opposite side of the mall, so you have a bit of a walk ahead of you. Take the elevator to your right, then—" User: (Interrupting) "Wait, the elevator near the Starbucks?" System: "Exactly! Go past the Starbucks, turn left at the exit, and your car will be the fifth one on your right."
Why this is a game-changer: In a traditional AI setup, the system would have to finish its long set of directions before you could ask for a clarification. With the Strands BidiAgent, the system "hears" your interruption instantly and pivots the conversation. It transforms a rigid database query into a helpful, human-like interaction.
How you choose to bring this conversation to life—whether through physical installations, integrated audio, or custom hardware—is where the real innovation happens.
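If you wanted to prototype the tool side of this with Strands, it could look roughly like the sketch below. It reuses the @tool pattern from the working example later in this post; the find_car_by_plate name and the in-memory plate registry are hypothetical stand-ins for whatever license-plate-recognition system the garage actually runs.

from strands import tool

@tool
def find_car_by_plate(plate: str) -> str:
    """
    Look up where a license plate was last seen in the garage.

    Args:
        plate: The license plate number, e.g. "EHU 62E"

    Returns:
        Walking directions to the car, or a not-found message.
    """
    # Hypothetical stand-in for the garage's license-plate-recognition system
    sightings = {
        "EHU 62E": "Level 3, bay 42. Take the elevator near the Starbucks, "
                   "turn left at the exit, and it's the fifth car on your right.",
    }
    key = plate.strip().upper()
    if key in sightings:
        return sightings[key]
    return "I couldn't find that plate. Could you repeat it slowly?"

From there, the agent would be wired up exactly like the mall-directory example below, with this tool passed into the BidiAgent’s tools list.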
Use Case 2: The Interactive Mall Directory
The Problem: The "Static Maze" Most malls still rely on static or semi-interactive "You Are Here" boards. Navigating these feels like using a paper map in a world where we expect GPS. You have to find your orientation, scan a list of 200 shops, and then mentally map out a path. It’s high-friction and often leads to more confusion.
The Solution: Upgrading the Edge with BidiAgents
Let’s take a Brownfield approach. Instead of replacing the entire board, we "upgrade" it into 2026. By installing a System on a Chip (SoC) like a Jetson Nano and adding a simple microphone/speaker array, we can transform that static board into a voice-first assistant powered by a Strands BidiAgent.
The Conversation:
User: "I’m looking for Starbucks... I think it’s on the third floor?" System: (Responding instantly while the user pauses) "Actually, you’re in luck! It’s much closer. Just take the elevator to your right to the second floor." User: "Wait, the one near the fountain?" System: "Exactly. Once you step out, turn left and it’ll be right in front of you."
Why this wins: Running a Speech-to-Speech (S2S) model at the edge, or as a low-latency stream, turns the "map" into a proactive guide. It eliminates the need for touchscreens (more hygienic and accessible) and provides a "human-first" interface in a machine-driven environment.
I’m not just talking about what could be done. I’ve actually built a prototype to show you how this looks in the real world. Below is a working example of this exact use case in action, leveraging Strands and a bidirectional audio loop.
Working Example: Shop Location Assistant
import asyncio

from strands.experimental.bidi import BidiAgent, BidiAudioIO
from strands.experimental.bidi.io import BidiTextIO
from strands.experimental.bidi.models import BidiNovaSonicModel
from strands import tool
from strands_tools import calculator, current_time

# Create a bidirectional streaming model
model = BidiNovaSonicModel()

# Define a custom tool
@tool
def get_shop_location(shop: str) -> str:
    """
    Get the directions to a shop in this mall.

    Args:
        shop: Name of the shop to locate

    Returns:
        A string with the instructions to find the shop.
    """
    print("get_shop_location called with shop:", shop)
    # In a real application, call the location API that returns these instructions
    locations = {
        "starbucks": "Take the elevator at your right, go to the second floor and turn left; in that hall you will find it on the right.",
        "apple store": "Go straight ahead from the main entrance, take the escalator to the first floor, and it's on your left.",
        "food court": "Head to the center of the mall, take the stairs to the third floor, and you'll see it right in front of you.",
        "bookstore": "From the main entrance, turn right and walk past the clothing stores; it's next to the toy store."
    }
    if shop.lower() in locations:
        print("Found location for shop:", shop)
        return locations[shop.lower()]
    else:
        return "Sorry, we don't have that shop in the mall."

# Create the agent
agent = BidiAgent(
    model=model,
    tools=[calculator, current_time, get_shop_location],
    system_prompt="You are a mall assistant that helps people find any shop in the mall. Keep responses concise and natural."
)

# Set up audio I/O for the microphone and speakers, plus a text transcript
audio_io = BidiAudioIO()
text_io = BidiTextIO()

# Run the conversation
async def main():
    await agent.run(
        inputs=[audio_io.input()],
        outputs=[audio_io.output(), text_io.output()],
    )

asyncio.run(main())
You’ll notice that I’ve hardcoded the shop locations within the tool. In this specific use case, that’s actually a strategic choice. Since the physical map is in a fixed location and store directories don’t change daily, hardcoding provides 100% accuracy and near-zero latency.

While you could connect this to an external API or use a RAG (Retrieval-Augmented Generation) approach with a digital map, those methods often increase costs and introduce the risk of the model "hallucinating" directions. For a high-traffic mall environment, a simple, local "source of truth" is often the most robust solution.
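That said, if your directory does change often, the swap is contained to the tool body. Here’s a hedged sketch of what calling an external location service might look like; the endpoint URL and response shape are invented for illustration, so adapt them to whatever service you actually run.

import requests
from strands import tool

@tool
def get_shop_location(shop: str) -> str:
    """Fetch walking directions for a shop from a (hypothetical) mall directory API."""
    try:
        # Hypothetical internal endpoint; replace with your real directory service
        resp = requests.get(
            "http://localhost:8080/api/shops",
            params={"name": shop.lower()},
            timeout=2,  # keep the voice loop snappy; fail fast if the API is slow
        )
        resp.raise_for_status()
        return resp.json().get("directions", "Sorry, we don't have that shop in the mall.")
    except requests.RequestException:
        return "Sorry, I can't reach the directory right now. Please try again in a moment."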
Demo Video:
The Challenge: Tackling Environmental Noise
No experimental feature is without its "growing pains." During my testing, I hit a significant hurdle: white noise sensitivity.
Because bidirectional streaming is designed for natural conversation, it’s constantly listening for a "barge-in" (when a user interrupts the AI). I found that if my computer fan kicked in to cool the processor, the built-in microphone picked up that hum as an interruption. This caused the agent to stop mid-sentence, thinking I was trying to speak.
Technical Note on Nova Sonic Versions:
Current State: At the time of writing, the Strands integration is primarily optimized for nova-sonic-v1. This version currently lacks granular settings to adjust the "interruption threshold."
The Future: The upcoming nova-sonic-v2 promises better configurations for noise suppression and sensitivity.
For real-world deployments—like our mall assistant—the path forward involves either using high-quality directional microphones or waiting for the broader integration of Nova Sonic v2 to tune out the hum. Alternatively, if you need that granular control today, you might explore other providers like OpenAI, which already offer adjustable sensitivity settings for their real-time voice models.
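For reference, OpenAI’s Realtime API exposes that kind of knob through its server-side voice activity detection settings. A rough sketch of the relevant session configuration is below; field names and defaults can shift as the API evolves, so verify them against the current Realtime docs.

# Rough sketch of an OpenAI Realtime session config with adjustable turn detection.
# Field names follow the documented "server_vad" settings; verify against current docs.
session_config = {
    "turn_detection": {
        "type": "server_vad",
        "threshold": 0.7,            # higher = less sensitive, so fan hum is less likely to trigger a barge-in
        "prefix_padding_ms": 300,    # audio kept from just before speech is detected
        "silence_duration_ms": 700,  # how long a pause must last before the turn ends
    }
}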
Conclusion: The Future is Conversational
We are moving away from a world of clicking buttons and toward a world of natural dialogue. While bidirectional streaming is still in its experimental phase—as seen with the current sensitivity challenges—the potential to humanize technology is immense. From smarter mall directories to interactive industrial assistants, the transition from "Text-In/Text-Out" to "Live Conversation" is the next frontier of the tech industry.
The complexity of these implementations, especially when dealing with hardware like a Jetson Nano or tuning model sensitivity, is where the real work begins. If you’re curious about how to bring this "human experience" to your specific hardware or project, let’s talk. I’m actively exploring these architectures and would love to help you navigate the nuances of building your next intelligent agent.