Open Vision Agents by Stream. Build Vision Agents with any model/ video provider.

Open Vision Agents by Stream

Build Vision Agents quickly with any model or video provider.

Video AI: Built for real-time video AI. Combine Yolo, Roboflow and others with gemini/openai realtime
Low Latency: Join quickly (500ms) and low audio/video latency (30ms)
Open: Built by Stream, but use any video edge network that you like
Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate) and Claude (create message). So you can always use the latest LLM capabilities.
SDKs: SDKs for React, Android, iOS, Flutter, React, React Native and Unity.

Created by Stream, uses Stream’s edge network for ultra-low latency.

Examples

Sports Coaching

This example shows you how to build golf coach…

Open Vision Agents by Stream

Build Vision Agents quickly with any model or video provider.

Video AI: Built for real-time video AI. Combine Yolo, Roboflow and others with gemini/openai realtime
Low Latency: Join quickly (500ms) and low audio/video latency (30ms)
Open: Built by Stream, but use any video edge network that you like
Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate) and Claude (create message). So you can always use the latest LLM capabilities.
SDKs: SDKs for React, Android, iOS, Flutter, React, React Native and Unity.

Created by Stream, uses Stream’s edge network for ultra-low latency.

Examples

Sports Coaching

This example shows you how to build golf coaching AI with YOLO and OpenAI realtime. Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases. For example: Drone fire detection, sports/video game coaching, physical therapy, workout coaching, just dance style games etc.

# partial example, full example: examples/02_golf_coach_example/golf_coach_example.py
agent = Agent(
edge=getstream.Edge(),
agent_user=agent_user,
instructions="Read @golf_coach.md",
llm=openai.Realtime(fps=10),
#llm=gemini.Realtime(fps=1), # Careful with FPS can get expensive
processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")],
)

This example shows you how to build golf coaching AI with YOLO and OpenAI realtime. Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases. For example: Drone fire detection. Sports/video game coaching. Physical therapy. Workout coaching, Just dance style games etc.

Cluely style Invisible Assistant (coming soon)

Apps like Cluely offer realtime coaching via an invisible overlay. This example shows you how you can build your own invisible assistant. It combines Gemini realtime (to watch your screen and audio), and doesn’t broadcast audio (only text). This approach is quite versatile and can be used for: Sales coaching, job interview cheating, physical world/ on the job coaching with glasses

Demo video

agent = Agent(
edge=StreamEdge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
agent_user=agent_user,  # the user object for the agent (name, image etc)
instructions="You are silently helping the user pass this interview. See @interview_coach.md",
# gemini realtime, no need to set tts, or sst (though that's also supported)
llm=gemini.Realtime()
)

Processors

Processors enable you to provide state and receive/publish video & audio. Many video AI use case require you to do things like

Run a smaller AI model next to the LLM (like Yolo or roboflow)
Make API calls to maintain relevant info/game state
Modify audio/video, for instance avatars
Capture audio/video

This is all handled by processors.

Docs

To get started with Vision Agents, check out our getting started guide at VisionAgents.ai.

Quickstart: Building a Voice AI app
Quickstart: Building a Video AI app
Tutorial: Building realtime sports coaching
Tutorial: Building a realtime meeting assistant

Development

See DEVELOPMENT.md

Awesome Video AI

Our favorite people & projects to follow for vision AI

https://x.com/demishassabis. CEO google deepmind, won a nobel prize
https://x.com/OfficialLoganK. Product lead gemini, posts about robotics vision
https://x.com/ultralytics. various fast vision AI models. Pose, detect objects, segment, classify etc.
https://x.com/skalskip92. roboflow open source lead
https://x.com/moondreamai. the tiny vision model that could
https://x.com/kwindla. pipecat/daily
https://x.com/juberti. head of realtime AI openai
https://x.com/romainhuet head of developer experience openAI
https://x.com/thorwebdev eleven labs
https://x.com/mervenoyann huggingface, quite some posts about Video AI

Inspiration

Livekit Agents: Great syntax, Livekit only
Pipecat: Flexible, but more verbose.
OpenAI Agents: Focused on openAI only

Open Platform

Reach out to nash@getstream.io, and we’ll collaborate on getting you added We’d like to add support for and are reaching out to:

Mediasoup
Janus
Cloudflare
Twilio
AWS IVS
Vonage
And others.

Roadmap

0.1 - First release

Support for >10 out of the box integrations
Support for video processors
Native Stream Chat integration for memory
Support for MCP and function calling for Gemini and OpenAI
Support for realtime WebRTC video and voice with GPT Realtime

Coming Soon

The Python WebRTC lib we use has some limitations. Investigating this.
Hosting & production deploy example
More built-in Yolo processors: Object detection, person detection, etc
Roboflow support
Computer use support
AI avatar support. Tavus etc
QWen3 vision support
Buffered video capture support (enabling AI to capture video when something exciting happens)
Moondream vision

Open Vision Agents by Stream

Examples

Sports Coaching

Open Vision Agents by Stream

Examples

Sports Coaching

Cluely style Invisible Assistant (coming soon)

Processors

Docs

Development

Awesome Video AI

Inspiration

Open Platform

Roadmap

Similar Posts