Streaming Video Instruction Tuning

Abstract: We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, taking a step toward unified, intelligent video understanding in continuous video streams.


Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2512.21334 [cs.CV]
	(or arXiv:2512.21334v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2512.21334 arXiv-issued DOI via DataCite

Submission history

From: Jiaer Xia
[v1] Wed, 24 Dec 2025 18:59:36 UTC (21,950 KB)
