ChronusOmni: Improving Time Awareness of Omni Large Language Models

Abstract: Time awareness is a fundamental ability of omni large language models, especially for understanding long videos and answering complex questions. Previous approaches mainly target vision-language scenarios and focus on explicit temporal grounding questions, such as identifying when a visual event occurs or determining what event happens at a specific time. However, they often make insufficient use of the audio modality and overlook implicit temporal grounding across modalities (for example, identifying what is visually present when a character speaks, or determining what is said when a visual event occurs), even though such cross-modal temporal relations are prevalent in real-world scenarios. In this paper, we propose ChronusOmni, an omni large language model designed to enhance temporal awareness for both explicit and implicit audiovisual temporal grounding. First, we interleave text-based timestamp tokens with visual and audio representations at each time unit, enabling unified temporal modeling across modalities. Second, to enforce correct temporal ordering and strengthen fine-grained temporal reasoning, we incorporate reinforcement learning with specially designed reward functions. Moreover, we construct ChronusAV, a temporally accurate, modality-complete, and cross-modal-aligned dataset to support training and evaluation on the audiovisual temporal grounding task. Experimental results demonstrate that ChronusOmni achieves state-of-the-art performance on ChronusAV, with an improvement of more than 30%, and top results on most metrics on other temporal grounding benchmarks. These results highlight the strong temporal awareness of our model across modalities, which it achieves while preserving general video and audio understanding capabilities.
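
The interleaving step described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `<t=...s>` timestamp format, the visual-then-audio ordering within a unit, and the use of string placeholders in place of real visual and audio embeddings are all hypothetical.

```python
from typing import List, Sequence


def format_timestamp(seconds: float) -> str:
    """Render a time offset as a plain-text timestamp token (hypothetical format)."""
    return f"<t={seconds:.1f}s>"


def interleave_by_time_unit(
    visual_units: Sequence[Sequence[str]],
    audio_units: Sequence[Sequence[str]],
    unit_seconds: float = 1.0,
) -> List[str]:
    """Build one sequence: timestamp, visual tokens, audio tokens, per time unit.

    Both inputs are lists with one entry per time unit; each entry holds the
    placeholder tokens for that unit. The ordering inside a unit is an assumption.
    """
    if len(visual_units) != len(audio_units):
        raise ValueError("visual and audio streams must cover the same time units")

    sequence: List[str] = []
    for index, (visual, audio) in enumerate(zip(visual_units, audio_units)):
        # The text timestamp marks the start of the unit for both modalities.
        sequence.append(format_timestamp(index * unit_seconds))
        sequence.extend(visual)
        sequence.extend(audio)
    return sequence


# Two time units of 2 seconds each, with placeholder tokens.
visual = [["v0a", "v0b"], ["v1a", "v1b"]]
audio = [["a0"], ["a1"]]
tokens = interleave_by_time_unit(visual, audio, unit_seconds=2.0)
print(tokens)
```

Because every time unit opens with the same kind of text token, the visual and audio content of that unit share a single temporal anchor, which is one way to read the abstract's phrase "unified temporal modeling across modalities".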


Comments:	Code available at this https URL
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2512.09841 [cs.CL]
	(or arXiv:2512.09841v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2512.09841 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yijing Chen [v1] Wed, 10 Dec 2025 17:22:42 UTC (2,029 KB)
