Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
arxiv.org·6d

Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source to allow precise editing, mixing, and creative control. However, current approaches generate a single source-mixed sound at once, largely because visual features are entangled and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates video encoder t…
