Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and lo...
