Time-Frequency Weighted Losses for Phoneme Reconstruction in DNN-Based Speech Enhancement

Conventional training losses for speech enhancement based on the signal-to-distortion ratio (SDR) weight all time-frequency (TF) regions uniformly, overlooking the fine-grained spectral cues that matter for the intelligibility of specific phonemes. We propose a TF weighting framework that modulates the SDR objective according to local speech presence, speech-to-interference ratio (SIR), and spectral flux. By integrating these factors into a differenti...
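
To make the idea of a TF-weighted SDR concrete, here is a minimal NumPy sketch. It is not the authors' formulation, which the truncated abstract does not give. The function names `tf_weights` and `weighted_sdr_db`, the additive combination of the three factors, the coefficients `alpha`, `beta`, `gamma`, and the per-bin definition of spectral flux are all assumptions made for illustration.

```python
import numpy as np

def tf_weights(speech_mag, noise_mag, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-8):
    """Hypothetical per-bin weights; inputs are magnitudes of shape (frames, bins)."""
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2

    # Speech presence (assumption): target power relative to the utterance peak.
    presence = speech_pow / (speech_pow.max() + eps)

    # Local SIR in dB, squashed to (0, 1) with a logistic function (assumption).
    sir_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    sir_term = 1.0 / (1.0 + np.exp(-sir_db / 10.0))

    # Spectral flux (assumption): positive frame-to-frame magnitude increase per bin.
    flux = np.zeros_like(speech_mag)
    flux[1:] = np.maximum(speech_mag[1:] - speech_mag[:-1], 0.0)
    flux = flux / (flux.max() + eps)

    # Additive combination, normalised to mean 1 so the loss scale stays comparable.
    w = 1.0 + alpha * presence + beta * sir_term + gamma * flux
    return w / w.mean()

def weighted_sdr_db(target_mag, estimate_mag, weights, eps=1e-8):
    """Weighted SDR in dB; with all-ones weights it reduces to the unweighted ratio."""
    signal = np.sum(weights * target_mag ** 2)
    distortion = np.sum(weights * (target_mag - estimate_mag) ** 2)
    return 10.0 * np.log10((signal + eps) / (distortion + eps))
```

With uniform weights the function reduces to an ordinary magnitude-domain SDR, so the weighting can be read as a reallocation of emphasis across TF bins. For training, the same arithmetic would presumably be written in an autodiff framework and negated to serve as a loss.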
