JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

View PDF HTML (experimental)

Abstract:We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model perfor…

View PDF HTML (experimental)

Abstract:We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.


Comments:	UniReps: Unifying Representations in Neural Models (NeurIPS 2025 Workshop)
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2512.07168 [cs.SD]
	(or arXiv:2512.07168v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2512.07168 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Georgios Ioannides [view email] [v1] Mon, 8 Dec 2025 05:01:51 UTC (786 KB)

Submission history

Similar Posts