Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, Yong Dou This work was supported by National University of Defense Technology ZZCX-ZZGC-01-04 (Corresponding author: Kele Xu). Yi Su, Qisheng Xu, Kele Xu, and Yong Dou are with the College of Computer Science and Technology, Changsha, China. Jisheng Bai is with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China.
Abstract
Audio-Language Models (ALMs), which are trained on audio-text data, focus on the processing, understanding, and reasoning of sounds. Unlike traditional supervised learning approaches learning from predefined labels, ALMs utilize natural language as a supervision signal, which is more suitable for describing complex real-world audio recordings. ALMs demonstrate strong zero-shot capabilities and can be flexibly adapted to diverse downstream tasks. These strengths not only enhance the accuracy and generalization of audio processing tasks but also promote the development of models that more closely resemble human auditory perception and comprehension. Recent advances in ALMs have positioned them at the forefront of computer audition research, inspiring a surge of efforts to advance ALM technologies. Despite rapid progress in the field of ALMs, there is still a notable lack of systematic surveys that comprehensively organize and analyze developments. This deficiency not only limits researchers’ comprehensive understanding and evaluation of existing technologies but also hinders the rapid adoption and improvement of new methods. In this paper, we present a comprehensive review of ALMs with a focus on general audio tasks, aiming to fill this gap by providing a structured and holistic overview of ALMs. Specifically, we cover: (1) the background of computer audition and audio-language models; (2) the foundational aspects of ALMs, including prevalent network architectures, training objectives, and evaluation methods; (3) foundational pre-training and audio-language pre-training approaches; (4) task-specific fine-tuning, multi-task tuning and agent systems for downstream applications; (5) datasets and benchmarks; and (6) current challenges and future directions. Our review provides a clear technical roadmap for researchers to understand the development and future trends of existing technologies, offering valuable references for implementation in real-world scenarios.
Index Terms:
Multimodal Machine Learning, Audio-language Model, Pre-training, Downstream Transfer.
I Introduction
Enabling machines to hear like humans and process audio-centric tasks has long been a significant challenge [1]. Audio-Language Models (ALMs), which are trained on audio-text data, focus on the processing, understanding, and reasoning of sounds. This area is emerging as a prominent research field at the intersection of audio processing and Natural Language Processing. ALMs are not only applicable to basic audio tasks, such as audio classification [2], but also show great potential for more complicated scenarios. These include tasks such as audio-text retrieval [3], audio generation [4], automatic audio captioning [5], audio source separation [6], automatic speech translation [7], and audio chatbots [8].
In contrast to audio representation learning based on labeled data for specific tasks, ALMs can learn from more descriptive textual information, expanding the scope of supervision to include human-annotated captions and readily available titles and descriptions from web sources [9]. Natural language is well-suited for characterizing real-world audio, which frequently involves multiple overlapping sound events, thereby enabling models to learn their intrinsic relationships [10]. Furthermore, using natural language as supervision avoids the model’s reliance on task-specific predefined labels, enhancing the potential for models to generalize effectively to open-world scenarios.
As large language models (LLMs) exhibit remarkable comprehension capabilities, researchers have explored their integration as guiding components within ALMs. However, pre-trained LLMs still face challenges in generalizing across a broad spectrum of downstream tasks [11], necessitating additional transfer steps such as post-training and collaboration with other foundational models. Within this research landscape, language provides a unified mechanism for constructing instances, enabling LLMs to undergo instruction tuning and in-context learning across diverse tasks. This approach bridges the gap between auditory information and language understanding, facilitating the alignment of multiple components within ALMs. Furthermore, language serves as a versatile human-machine interface, empowering users to instruct LLM agents to collaborate effectively with audio-language systems.
Despite the strong interest shown by the audio community in ALMs, there is still a lack of comprehensive surveys to review the current research status. Existing relevant reviews include speech-language models [12, 13], codec-based models [14], ALMs for specific tasks such as audio-text retrieval [15], automated audio captioning [16], speech-to-text translation [17], and audio-language datasets [18]. Here, we present the first comprehensive survey on ALMs, aiming to achieve an exhaustive coverage of the entire ALM research landscape from the perspective of model training. Additionally, we adopt a perspective centered on general audio-centric tasks that encompasses a diverse range of audio types to provide a more detailed reflection of the current state and development of computer audition. This survey method reflects the mutual promotion and constraints among different research aspects from model to data, aids in systematically summarizing challenges and future directions, and serves as a guide for researchers and practitioners interested in ALM techniques, thereby facilitating further academic research and industrial applications in the field.
We first look at recent advances in ALM research and draw the timeline shown in Fig. 1. CLAP [2] is considered a significant milestone. Previous work includes some audio-caption datasets [19, 20, 21], which were initially used for training automatic audio captioning models and also served as data foundations for ALMs, inspiring subsequent work. Since the introduction of pre-training and large-scale datasets [22], the advantages of ALMs have gradually gained attention. Recently, numerous new works have emerged, primarily reflecting the intertwined development between pre-training and downstream models. With increasing model research, recent studies have focused on the lack of unified evaluation standards and proposed various benchmarks. This shows a high correlation among datasets, pre-training, downstream models, and benchmark research in ALMs. Additionally, we observe that, driven by commercial applications, research interests have shifted more towards the speech domain. However, audio typically encompasses a variety of environmental events, including human voices, natural sounds, music rhythms, etc., which presents significant challenges to general audio modeling [23].
Figure 1: A timeline of recent advances in audio-language models. It is established mainly according to release dates (e.g., the date of submission to arXiv), and some of the listed works are still in progress. It highlights that datasets serve as the foundation for inspiring research in pre-training and downstream models. With the advancement of model research, recent studies have developed several benchmarks to promote comprehensive development in the field.
In the subsequent sections of this paper, we first introduce the background of audio-language pre-training and transfer paradigm (Section II). We then describe the foundations of ALMs, including model architecture, training objectives, and evaluation methods (Section III). Following this, we review the topics of representation pre-training (Section IV), downstream transfer (Section V), and related data (Section VI). Building on these foundations, we discuss the challenges and future research directions (Section VII), before concluding the paper (Section VIII).
II Background
This section begins by discussing the development of computer audition paradigms, with a particular focus on how ALMs are trained and transferred to downstream tasks, as well as the reasons for the shift towards the audio-language paradigm. We then introduce the training stages and establish a research landscape for ALMs, providing a structured basis for the comprehensive review in the following sections.
II-A Pre-training and Transfer Paradigm
The pre-training and transfer paradigm involves initially training on large-scale public datasets to get robust representations, and then applying knowledge gained from one context to another to enhance the performance on downstream tasks. This approach accelerates supervised learning on downstream tasks.
However, as this paradigm evolves, two challenges emerge. First, models may overfit by exploiting simple label mappings, achieving high performance on specific tasks without truly understanding the underlying audio content [24], leading to poor generalization to new data. Second, the high cost of manual annotation makes it difficult to obtain sufficient labeled data for learning audio representations [25].
To address these challenges, ALMs have been proposed to learn audio concepts through natural language supervision [2]. Firstly, this form of supervision provides more details about the audio, enabling models to understand the meanings and make decisions accordingly like a human. For example, natural language can describe the temporal order of multiple events using words such as ‘simultaneous,’ ‘before,’ and ‘after’ [26], better reflecting the complex composition of audio compared to predefined labels and helping models learn their intrinsic relationships [10]. Additionally, audio-text data is easier to obtain than well-defined labeled datasets, effectively expanding the scale of datasets. For instance, we can use ‘dog’ or ‘barking’ to label a dog barking, but inconsistencies among multiple annotators make it difficult to create a perfectly accurate audio dataset. ALMs, by contrast, can leverage the natural language processing capabilities of pre-trained models to extract similar semantic features from different forms of description. Besides human-annotated captions and translations, titles and descriptions related to audio found abundantly on the web can also serve as sources of text annotation.
II-B Audio-Language Training Stages
As data and model sizes grow, the training strategies for ALMs become more intricate. From the viewpoints of representation learning and downstream task application, we categorize the training stages aimed at enhancing task-independent audio representations as pre-training, and define fine-tuning and cooperation with other models, performed before the model is applied to downstream tasks, as part of the transfer process.
ALM pre-training can be further divided into multiple stages, typically including the pre-training of foundational models, followed by audio-language pre-training on paired data. Some approaches may also involve further training on a broader range of data and tasks.
Although ALMs have achieved strong zero-shot capabilities in audio retrieval, transfer remains an important stage for applying models to downstream tasks. Task-specific fine-tuning is one of the most widely used methods. It involves supervised fine-tuning of pre-trained models on downstream task datasets and may require the addition of some adaptive modules. Another category of methods transfers to multiple tasks simultaneously to make the model more universal or to benefit from multi-task knowledge sharing. Unlike task-specific fine-tuning, which focuses directly on task performance, instruction tuning and in-context learning aim to enhance (or unlock) the LLM’s ability to follow human instructions. Essentially, instruction tuning fine-tunes ALMs with a set of formatted instances in natural language form [27], thus helping the model generalize to various downstream tasks. Multi-task transfer can also be achieved by having multiple models cooperate to form an agent system.
II-C Research Landscape
Based on current research and our definition of audio-language training stages, we construct a research landscape for ALMs, as shown in Fig. 2. From the training dimension, ALMs are divided into pre-training and transfer. ALMs achieve multimodal perception by integrating pre-trained audio and language models, then undergo further pre-training on extensive audio-text data. Transfer is crucial for combining these models with other networks and applying them to various downstream tasks. Data is an essential element for model training and evaluation. Different types of datasets can be utilized at various stages of training, and benchmarks provide unified and comprehensive standards for model evaluation, playing an important role in optimizing the models. Therefore, research on ALMs can be developed in three fields: (a) pre-training for representation learning, (b) downstream transfer, and (c) datasets and benchmarks.
Figure 2: Research landscape for audio-language models. From the perspective of model training, (a) audio-language representation requires pre-training (Sec. IV), (b) transfer to downstream application through task-specific fine-tuning or instruction tuning (Sec. V), (c) data is the foundation for model training, and they can be divided into labeled audio datasets, audio-text paired datasets, and audio question answering datasets (Sec. VI).
Within the scope of the research landscape, we designed a review outline as shown in Fig. 3. We first provide an overview of the foundations of ALMs and then comprehensively review related work from the three research fields. According to the progress across these areas, we systematically propose challenges and future directions for ALMs.
Figure 3: Research outline on audio-language models for audio-centric tasks.
III Foundations
In this section, we will introduce the general foundations of ALMs, including commonly-used architectures, training objectives, and evaluation methods.
III-A ALM Architectures
Audio-language models and systems typically comprise audio and language encoders, and may include other multimodal alignment mechanisms and language models. As shown in Fig. 4, current ALMs can generally be divided into four types: Two Towers, Two Heads, One Head, and Cooperated Systems.
III-A1 Two Towers
Two Towers is the basic form of ALMs, with one encoder and a projector for each modality; the resulting embeddings are aligned in a joint space. The most prominent landmark in pre-training research of this type is Contrastive Language-Audio Pretraining (CLAP), which incorporates a contrastive learning framework to bring audio and text descriptions into a joint multimodal space, learning the mapping relationship between the two modalities [2]. Furthermore, based on the concept of modality alignment, mechanisms can be added between the two independent encoders to facilitate communication, with the aim of achieving early-stage modality fusion during the representation phase [28].
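To make the idea of a joint space concrete, the following is a minimal sketch of how aligned embeddings from the two towers could be compared, for example for audio-text retrieval. It assumes that both towers have already produced projected embeddings; the vectors, captions, and function names (`cosine`, `rank_captions`) are hypothetical and chosen only for illustration, not taken from any specific model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors in the joint embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_captions(audio_embedding, caption_embeddings):
    """Return captions sorted from most to least similar to the audio embedding."""
    scores = {c: cosine(audio_embedding, e) for c, e in caption_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the text tower (encoder + projector) for three captions.
caption_embeddings = {
    "a dog barking": [0.9, 0.1, 0.0],
    "rain falling on a roof": [0.1, 0.9, 0.2],
    "a piano melody": [0.0, 0.2, 0.9],
}
# Hypothetical output of the audio tower for one recording.
audio_embedding = [0.8, 0.3, 0.1]

ranking = rank_captions(audio_embedding, caption_embeddings)
print(ranking[0])
```

Because the comparison only needs embeddings in the shared space, the same routine could in principle score captions that were never used as training labels, which is one way to read the zero-shot behavior described for this architecture.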
III-A2 Two Heads
Two Heads is a mainstream form that utilizes one encoder and a projector for each modality, with a language model on top. Here, ‘Head’ refers to a network that maps the representation space of a given modality into a unified space [29, 30, 31]. Language modeling was first shown to possess strong capabilities in semantic feature extraction within the field of speech [32], making it a natural design choice to incorporate language models into ALMs. With the development of large language models, many works have utilized LLMs as the backbone for ALM inference, expanding the perceptual modalities of large language models and leveraging their emergent understanding capabilities. This has led to classic works such as SpeechGPT [8], Pengi [1], and Qwen-Audio [33], making Two Heads a unified architecture of Large Audio-Language Models. In this structure, modality fusion can also be promoted through communication mechanisms between encoders [34]. It is important to note that in some works, text inputs may only undergo tokenization without the need for a dedicated text encoder, and these models can be considered a special type of the Two Heads framework.
III-A3 One Head
One Head is a unified multimodal input form that uses one encoder to handle two different modalities simultaneously, with a language model on top. In the vision community, a line of work has studied the One Head architecture based on the view that a single multimodal processing module can achieve better alignment, that is, that one unified space can represent both modalities. However, there are relatively few related studies in the audio-language field [35].
III-A4 Cooperated Systems
This system employs an LLM as a planning agent and comprises various model types mentioned above. Its design facilitates the selection and utilization of each model’s complementary strengths, tailored to downstream task requirements. Through the collaboration of these diverse models, the system can tackle a wider array of complex tasks than a single model [36].
Figure 4: Typical architectures of audio-language models. (a) Two Towers, with one encoder and a projector for each modality; the embeddings are aligned in a joint space. (b) Two Heads, which adds a language model on top. (c) One Head, with one unified encoder and a language model. (d) Cooperated Systems, which utilize LLMs as agents to coordinate several models.
III-B Training Objectives
Figure 5: Illustration of audio-language models training objectives. (a) Pre-training objectives include contrastive, generative, and discriminative objectives, which may be conducted on audio-text or single-modal data. The transfer objectives can be (b) task-specific fine-tuning objectives or (c) generative language modeling objective.
Training objectives are used to guide model learning during pre-training and transfer. As shown in Fig. 5(a), contrastive, generative, or discriminative pre-training objectives guide the model to learn pretext tasks on audio, text, or audio-text paired data, aiming to learn audio semantic features and audio-language correlations. As illustrated in Fig. 5(b), task-specific fine-tuning, a commonly adopted transfer method, employs either generative or discriminative objectives depending on the context. Another line of transfer methods, which uses generative language models as in Fig. 5(c), aims to improve (or unlock) pre-trained models’ generalization ability on downstream tasks through standard language modeling objectives. Note that the above training objectives can be used in combination.
III-B1 Contrastive Objectives
It is the most commonly used type of training objective in audio-language pre-training, which aims to train the model to bring positive sample pairs closer together and push negative sample pairs further apart within a shared embedding space for audio and text, thereby learning the audio-language correlations and obtaining distinguishable representations between audio samples. The most widely implemented approach for this category of objective is using a symmetric audio-text infoNCE [37] loss function to measure the similarity between audio and text embeddings. Let the $i$-th sample pair be $\{x_i, t_i\}$. Given an audio encoder $h_a(\cdot)$ and a text encoder $h_t(\cdot)$, the embedding vectors for the audio sample $x_i$ and its corresponding caption $t_i$ can be represented as:
$$z_i^{a} = h_a(x_i) \tag{1}$$
$$z_i^{t} = h_t(t_i) \tag{2}$$
The similarity between audio and text embeddings is calculated using the dot product. The infoNCE loss for the audio dimension, $l_a$, is defined as the average of a normalized function measuring the similarity of different texts to the same audio query. Similarly, the contrastive loss for the text dimension, $l_t$, measures the similarity of different audios to the same text query. For a batch with $B$ audio-text pairs, we have:
$$l_i^{a} = -\log \frac{\exp\left(z_i^{a} \cdot z_i^{t} / \tau\right)}{\sum_{j=1}^{B} \exp\left(z_i^{a} \cdot z_j^{t} / \tau\right)} \tag{3}$$
$$l_i^{t} = -\log \frac{\exp\left(z_i^{t} \cdot z_i^{a} / \tau\right)}{\sum_{j=1}^{B} \exp\left(z_i^{t} \cdot z_j^{a} / \tau\right)} \tag{4}$$
where $\tau$ represents a temperature parameter used to scale the range of logits. When setting the contrastive objective to be completely symmetrical, the total loss for the audio-text pairs in one batch can be defined as:
$$\mathcal{L}_{con} = \frac{1}{2B} \sum_{i=1}^{B} \left(l_i^{a} + l_i^{t}\right) \tag{5}$$
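The symmetric objective in Eqs. (3) to (5) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than any particular model's implementation: the embeddings are toy vectors standing in for the encoder outputs, the helper names are hypothetical, and the default temperature of 0.07 is an assumed value chosen only for the example.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_row(query, keys, positive_index, tau):
    """Per-sample loss of Eq. (3) or Eq. (4): -log softmax of the positive pair."""
    logits = [dot(query, k) / tau for k in keys]
    m = max(logits)  # subtract the maximum for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - logits[positive_index]

def symmetric_contrastive_loss(audio_emb, text_emb, tau=0.07):
    """Eq. (5): average of audio-to-text and text-to-audio losses over a batch."""
    batch_size = len(audio_emb)
    total = 0.0
    for i in range(batch_size):
        l_a = info_nce_row(audio_emb[i], text_emb, i, tau)  # audio i as query
        l_t = info_nce_row(text_emb[i], audio_emb, i, tau)  # text i as query
        total += l_a + l_t
    return total / (2 * batch_size)

# Toy batch: pair i of `text_aligned` matches audio i; `text_swapped` does not.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch, the correctly paired embeddings give a loss close to zero, while swapping the captions gives a much larger loss, which is the behavior the objective is meant to reward. When all embeddings are identical, every candidate is equally likely and the loss equals $\log B$.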
III-B2 Generative Objectives
Generative methods have proven to be powerful and effective in audio representation learning. They lead the network in learning semantic features of audio through pretext tasks such as masked reconstruction [38]. In audio-language pre-training, similar approaches are introduced, guiding representation learning through audio or audio-related language generation tasks. These methods are often combined with contrastive learning to bolster the robustness of learned audio embeddings or improve computational efficiency. During transfer, these generative objectives can help the model adapt to corresponding generative tasks and are widely used in transfer with generative LLMs.
During pre-training, the most common method for audio mask reconstruction is based on the audio spectrogram. Let $M(\cdot)$ denote the masking operation, and let $f_a(\cdot)$ and $p_{ae}(\cdot)$ represent the spectrogram encoder and audio embedding projection layer, respectively. To achieve masked spectrogram prediction, an additional decoder $f_a^{-1}(\cdot)$ is added to the model. For an audio sample with the original spectrogram $a$, spectrogram reconstruction can be represented as $\hat{a} = f_a^{-1}(p_{ae}(f_a(M(a))))$. Let $\hat{a}_n$ and $a_n$ denote the decoder's prediction for the $n$-th masked spectrogram patch and the original true patch, respectively. For a spectrogram divided into $N$ patches, the audio reconstruction loss used for self-supervision can be defined as minimizing the $L2$ (mean squared error, MSE) loss:
$$\mathcal{L}_{ar} = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{a}_n - a_n \right\|_2 \tag{6}$$
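As an illustration of the masked reconstruction objective, the sketch below masks a subset of spectrogram patches and scores a reconstruction against the original patches. It makes several assumptions that are not stated in the text: masking is implemented by zeroing patches, the loss is averaged over the masked patches only, and the per-patch distance is the squared Euclidean distance, following the "mean squared error" wording. The encoder, projector, and decoder are not modeled, and all function names are hypothetical.

```python
import random

def mask_patches(patches, mask_ratio, rng):
    """A simple stand-in for M(.): zero out a random subset of patches.

    Returns the masked copy and the sorted indices of the masked patches.
    """
    n_masked = max(1, int(len(patches) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(patches)), n_masked))
    masked = [
        [0.0] * len(p) if i in masked_idx else list(p)
        for i, p in enumerate(patches)
    ]
    return masked, masked_idx

def reconstruction_loss(predicted, target, masked_idx):
    """Mean squared distance between predicted and true patches (assumed form of Eq. 6)."""
    total = 0.0
    for n in masked_idx:
        total += sum((p - t) ** 2 for p, t in zip(predicted[n], target[n]))
    return total / len(masked_idx)

# Toy spectrogram split into four patches of two values each.
patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked, masked_idx = mask_patches(patches, mask_ratio=0.5, rng=random.Random(0))

# A perfect decoder would reproduce the original patches exactly.
perfect_loss = reconstruction_loss(patches, patches, masked_idx)
# A trivial decoder that returns the masked input leaves the masked patches at zero.
trivial_loss = reconstruction_loss(masked, patches, masked_idx)
print(perfect_loss, trivial_loss > perfect_loss)
```

The comparison between the perfect and trivial reconstructions shows why the objective forces the encoder to infer the masked content from the visible context: simply copying the masked input is penalized.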
Since ALMs include both audio and language modalities as inputs, some works have similarly designed masked cross-modal reconstruction tasks, which typically involve methods such as cross-attention mechanisms to communicate between the encoders of the two modalities and perform reconstruction on the audio representation.
During audio generation transfer, training objectives essentially enhance the model’s performance by minimizing the distance between the predicted embedding z^^𝑧\hat{z}over^ start_ARG italic_z end_ARG and its corresponding ground truth z𝑧zitalic_z. This distance metric can be chosen based on the situation, with common options including L1𝐿1L1italic_L 1 and L2𝐿2L2italic_L 2 distances. The training objective can also be set as a weighted sum of multiple distances. For an audio sample, generative audio modeling objective can be represented as:
| ℒam=1T1L∑t=1T∑l=1Lα‖z^t,l−zt,l‖1+β‖z^t,l−zt,l‖2subscriptℒ𝑎𝑚1𝑇1𝐿superscriptsubscript𝑡1𝑇superscriptsubscript𝑙1𝐿𝛼subscriptnormsubscript^𝑧𝑡𝑙subscript𝑧𝑡𝑙1𝛽subscriptnormsubscript^𝑧𝑡𝑙subscript𝑧𝑡𝑙2\mathcal{L}_{am}=\frac{1}{T}\frac{1}{L}\sum_{t=1}{T}\sum_{l=1}{L}\alpha\left% |\hat{z}_{t,l}-{z}_{t,l}\right|_{1}+\beta\left|\hat{z}_{t,l}-{z}_{t,l}% \right|_{2}caligraphic_L start_POSTSUBSCRIPT italic_a italic_m end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_T end_ARG divide start_ARG 1 end_ARG start_ARG italic_L end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_l = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_α ∥ over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_t , italic_l end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_t , italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_β ∥ over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_t , italic_l end_POSTSUBSCRIPT - italic_z start_POSTSUBSCRIPT italic_t , italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | (7) |
where T𝑇Titalic_T denotes the total number of frames, L𝐿Litalic_L denotes embedding dimension, and α𝛼\alphaitalic_α and β𝛽\betaitalic_β are weight hyperparameters. In addition to the method that uses embedding differences as a training objective, it is also possible to directly train jointly with the decoder network, designing the training objective directly on the predicted audio amplitude. For instance, aiming to learn a decoder net hde(⋅)subscriptℎ𝑑𝑒⋅h_{de}\left(\cdot\right)italic_h start_POSTSUBSCRIPT italic_d italic_e end_POSTSUBSCRIPT ( ⋅ ) that maps known audio xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and query tisubscript𝑡𝑖t_{i}italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to a predicted audio a^isubscript^𝑎𝑖\hat{a}_{i}over^ start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Denote zitsuperscriptsubscript𝑧𝑖𝑡z_{i}{t}italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT as the embedding of the language query, the training objective could be to minimize the L1𝐿1L1italic_L 1 (mean absolute error, MAE) loss between the amplitude spectrogram |ai|subscript𝑎𝑖|a_{i}|| italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | of the ground truth target audio source and the predicted |a^i|subscript^𝑎𝑖|\hat{a}_{i}|| over start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |:
| |a^i|=hde(zit)subscript^𝑎𝑖subscriptℎ𝑑𝑒superscriptsubscript𝑧𝑖𝑡|\hat{a}_{i}|=h_{de}\left(z_{i}{t}\right)| over start_ARG italic_a end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | = italic_h start_POSTSUBSCRIPT italic_d italic_e end_POSTSUBSCRIPT ( italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) | (8) |
| $\mathcal{L}^{\prime}_{am}=\sum_{i=1}^{B}\left\||a_{i}|-|\hat{a}_{i}|\right\|_{1}$ | (9) |
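The two regression objectives in Eqs. (7) and (9) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, embeddings and spectrograms are nested lists, and the $\|\cdot\|_{2}$ term of Eq. (7) is taken here as a squared element-wise difference, which the text does not specify.

```python
def embedding_regression_loss(z_hat, z, alpha=1.0, beta=1.0):
    """Eq. (7): average over T frames and L dimensions of a weighted
    L1 term and L2 term (assumption: L2 term = squared difference)."""
    T, L = len(z), len(z[0])
    total = 0.0
    for frame_hat, frame in zip(z_hat, z):
        for v_hat, v in zip(frame_hat, frame):
            diff = v_hat - v
            total += alpha * abs(diff) + beta * diff * diff
    return total / (T * L)


def amplitude_l1_loss(amp_true, amp_pred):
    """Eq. (9): sum over the batch of the L1 distance between the
    ground-truth and predicted amplitude spectrograms."""
    total = 0.0
    for a, a_hat in zip(amp_true, amp_pred):      # one spectrogram per sample
        for row, row_hat in zip(a, a_hat):        # one row per frequency bin
            total += sum(abs(v - v_hat) for v, v_hat in zip(row, row_hat))
    return total
```

In this reading, $\alpha$ and $\beta$ simply trade off robustness to outliers (the L1 term) against a stronger penalty on large deviations (the squared term).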
Generative language modeling objectives guide ALMs to generate audio-related text that is consistent with the ground truth. On one hand, they force the model to learn audio-language correlations, which promotes representation learning and helps improve performance on corresponding downstream tasks (e.g., automatic caption generation). On the other hand, as the standard loss for generative language modeling, they are also commonly used during ALM transfer with a language model [39].
An additional text decoder (a pre-trained language model) is required for language generation. When an autoregressive language model is used to predict the tokenized text associated with a given audio sample $x$, the language modeling objective is defined as minimizing the negative log-likelihood of the current ground-truth token (cross-entropy, CE loss), given the previous ground-truth tokens:
| $\mathcal{L}_{lm}=-\frac{1}{T}\sum_{t=1}^{T}\log P_{\theta}\left(y_{t}\mid y_{1:t-1},x\right)$ | (10) |
Here, $y_{t}$ is the $t$-th ground-truth token of the given caption $y$, $T$ is the total length of the caption, and $\theta$ represents the model's learnable parameters. Non-autoregressive language models also adopt a similar negative log-likelihood objective without temporal averaging.
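As a rough illustration of Eq. (10), the sketch below computes the averaged negative log-likelihood from per-step token distributions. It assumes teacher forcing, so that `step_probs[t]` already stands for $P_{\theta}(\cdot\mid y_{1:t-1},x)$ as produced by some decoder; the function name and the list-based inputs are hypothetical.

```python
import math


def lm_loss(step_probs, target_ids):
    """Eq. (10): mean negative log-likelihood of the ground-truth tokens.

    step_probs[t] is the predicted distribution over the vocabulary at
    step t (conditioned on the audio and the previous ground-truth tokens);
    target_ids[t] is the index of the ground-truth token y_t.
    """
    T = len(target_ids)
    nll = 0.0
    for probs, y_t in zip(step_probs, target_ids):
        nll -= math.log(probs[y_t])
    return nll / T
```

A model that predicts a uniform distribution over a vocabulary of size $V$ at every step yields a loss of $\log V$, which is a convenient sanity check.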
III-B3 Discriminative Objectives
Discriminative objectives guide the model to predict the correct label and can be broadly categorized into classification and retrieval objectives. Here, we take the cross-entropy function as an example to uniformly calculate the loss between the predicted output and the ground truth.
Audio classification is one of the most extensively studied downstream tasks. It aims to recognize patterns from specific audio inputs to predict given labels. For a batch of $B$ audio samples, the objective can be expressed as:
| $\mathcal{L}_{cls}=-\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{C}y_{i,c}\log(\hat{p}_{i,c})$ | (11) |
where $C$ is the number of classes, $y_{i,c}$ is the true label (0 or 1) of the $i$-th sample for class $c$, and $\hat{p}_{i,c}$ is the predicted probability of the $i$-th sample for class $c$.
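Eq. (11) can be sketched as follows. The text only requires predicted probabilities $\hat{p}_{i,c}$; obtaining them from logits through a softmax is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits_batch, onehot_batch):
    """Eq. (11): cross-entropy averaged over a batch of B samples.

    onehot_batch[i][c] is y_{i,c} (0 or 1); softmax(logits_batch[i])[c]
    plays the role of the predicted probability p_hat_{i,c}.
    """
    B = len(logits_batch)
    total = 0.0
    for logits, y in zip(logits_batch, onehot_batch):
        p_hat = softmax(logits)
        total -= sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p_hat))
    return total / B
```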
Audio-Text Retrieval (ATR) aims to find matching items between audio clips and textual descriptions. Given a query in one modality (audio or text), the goal is to retrieve the corresponding item from a pool of candidates in the other modality. Here, we use a scoring function $S(\cdot)$ to represent the model's prediction output by measuring the correlation between audio and text. Denoting $Y$ as a set of $m$ candidate caption texts, the caption corresponding to a given audio $x_{i}$ is
| $\hat{y}_{i}=\arg\max_{y_{j}\in Y}\frac{\exp(S(z_{i}^{a},z_{j}^{t}))}{\sum_{k=1}^{m}\exp(S(z_{i}^{a},z_{k}^{t}))}$ | (12) |
Retrieval tasks can then be considered as instance-level classification, so the objective can be formulated as:
| $\mathcal{L}_{atr}=-\sum_{i=1}^{B}\log(\hat{y}_{i})$ | (13) |
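A minimal sketch of Eqs. (12) and (13) follows. Two assumptions are made: the scoring function $S$ is taken to be cosine similarity (the text leaves it generic), and Eq. (13) is read as the negative log of the softmax probability assigned to the ground-truth caption, in line with the instance-level classification view. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as a stand-in for the score S."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_probs(z_audio, z_texts):
    """Softmax over the scores of one audio embedding against m captions."""
    scores = [cosine(z_audio, z_t) for z_t in z_texts]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(z_audio, z_texts):
    """Eq. (12): index of the caption with the highest softmax probability."""
    probs = caption_probs(z_audio, z_texts)
    return max(range(len(probs)), key=probs.__getitem__)


def retrieval_loss(z_audios, z_texts, true_idx):
    """Eq. (13), as interpreted above: summed negative log-probability
    of the ground-truth caption for each audio in the batch."""
    return -sum(
        math.log(caption_probs(z_a, z_texts)[j])
        for z_a, j in zip(z_audios, true_idx)
    )
```

Text-to-audio retrieval follows the same pattern with the roles of the two modalities swapped.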
In particular, audio-text matching is a pretext task designed to enforce a more fine-grained alignment between audio and text embeddings than contrastive pre-training. It trains the model to predict whether a given text correctly describes a provided audio, and can be seen as a binary classification task that requires the model to determine whether an audio-language pair is a match or not. The matching objective can be defined as:
| $\mathcal{L}_{mat}=p\log\mathcal{S}\left(z^{a},z^{t}\right)+(1-p)\log\left(1-\mathcal{S}\left(z^{a},z^{t}\right)\right)$ | (14) |
Here, $p$ is 1 if the audio and text are paired, otherwise it is 0.
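The matching objective in Eq. (14) can be sketched as below. As written, the expression is the log-likelihood of the match label; the sketch returns its negative as the quantity to minimize, which is the usual binary cross-entropy convention and an assumption about the intended sign. Mapping a raw logit to a score in $(0,1)$ with a sigmoid is likewise an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(logit):
    """Maps a raw matching logit to a score in (0, 1) (assumed form of S)."""
    return 1.0 / (1.0 + math.exp(-logit))


def matching_log_likelihood(score, p):
    """Eq. (14) as written: score = S(z^a, z^t), p = 1 if paired else 0."""
    return p * math.log(score) + (1 - p) * math.log(1.0 - score)


def matching_loss(score, p):
    """Negative of Eq. (14), i.e. binary cross-entropy to be minimized."""
    return -matching_log_likelihood(score, p)
```

For a paired example ($p=1$) the loss falls as the score approaches 1, and for an unpaired example ($p=0$) it falls as the score approaches 0.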
III-C Evaluation Methods
Model evaluation aims to fairly measure the performance of models under the same experimental setup and tasks. The evaluation methods for ALMs mainly include zero-shot (ZS), linear probe, supervised fine-tuning, and instruction-following evaluation. Each of these methods has its own focus, and together they form the basis for a comprehensive performance evaluation of ALMs.
III-C1 Zero-Shot Evaluation
Zero-shot evaluation focuses on assessing the ability of contrastive ALMs in open-set retrieval. The zero-shot prediction is primarily conducted by measuring the similarity between audio and text embeddings. Notably, aside from direct text-to-audio or audio-to-text retrieval, labels can also be regarded as a special form of language, which allows for zero-shot evaluation on classification tasks such as sound event detection and emotion recognition.
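Zero-shot classification by embedding similarity can be sketched as follows. The sketch assumes that each class label has already been turned into a text embedding by the ALM's text encoder and that cosine similarity is the similarity measure; the encoders themselves are not shown, and the toy vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_emb, label_embs):
    """Returns the label whose text embedding is most similar to the audio.

    label_embs maps each label name to the text embedding of that label;
    no task-specific training is involved.
    """
    return max(label_embs, key=lambda name: cosine(audio_emb, label_embs[name]))


# Toy embeddings standing in for real encoder outputs (made up for illustration).
label_embs = {"dog bark": [1.0, 0.0, 0.0], "siren": [0.0, 1.0, 0.0]}
prediction = zero_shot_classify([0.8, 0.1, 0.1], label_embs)
```

Because the label set enters only through text embeddings, new classes can be added at evaluation time without retraining.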
III-C2 Linear Probe Evaluation
Linear probing is a common experimental setup for evaluating pre-trained models, and it is used to assess the audio representation of ALMs. It involves adding a linear head (usually an MLP) to the frozen pre-trained model and training the head on downstream tasks, allowing the model to be adapted to specific tasks and datasets. Although this simple transfer learning setup may not achieve optimal performance on specific tasks, it minimizes the variables introduced, hence its widespread adoption for conducting fair representational evaluations. In linear probe evaluation, the selected tasks are usually fundamental linear tasks like classification.
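The linear probe protocol can be illustrated with a toy sketch in which only the head is trained. Everything here is an assumption for illustration: `frozen_encoder` is a fixed stand-in for a pre-trained audio encoder, the head is a single logistic-regression layer trained with plain stochastic gradient descent, and the data are made up.

```python
import math


def frozen_encoder(x):
    """Stand-in for a frozen pre-trained audio encoder (hypothetical).
    It has no trainable state, so probing cannot change the representation."""
    return [x[0] + x[1], x[0] - x[1]]


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fits a binary logistic-regression head on frozen features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            logit = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            grad = p - y                      # d(cross-entropy)/d(logit)
            w = [wi - lr * grad * fi for wi, fi in zip(w, f)]
            b -= lr * grad
    return w, b


def predict(w, b, f):
    """Predicts class 1 when the head's logit is positive."""
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy inputs and binary labels (made up); features come from the frozen encoder.
inputs = [[1.0, 0.0], [0.9, 0.2], [-1.0, 0.1], [-0.8, -0.3]]
labels = [1, 1, 0, 0]
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
```

Since the encoder is never updated, differences in probe accuracy between two models can be attributed to the quality of their frozen representations rather than to the adaptation procedure.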
III-C3 Supervised Fine-tune Evaluation
It further examines the generalization ability of the pre-trained model to downstream tasks and its task-specific performance. For a given downstream task, the audio encoder is unfrozen and fine-tuned along with an attached head. The model’s performance is then validated on the test set and compared with state-of-the-art (SOTA) models for that task. This evaluation approach not only