Introduction
Large language models (LLMs) trained on massive text corpora have performed remarkably well on various natural language understanding and generation tasks1,2,3,4,5,6. In recent years, the exploration of language models (LMs) has gone beyond the domain of natural language processing (NLP), reaching into the realms of biology and its data. The vast amount of sequenced protein data provided fertile ground for training protein LMs, which have since proven to be an extremely valuable asset in protein generation7,8,9 and structure prediction tasks10,11.
Following the success of AlphaFold12 in protein structure prediction, most efforts to apply the ideas originally developed for NLP have focused on proteins. ESM-1b13 was one of the first models to apply self-supervised language modeling to protein data. It was pre-trained on 250M protein sequences and tested on several downstream tasks, including secondary structure and tertiary contact prediction, where it achieved state-of-the-art results. Later, several other protein LMs were proposed and tested on various downstream tasks14,15,16,17. Protein LMs play an important role in protein tertiary structure prediction. ESM-211 and OmegaPLM10 are examples of protein LMs that efficiently replace the multiple sequence alignment (MSA) step in deep learning (DL) methods for structure prediction.
RNAs play crucial roles in fundamental biological processes, including transcription, cell signaling, chromatin remodeling, and genome imprinting. Like proteins, RNAs have recently become an attractive drug target, whose function and interactions with other molecules are closely related to their structure18,19. However, much less attention has been given to applying LMs to RNA-related problems, partly because far less sequence data and corresponding structures are available, and partly because analogous problems tend to be more difficult for RNA than for proteins. Furthermore, RNA structure prediction is severely hindered by the scarcity of high-resolution structural data and the lack of unbiased, diverse sequence alignments, which are often used as a source of evolutionary information20.
Currently, there are only two single-input-sequence RNA foundation models, RNA-FM21 and Uni-RNA22, that have found applications in several structure and function prediction tasks. RNA-FM is a 100M-parameter Transformer encoder based on the original implementation by ref. 23 and trained exclusively on 23.7M non-coding RNAs (ncRNAs) from the RNAcentral database24. The authors of Uni-RNA22 pre-trained an ensemble of LMs ranging from 25M to 400M parameters on a much larger dataset of 1B sequences, with an architecture analogous to the ESM protein LM13 enhanced by several advanced techniques such as RoPE and fused layer norm. They reported that, with their architectures and datasets, downstream-task performance reached a plateau once the model size exceeded 400M parameters. Both foundation models used the standard BERT-style masked language modeling (MLM) pre-training task1. Unlike Uni-RNA, RNA-FM is publicly available.
Besides the RNA-FM and Uni-RNA foundation models, several authors proposed LMs that target a few specific downstream tasks. RNA-MSM25 is an MSA-based BERT-style RNA LM, specialized in particular for secondary structure prediction, that utilizes a set of homologous sequences instead of a single input sequence. However, obtaining the MSAs is a very time-consuming procedure: it takes RNAcmap, the homology search tool used by RNA-MSM, on average 9 h to obtain an MSA for one RNA sequence of length 6025. Ref. 26 proposed SpliceBERT, a BERT-style encoder pre-trained exclusively on more than 2M precursor messenger RNA (pre-mRNA) sequences from different vertebrates for studying RNA splicing. SpliceBERT outperforms DNABERT27, an LM trained only on the human genome, on both human and non-human splice-site prediction tasks, demonstrating the better generalization capability of LMs pre-trained on multiple species. Ref. 28 proposed single-cell BERT (scBERT), a BERT-style LM pre-trained on huge amounts of unlabeled single-cell RNA-seq data for cell type annotation. BigRNA29 is an LM pre-trained on the genomes of 70 individuals on the task of predicting DNA-matched RNA-seq data. BigRNA accurately predicts tissue-specific RNA expression and the binding sites of proteins and microRNAs. UTR-LM30 is an RNA language model for the 5′ untranslated region (5′ UTR) of mRNAs, pre-trained on endogenous 5′ UTRs from multiple species. It is specialized for mRNA translation-related downstream tasks such as translation efficiency and expression level prediction.
Motivated by the recent successes of protein LMs and the latest architectural improvements in LLMs, we propose RiNALMo, a novel RNA language model. We pre-trained RiNALMo on a carefully curated set of 36M ncRNA sequences from the RNAcentral database augmented by several other RNA databases. RiNALMo is a 650M-parameter BERT-style Transformer encoder enhanced with modern architectural techniques such as rotary positional embedding (RoPE)31, the SwiGLU activation function32, and FlashAttention-233. During pre-training, RiNALMo extracts hidden knowledge and captures the underlying structural information embedded within the sequences at the single-nucleotide level. Its output embeddings then serve as a powerful sequence representation that improves performance on various structural and functional RNA downstream tasks compared to other foundation models and state-of-the-art methods. In particular, RiNALMo shows remarkable generalization capability on secondary structure prediction for RNA families not encountered in the training dataset, where other DL methods fail.
The main contributions of the paper are as follows:
We propose RiNALMo, a 650M-parameter RNA LM, which is the largest RNA language model to date and can fully leverage the potential of the vast amount of public unannotated RNA sequences;
We show that RiNALMo’s generalization capability overcomes a key limitation of other DL methods for secondary structure prediction, performing well on RNA families not seen in the training dataset;
We conducted extensive experiments on several RNA structural and functional downstream tasks, whose results show that RiNALMo outperforms other RNA LMs and DL methods on most datasets;
We release the pre-trained and fine-tuned RiNALMo weights and scripts for fine-tuning the model for the downstream tasks.
Results
General-purpose RNA language model
A schematic diagram of RiNALMo, its pre-training procedure, and its downstream tasks is shown in Fig. 1. Our LM is a Transformer encoder focused on understanding and unveiling the RNA code. At the heart of the model is the self-attention mechanism23, which captures important local and global contextual information. We pre-trained RiNALMo using MLM, where we tasked the model with reconstructing corrupted, unlabeled RNA sequences. In this paper, each nucleotide is a single token. To corrupt the input sequence, we randomly mask 15% of the tokens in the training sequence. To reconstruct the masked tokens, RiNALMo’s embeddings are utilized by the MLM prediction head, whose outputs are used in the cross-entropy loss function. More technical details, pre-training details, and the ablation study are given in “Methods” and Supplementary Information.
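As a concrete illustration, the following is a minimal PyTorch sketch of this masking-and-reconstruction objective, assuming a model that maps token ids to per-token logits; the mask token id, the masking details (e.g., any BERT-style 80/10/10 replacement split), and the tensor shapes are simplifications, not RiNALMo’s exact implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 4       # hypothetical vocabulary index of the mask token
MASK_PROB = 0.15  # fraction of tokens corrupted, as described above

def mlm_loss(model, tokens):
    """Mask 15% of nucleotide tokens and score their reconstruction.

    `model` maps token ids (batch, length) to logits (batch, length, vocab).
    """
    mask = torch.rand(tokens.shape) < MASK_PROB  # positions to corrupt
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID

    logits = model(corrupted)

    labels = tokens.clone()
    labels[~mask] = -100  # cross-entropy ignores unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```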
Fig. 1: RiNALMo pre-training and applications.
In the pre-training stage, RiNALMo is trained on unlabeled RNA sequences from several databases using masked language modeling (MLM). To corrupt the input sequence, we randomly mask 15% of the tokens in the training sequence. Before being passed to the Transformer, an RNA sequence is tokenized and turned into a 1280-dimensional vector using a learned input embedding module. The language model comprises 33 Transformer blocks, each consisting of a multi-head attention module and a feed-forward network. Once pre-trained, RiNALMo can be separately fine-tuned for various structural and functional downstream tasks, in which its expressive output embeddings, utilized by the prediction heads, significantly improve performance. In this work, we fine-tuned RiNALMo for secondary structure, multi-species splice-site, mean ribosome loading (MRL), translation efficiency (TE), and expression level (EL) prediction, as well as for ncRNA family classification.
Once pre-trained, RiNALMo’s output embeddings can serve as a powerful sequence representation with embedded structural and evolutionary information. First, its embeddings can be used for visualization and clustering analysis of RNA sequences. Second, such a representation can be used as an enriched input to structural and functional downstream tasks. We employed RiNALMo in several tasks to assess its performance and generalization capabilities. Namely, we show that RiNALMo improves and generalizes well on secondary structure, multi-species splice-site, translation efficiency (TE), expression level (EL), and mean ribosome loading (MRL) prediction tasks, as well as on a multi-class ncRNA family classification task. However, we anticipate it can be leveraged in many other tasks related to RNA structure and function. Particularly interesting would be the employment of RiNALMo in RNA tertiary structure prediction, where, motivated by the results of ESMFold11 and OmegaFold10, we believe RiNALMo’s embeddings can successfully replace the MSA.
To analyze the interpretability of our model, we visualized pre-trained RiNALMo’s sequence representations by applying t-SNE to the classification token embeddings for RNAs from a secondary structure prediction dataset and an ncRNA functional family classification dataset (see Fig. 2). RNA structures and functions vary across different RNA families, and we expect RiNALMo to have learned these properties during MLM pre-training and to encode them within its RNA sequence representations. We compared the classification token embeddings of RiNALMo and RNA-FM21. Contrary to RNA-FM’s embedding space, in RiNALMo’s embedding space the RNAs are clustered by family with, in general, clean boundaries between clusters. This is especially evident when comparing Fig. 2c and Fig. 2d, which illustrate the embeddings of RNAs from the ncRNA functional family classification dataset. The analysis of the embeddings revealed that RNAs with similar structural and functional properties are grouped together, implying that RiNALMo has learned these properties beyond the primary structure. Furthermore, the t-SNE visualization shows RiNALMo’s ability to cluster and distinguish different RNA families and confirms it can be used in various clustering analyses of RNA sequences. Later in “Results”, RiNALMo’s ability to reason beyond RNA primary structure is additionally backed by its state-of-the-art performance on an ncRNA Rfam family classification downstream task. The t-SNE visualization is important from the RNA structure perspective as well, since RNAs from the same family fold similarly, meaning that structural information is also contained in the embedding space. This was later backed up by RiNALMo’s state-of-the-art performance on the inter-family generalization secondary structure prediction downstream task.
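As an illustration of this analysis, here is a minimal sketch of the visualization step with scikit-learn and matplotlib, assuming the classification token embeddings and family labels have already been extracted from the model (the file names are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical precomputed inputs: one CLS embedding per RNA and an
# integer family label per RNA.
cls_embeddings = np.load("cls_embeddings.npy")  # shape (num_rnas, 1280)
family_labels = np.load("family_labels.npy")    # shape (num_rnas,)

# Project the high-dimensional embeddings onto 2D with t-SNE.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(cls_embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=family_labels, s=5, cmap="tab20")
plt.title("t-SNE of CLS embeddings, colored by RNA family")
plt.show()
```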
Fig. 2: t-SNE visualizations of RNA sequence embeddings outputted by RNA-FM21 and RiNALMo.
RNA-FM (a) and RiNALMo (b) classification token embeddings for the inter-family generalization evaluation dataset. RNA-FM (c) and RiNALMo (d) classification token embeddings for one part of the ncRNA functional family classification task dataset.
Fine-tuning RiNALMo for intra-family secondary structure prediction
When RNAs fold into complex structures, many of their bases pair up and form hydrogen bonds. These pairs are vital for the structure’s stability and function. The base pairs are captured by the secondary structure, which can tell us a lot about an RNA and is often used as an input to tertiary structure prediction tools. An example of a secondary structure can be seen in Fig. 3a.
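For concreteness, secondary structures are commonly written in dot-bracket notation, where matching parentheses denote paired bases; the short parser below (a sketch that ignores pseudoknot bracket types) shows how such a string maps to a set of paired positions:

```python
def dot_bracket_to_pairs(structure: str) -> set:
    """Parse a dot-bracket string into a set of (i, j) base-pair indices.

    '(' opens a pair, ')' closes the most recently opened one, and '.'
    marks an unpaired nucleotide.
    """
    stack, pairs = [], set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs

# A small hairpin: four stem pairs enclosing a four-nucleotide loop.
print(dot_bracket_to_pairs("((((....))))"))
# -> {(3, 8), (2, 9), (1, 10), (0, 11)} (set order may vary)
```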
Fig. 3: Secondary structure prediction.
a RNAs fold into various shapes according to their function and while doing so, many of their nucleotides pair up using a hydrogen bond. These pairings are crucial for structural stability and form structural motifs such as hairpin loops and bulges. b RiNALMo produces nucleotide embeddings for the given RNA sequence. Nucleotide pair embeddings are constructed by applying outer concatenation to RiNALMo’s outputs. Finally, pair representations are fed into the convolutional bottleneck residual neural network (ResNet), which produces base pairing probabilities that are then converted into the final secondary structure prediction. c Precision, recall, and F1 performance of different deep learning models on the TS0 evaluation dataset. d Distribution of F1 scores for predictions of different models on the TS0 dataset (sample size n = 1305). e Precision, recall, and F1 performance of different structure prediction tools on the TestSetB evaluation dataset. f Distribution of F1 scores for predictions of different structure prediction tools on the TestSetB dataset (n = 430). Cfold denotes CONTRAfold and RNAstruct denotes RNAstructure. g A target RNA from the TS0 evaluation dataset and its predictions from different deep learning models. In (c, e), the best result for each metric is shown in bold. In (d, f), box plots show the median (center line), 25th and 75th percentiles (bounds of box), whiskers extending to the smallest and largest values within 1.5× the interquartile range, and individual outliers beyond the whiskers.
The most popular secondary structure prediction tools often rely on thermodynamic models, aiming to identify secondary structures that possess the lowest free energy34. There are also popular probabilistic methods based on statistical learning procedures, such as CONTRAfold35, that act as an alternative to free energy minimization. Several DL methods have been developed as well. They often outperform the thermodynamic models on the RNA families on which they were trained, i.e., on in-distribution data.
We fine-tuned RiNALMo on a simple binary classification task with the binary cross-entropy loss, where we tasked the model to classify each nucleotide pair as either paired or unpaired. The pipeline for determining secondary structures is illustrated in Fig. 3b. We utilized the dataset proposed in ref. 36 and compared our model to RNA-FM21 and to popular DL methods specialized for secondary structure prediction: SPOT-RNA36, UFold37, and MXfold238. The dataset is derived from the bpRNA database39, which compiled secondary structures from seven different sources. Most structures were obtained with comparative sequence analysis, while a smaller portion was extracted from atomic coordinates from the PDB40 using the annotation tool RNAView41. The authors filtered out redundant RNAs by clustering similar sequences, which yielded 13,419 non-redundant secondary structures that were then randomly split into training, validation, and testing datasets denoted TR0, VL0, and TS0, respectively. All models were trained on the same training dataset (TR0) except SPOT-RNA, which was additionally fine-tuned on a smaller dataset derived from the PDB40. As can be seen in Fig. 3c, RiNALMo outperforms the other state-of-the-art DL approaches in terms of precision, recall, and, consequently, F1 score. We provide the F1 score distributions in Fig. 3d. A TS0 target example and the predictions of different DL methods are given in Fig. 3g.
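To make the pipeline in Fig. 3b concrete, here is a minimal PyTorch sketch of the outer-concatenation step and the per-pair binary loss; the ResNet prediction head is abstracted into a placeholder, and the shapes are simplified (no batching):

```python
import torch
import torch.nn.functional as F

def outer_concat(x: torch.Tensor) -> torch.Tensor:
    """Turn per-nucleotide embeddings (L, d) into pair embeddings (L, L, 2d).

    Entry (i, j) is the concatenation of the embeddings of nucleotides
    i and j, giving the pairwise input for the prediction head.
    """
    L, d = x.shape
    rows = x.unsqueeze(1).expand(L, L, d)  # embedding of i, broadcast over j
    cols = x.unsqueeze(0).expand(L, L, d)  # embedding of j, broadcast over i
    return torch.cat([rows, cols], dim=-1)

# `head` stands in for the bottleneck ResNet: (L, L, 2d) -> (L, L) logits.
# `target` is the 0/1 ground-truth pairing matrix of shape (L, L).
def pairing_loss(head, embeddings, target):
    logits = head(outer_concat(embeddings))
    return F.binary_cross_entropy_with_logits(logits, target)
```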
Beyond sequence similarity, it is important to consider structural similarity and to evaluate the ability of structure prediction tools to generalize to RNAs structurally dissimilar to those found in the training datasets. Therefore, RiNALMo was further evaluated using the TrainSetA and TestSetB datasets proposed by Rivas et al.42. TrainSetA consists of RNAs collected from datasets proposed in several different studies35,43,44. To ensure sequence diversity, similar sequences were removed, leaving a total of 3166 RNAs in the dataset. TestSetB contains RNAs from 22 Rfam45 families with known structures not found in TrainSetA, making it structurally dissimilar. RNAs from these families were then filtered by removing similar sequences, resulting in 430 RNAs in TestSetB. The model was first fine-tuned on RNAs from the TrainSetA dataset, with all RNAs longer than 500 nucleotides used for validation. Once trained, the model was evaluated on TestSetB. RiNALMo’s performance on the TestSetB dataset was compared to RNAstructure34, CONTRAfold46, MXfold238, and RNA-FM21. As shown in Fig. 3e, f, RiNALMo outperforms the other tools in terms of F1 score.
RNA language models can generalize well on inter-family structure prediction tasks
While DL methods for secondary structure prediction outperform thermodynamic models on in-distribution data, they are usually unable to generalize well on new RNA families47,48. This is a severe limitation as it hinders the practical usage of such tools.
To test the generalization capabilities of RiNALMo, we utilized the benchmark proposed by ref. 47. The benchmark uses the ArchiveII dataset49,50 of 3865 RNAs from nine families, which was split nine times; in each split, a different family was held out for evaluation while the other eight families were used for training and validation. RiNALMo’s ability to generalize across different RNA families was compared to RNA-FM21, the popular thermodynamics-based tool RNAstructure34, the widely used probabilistic method CONTRAfold35, and two DL approaches specialized for secondary structure prediction, UFold37 and MXfold238. The LMs were fine-tuned and the other two DL models were separately trained on each of the previously described dataset splits and evaluated on the corresponding unseen RNA family. For predicting secondary structures with CONTRAfold, we used the EternaFold parameters46 trained on the EternaBench dataset, a set of more than 20,000 RNAs.
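To spell out the protocol, the evaluation amounts to the following leave-one-family-out loop; `load_archiveii`, `finetune`, and `f1` are hypothetical placeholders for the dataset loader, the fine-tuning routine, and the base-pair F1 metric:

```python
# Hypothetical helpers: load_archiveii() -> {family_name: [records]},
# finetune(records) -> trained model, f1(pred_pairs, true_pairs) -> float.
families = load_archiveii()

for held_out, test_records in families.items():
    # Train and validate on the eight remaining families only.
    train_records = [r for fam, recs in families.items()
                     if fam != held_out for r in recs]
    model = finetune(train_records)
    scores = [f1(model.predict(r.sequence), r.pairs) for r in test_records]
    print(f"{held_out}: mean F1 = {sum(scores) / len(scores):.3f}")
```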
Average F1 scores and F1 score distributions for the different dataset splits are shown in Fig. 4. Fine-tuned RiNALMo demonstrates that it is capable of inter-family generalization, as it outperforms RNAstructure and CONTRAfold on eight out of nine families by large margins, unlike the other DL models. To the best of our knowledge, this is the first paper to show that LMs can generalize well on inter-family secondary structure prediction, mitigating the limitations of other DL methods. We noted, however, that RiNALMo struggles to generalize on telomerase RNAs, although it achieves the highest F1 score on all other families. A visualization of RiNALMo’s sequence embeddings for all RNAs in the dataset is presented in Fig. 2b. One can notice that the telomerase RNAs are clustered together; however, there is no clear boundary between them and the SRP RNAs. We also noticed that telomerase RNAs are the longest in the dataset, on average around 25% longer than the second-longest family. Please refer to Supplementary Table S2 for the mean lengths and standard deviations of the RNA families in the ArchiveII dataset. Interestingly, UFold performs best on telomerase RNA while achieving much worse results on the other families. We are currently unable to explain why RiNALMo fails on telomerase RNAs, but we plan to examine this problem more closely in future work. Technical details and more secondary structure prediction results and examples can be found in the Secondary Structure Prediction section, Supplementary Note 2, and Supplementary Fig. S3.
Fig. 4: Inter-family secondary structure prediction.
a Average F1 scores for secondary structure prediction on the ArchiveII evaluation datasets. The best result for each evaluation dataset in the tables is shown in bold. b Distribution of F1 scores for different methods on the ArchiveII evaluation datasets (sample sizes n: 1283, 918, 557, 462, 454, 74, 67, 35, and 15, respectively). Box plots show the median (center line), 25th and 75th percentiles (bounds of box), whiskers extending to the smallest and largest values within 1.5× the interquartile range, and individual outliers beyond the whiskers. Minimum and maximum values are 0 and 1, respectively.
RiNALMo’s secondary structure prediction generalization capabilities were also compared to the homology-based tools CentroidHomfold51, locARNA52, and CentroidAlifold53. To identify RNA homologs, we used the approach presented in ref. 54. Using Infernal’s cmscan55 tool, we first determine which RNA family the target RNA belongs to. Then, from the seed alignment of the identified family, we randomly select up to 19 RNAs that share between 65% and 95% sequence identity with the target sequence. The identified homologs and the target RNA sequence are forwarded to CentroidHomfold and locARNA. Besides the structure prediction, locARNA also outputs an alignment of the given RNAs, which is then forwarded to CentroidAlifold, since it requires a sequence alignment as input. It is worth noting that locARNA and CentroidAlifold output a consensus structure prediction, so to obtain the structure prediction for the target RNA, the consensus structure is mapped onto the target sequence. To ensure a fair comparison, RNAs for which no homologs were found were excluded from the evaluation. 16S and 23S rRNAs were not included in the comparison because they are split into independent folding domains, making the identification of homologs more challenging than for other families. Except for telomerase RNAs, RiNALMo outperformed or showed comparable performance to the other tools across all families. A detailed comparison of RiNALMo’s performance with the homology-based methods can be found in Fig. 5.
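A sketch of the homolog-selection step under the stated thresholds; the `identity` function here is a rough string-similarity proxy introduced for illustration, whereas the actual protocol derives identity from the family’s seed alignment:

```python
import random
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # Rough proxy; the real pipeline computes identity from the
    # family's seed alignment rather than raw string similarity.
    return SequenceMatcher(None, a, b).ratio()

def select_homologs(target: str, seed_seqs: list, max_n: int = 19,
                    lo: float = 0.65, hi: float = 0.95) -> list:
    """Randomly pick up to 19 seed sequences with 65-95% identity."""
    candidates = [s for s in seed_seqs if lo <= identity(target, s) <= hi]
    random.shuffle(candidates)
    return candidates[:max_n]
```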
Fig. 5: Inter-family secondary structure prediction for RiNALMo and homology-based tools.
a Secondary structure prediction average F1 scores for the ArchiveII evaluation datasets. The numbers in brackets next to each RNA family name represent the count of RNAs for which at least one homolog was identified, followed by the total number of RNAs in that dataset. RNAs for which no homologs were found are ignored. The best result for each evaluation dataset in the tables is shown in bold. b Distribution of secondary structure prediction F1 scores for different tools on the ArchiveII evaluation datasets (sample sizes n: 1278, 738, 510, 456, 442, 37, and 35, respectively). Box plots show the median (center line), 25th and 75th percentiles (bounds of box), whiskers extending to the smallest and largest values within 1.5× the interquartile range, and individual outliers beyond the whiskers. Minimum and maximum values are 0 and 1, respectively.
Fine-tuning RiNALMo for classification tasks
RiNALMo can also be fine-tuned for downstream tasks that determine important functions of the observed RNA. Splice-site prediction and determination of the ncRNA family are two important functional tasks that can be cast as classification tasks.
RNA splicing plays an important role in eukaryotic gene expression, involving the removal of introns from pre-mRNAs and the ligation of exons to form mature mRNAs (see Fig. 6a). Precisely pinpointing splice sites, i.e., the donor and acceptor sites that mark the exon-intron and intron-exon boundaries, respectively, is essential for accurately predicting gene structure and location.
Fig. 6: RNA functional classification tasks.
Splice-site prediction. a A pre-mRNA transcript consists of non-coding regions, i.e., introns, and coding regions, i.e., exons. Introns are located between two exons of a gene. As part of the RNA processing pathway, introns are removed by cleavage at splice sites. These sites are found at the 5′ and 3′ ends of introns, known as donor and acceptor splice sites, respectively. Most frequently, the 5′ end of an intron begins with the dinucleotide GU, and the 3′ end of an intron ends with AG. b An input to RiNALMo is a 400-nucleotide-long RNA sequence from the GS_1 dataset. We utilize only the CLS embedding, which then passes through a two-layer MLP classification head. The output layer indicates whether a sequence contains a donor/acceptor site or not. c Classification F1 score for splice-site prediction. Here, we report the average value of the donor and acceptor prediction results. ncRNA family classification. d Given an RNA sequence, the goal is to classify its ncRNA family. The procedure is similar to that in (b): original and noisy RNA sequences from the Rfam dataset are input to RiNALMo. We utilize only the CLS embedding, which then passes through a two-layer MLP classification head. The output layer determines which of the 88 Rfam families the input ncRNA belongs to. e ncRNA family classification accuracy for noiseless and noisy input sequences, and the average accuracy. In (c and e), FT denotes whether we fine-tuned the model ourselves or cite results directly from the original papers that used the same train/test split. The best result for each evaluation dataset is shown in bold.
Identifying the splice sites can be cast as a binary sequence-level classification task. A widely used dataset of positive and negative subsets of splice-site sequences was proposed in ref. 56. The dataset was constructed by randomly selecting sequences from the exon/intron regions of the G3PO+ genomic sequences57. The dataset consists of error-free splice-site sequences from a diverse set of 148 eukaryotic organisms, including humans. The test dataset consists of four different species not seen in the training dataset.
We separately fine-tuned the model, first for donor and then for acceptor splice-site prediction. The splice-site prediction pipeline using RiNALMo embeddings is illustrated in Fig. 6b. Finally, we compared our model’s performance with the other RNA LMs, RNA-FM21 and Uni-RNA22, and several established methods such as Spliceator56 and SpliceBERT26. We present the results in Fig. 6c. Separate results for donor and acceptor splice-site prediction can be found in Supplementary Tables S6 and S7, respectively. Figure 6c reports the average value of the donor and acceptor prediction results. Where models were publicly available and straightforward to fine-tune, we fine-tuned them ourselves using the same prediction head. Our fine-tuned model outperforms all other models, showing its powerful generalization properties. Notice that RiNALMo even outperforms SpliceBERT, an LM pre-trained exclusively on pre-mRNA sequences. More details on the splice-site prediction task and the model’s hyperparameters can be found in the Multi-Species Splice-Site Prediction section and Supplementary Note 4.
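A minimal sketch of the sequence-level classification head described above, assuming the CLS token occupies the first position of the embedding matrix; the hidden width is an arbitrary placeholder, not the value used in our experiments:

```python
import torch
import torch.nn as nn

class CLSClassificationHead(nn.Module):
    """Two-layer MLP over the CLS embedding for sequence-level labels."""

    def __init__(self, embed_dim: int = 1280, hidden: int = 128,
                 num_classes: int = 1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # 1 logit for donor/acceptor
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        cls = token_embeddings[:, 0]  # CLS token is the first position
        return self.mlp(cls)

# For splice sites:   head = CLSClassificationHead(num_classes=1)
# For ncRNA families: head = CLSClassificationHead(num_classes=88)
```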
Non-coding RNAs are RNA molecules that play vital regulatory roles in a wide range of biological processes. Among the most abundant and functionally significant ncRNAs are transfer RNAs and ribosomal RNAs (rRNAs), both of which play an important part in protein synthesis; microRNAs, which are essential in regulating gene expression; and small nuclear RNAs, which are involved in the processing and splicing of pre-mRNA.
Determining the family of an ncRNA is a multiclass classification task. We utilized RiNALMo to predict short non-coding RNA functional families from Rfam58 using the sequence as input. The dataset and data preprocessing were adopted from ref. 59. After preprocessing, the dataset comprised ncRNAs shorter than 200 nucleotides arranged in 88 different Rfam families. We assessed the classification performance of RiNALMo on the original ncRNA sequences, an experiment we denote as 0% boundary noise, and on sequences with random nucleotides, equivalent to 100% of the sequence length, added at both ends of the original sequence, a setup we denote as 200% boundary noise. The random parts maintained the same single-nucleotide and di-nucleotide frequencies as the original sequence. This way, we introduced uncertainty about where the ncRNA sequence starts and ends. Adding random nucleotides to the original ncRNA sequences was also adopted from ref. 59.
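One way to realize such boundary noise is to sample the padding from a first-order Markov chain fit on the original sequence, which approximately preserves its mono- and di-nucleotide frequencies; this sketch is our interpretation of the augmentation, not necessarily the exact procedure of ref. 59:

```python
import random
from collections import defaultdict

def add_boundary_noise(seq: str, frac: float = 1.0) -> str:
    """Pad both ends of `seq` with random nucleotides whose mono- and
    di-nucleotide statistics approximate those of `seq`. With frac=1.0
    at each end, this corresponds to the 200% boundary-noise setting."""
    # First-order transition table estimated from the dinucleotides of seq.
    trans = defaultdict(list)
    for a, b in zip(seq, seq[1:]):
        trans[a].append(b)

    def sample(n: int) -> str:
        if n == 0:
            return ""
        out = [random.choice(seq)]  # start from the mononucleotide frequencies
        while len(out) < n:
            nxt = trans[out[-1]]
            out.append(random.choice(nxt) if nxt else random.choice(seq))
        return "".join(out)

    n = int(len(seq) * frac)
    return sample(n) + seq + sample(n)

print(add_boundary_noise("GGGAAACCC"))  # original sequence with noisy ends
```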