Prediction and analysis of anti-aging peptides using data augmentation and machine learning algorithms

Background

Aging is a complex physiological process mediated by multiple biological and genetic pathways. It is directly associated with lifespan and serves as a major driving force behind all age-related diseases [1, 2]. Disorders arising from aging, such as cancer, cardiovascular diseases, neurodegenerative conditions, and metabolic dysfunctions, not only significantly contribute to increased mortality among the elderly but also pose substantial challenges to global healthcare systems [3,4,5,6,7]. Therefore, developing effective strategies to design anti-aging interventions is critical for extending human lifespan.

In recent years, a variety of anti-aging strategies have been developed to target aging-related pathways [8]. Among them, anti-aging peptides (AAPs) have attracted considerable attention as a promising class of bioactive molecules for therapeutic intervention [9,10,11]. AAPs are collectively defined as peptides that can improve aging markers (such as SA-β-gal, inflammatory factors, and telomere length), extend the lifespan of model organisms, and regulate classical aging pathways. These peptides can participate in intercellular communication, regulate enzymatic activities (e.g., collagenases), facilitate the transport and delivery of biomolecules, and inhibit the production of neurotransmitters such as acetylcholine, thereby contributing to the mitigation of aging-related processes [12,13,14]. Given their biological significance, AAPs hold great potential as novel therapeutic agents in anti-aging medicine. Consequently, the accurate identification and discovery of AAPs is of vital importance. However, because the relevant data are fragmented and scarce, few studies have focused on the computational identification of AAPs, although some researchers have conducted aging-related analyses [15, 16]. To date, no dedicated predictor exists to distinguish anti-aging peptides from non-anti-aging ones. Thus, building a comprehensive dataset and developing high-performance predictive models based on such data remains a significant challenge.

To address these challenges, we constructed a benchmark dataset of AAPs based on their annotated biological functions and the AgingBase database (Table S1) and developed a novel predictive model, Antiaging-FL, which leverages machine learning techniques to accurately predict anti-aging peptides [17]. Experimental results demonstrated that Antiaging-FL achieved robust performance on both the AAP400 dataset and an independent test set. Furthermore, because the small sample size may lead to model overfitting and poor generalization, we employed two distinct data augmentation strategies. The predictive models built on the resulting augmented datasets also performed well on both datasets. Accordingly, we retained the three best-performing models for public use and developed a user-friendly web server, Antiaging-FL, accessible at http://124.222.115.164/# or http://www.stay67.com. We hope this tool will facilitate further research in this domain. An overview of the dataset construction and predictive model development is illustrated in Fig. 1.

Fig. 1

The dataset construction and predictive framework for anti-aging peptides. Peptide sequences were collected from public databases, including AagingBase, peptideDB and Uniprot, to construct the benchmark dataset; more details of the procedures are provided in the Methods section. Two data augmentation strategies were employed: one based on a generative adversarial network (GAN) and another based on amino acid conservative substitution. After obtaining the augmented datasets, predictive models were constructed using the original AAP400 dataset and the two augmented datasets. Specifically, a feature representation learning strategy was proposed, where features extracted from 40 manually designed descriptors were used as inputs for building machine learning models. In addition, machine learning models were developed using GAN-augmented datasets combined with ESM-derived features, and deep learning models were constructed based on datasets generated through the amino acid conservative substitution strategy. Further details on dataset construction, data augmentation, feature extraction, feature selection, and model development can be found in the Methods section

Results

Analysis of anti-aging peptides using amino acids composition and machine learning

The AAP400 dataset was built from the peptides' annotated biological functions and the AgingBase database (Fig. 2A). The analysis of the lengths of anti-aging peptide sequences (Fig. 2B) reveals that the log2 values of the peptide sequence lengths for the positive and negative datasets are 3.0 and 4.8, respectively, with maximum lengths of 80 and 50, respectively. We utilized the number of amino acids in the peptide sequences as an indicator to ensure the acquisition of high-quality data. Subsequently, to discern whether certain types of residues are preferred in anti-aging peptides, we compared the amino acid residues present in anti-aging peptides and non-anti-aging peptides, as illustrated in Fig. 2C. The data indicate that the content of certain amino acids differs significantly between the two groups. In anti-aging peptides, the content of R (Arginine), G (Glycine), P (Proline), and K (Lysine) is notably higher compared to the remaining amino acids. In non-anti-aging peptides, the content of N (Asparagine), I (Isoleucine), K (Lysine), S (Serine), and L (Leucine) is significantly higher than that of other amino acids. Furthermore, the content of R (Arginine), G (Glycine), and P (Proline) in anti-aging peptides is significantly higher than the corresponding amino acid content in non-anti-aging peptides, with a difference of 0.059 for R (Arginine). The content of N (Asparagine) and I (Isoleucine) is significantly lower in anti-aging peptides compared to the corresponding amino acid content in non-anti-aging peptides, with a difference of 0.076 for I (Isoleucine). These findings suggest that amino acids whose content differs significantly, such as R (Arginine) and I (Isoleucine), may play a crucial role in distinguishing between anti-aging peptides and non-anti-aging peptides. We employed machine learning methods to validate this hypothesis (Fig. 2D).
By using the content of the 20 types of amino acids in the peptide sequences as features to construct a model and outputting the feature importance scores, we found that the top five features were I (Isoleucine), N (Asparagine), Y (Tyrosine), L (Leucine), and R (Arginine), which is largely consistent with the aforementioned analysis results. Through these analyses, we have demonstrated that machine learning methods can effectively assist in analyzing the importance of individual amino acids in peptide sequences.
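The composition comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes that composition is computed per peptide and then averaged within each class, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence: str) -> dict[str, float]:
    """Return the fraction of each standard amino acid in one peptide."""
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def mean_composition(peptides: list[str]) -> dict[str, float]:
    """Average the per-peptide composition over a set of peptides."""
    vectors = [aac(p) for p in peptides]
    return {aa: sum(v[aa] for v in vectors) / len(vectors) for aa in AMINO_ACIDS}


def composition_difference(positives: list[str], negatives: list[str]) -> dict[str, float]:
    """Positive-minus-negative mean composition, one value per residue."""
    pos = mean_composition(positives)
    neg = mean_composition(negatives)
    return {aa: pos[aa] - neg[aa] for aa in AMINO_ACIDS}


# Toy sequences invented for illustration; they are not taken from AAP400.
toy_positive = ["GRGPK", "RRGP"]
toy_negative = ["NILSK", "LLIN"]
diff = composition_difference(toy_positive, toy_negative)
```

A positive entry in `diff` marks a residue that is enriched in the positive class, which is the kind of signal reported for R and I in Fig. 2C. The same 20-dimensional vectors could be passed to any classifier that exposes feature importance scores.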

Fig. 2

The sequence analysis of the AAP400 dataset. A The negative dataset comprised three categories of peptides, including 1) inflammation-related peptides, which are known to promote aging; 2) bioactive peptide precursors; and 3) cytotoxic peptides. The latter two categories have not been reported in the literature to be associated with aging and were therefore considered unrelated to aging in this study. B The analysis of the lengths of anti-aging peptide sequences (Additional file 1: Table S2). C Amino acid compositional comparison of anti-aging peptides and non-anti-aging peptides. D Machine learning methods assist in analyzing the importance of individual amino acids in the sequences

Comparative analysis of sequence-based feature descriptors using different classifiers

In this study, we employed nine sequence-based feature encoding algorithms including AAC, DPC, TPC, CTD, CTriad, CKSAAP, PSSM, DDE, and PAAC, to extract informative features from peptide sequences [18,19,20,21,22,23,24,25]. To ensure diversity, effectiveness, and uniqueness in feature extraction, we applied several parameter settings: the parameter for AAC was set to 1 or 2; for CTriad, the gap parameter was varied from 0 to 5; and for CKSAAP, the gap parameter was set from 2 to 6. Additionally, both frequency and occurrence counts were considered in EAAC, with its gap parameter ranging from 5 to 10. We also combined AAC with DPC and TPC to enrich the feature representation. Notably, EGAAC and GAAC were included as derived variants of AAC. In total, we generated 40 feature descriptors, as summarized in Table 1.
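To make the role of the gap parameter concrete, the sketch below computes k-spaced amino acid pair frequencies for a single gap value. It assumes the common definition in which a gap of k means k residues lie between the two members of a pair, so that k = 0 reduces to the dipeptide composition (DPC). The function name is hypothetical, and the authors' tooling may aggregate several gap values differently.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def k_spaced_pairs(sequence: str, k: int) -> dict[str, float]:
    """Frequency of residue pairs separated by exactly k intervening residues.

    k = 0 gives the ordinary dipeptide composition (DPC).
    """
    seq = sequence.upper()
    counts = dict.fromkeys(PAIRS, 0)
    n_windows = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:  # skip windows containing non-standard residues
            counts[pair] += 1
            n_windows += 1
    if n_windows == 0:
        # Peptide too short for this gap: return an all-zero vector.
        return {p: 0.0 for p in PAIRS}
    return {p: c / n_windows for p, c in counts.items()}


features = k_spaced_pairs("GRGPKGR", k=2)
```

Each gap value yields a 400-dimensional vector, which is why very short peptides produce sparse or empty encodings at larger gaps.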

Using these 40 descriptors, Support Vector Machine (SVM) classifiers were trained on the AAP400 dataset and their performance was evaluated via fivefold cross-validation. As illustrated in Fig. 3A and B, the prediction accuracy of SVM models constructed with individual descriptors ranged from 0.8275 (GAAC) to 0.945 (AAC + DPC). The Matthews correlation coefficient (MCC), which provides a comprehensive measure of model performance, ranged from 0.66 (GAAC) to 0.89 (AAC + DPC). These results indicate that the model based on the combined AAC and DPC descriptors demonstrated superior discriminative capability in distinguishing anti-aging peptides from non-anti-aging peptides. From a biological perspective, the AAC and DPC features capture the global amino acid composition and local dipeptide sequence patterns, respectively. The AAC descriptor reflects the physicochemical properties and overall functional tendencies of the peptide, whereas DPC reveals sequence patterns linked to structural conformation and specific biological activity. These biologically relevant features contributed to the improved predictive performance of the model. Other performance metrics, including sensitivity (0.955), specificity (0.935), accuracy (0.945), and F1-score (0.946), further validated the robustness of the AAC + DPC-based model (Table 1). In addition, excellent performance was observed for models based on the DDE descriptor (accuracy = 0.933; MCC = 0.866) and CKSAAP with k = 2 (accuracy = 0.930; MCC = 0.865).
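For reference, the metrics quoted in this section all follow from the four cells of a binary confusion matrix. The sketch below shows the standard formulas. The counts in the example are hypothetical: they assume a balanced set of 200 positives and 200 negatives and were chosen to reproduce the sensitivity and specificity quoted above, since the confusion matrix itself is not reported here.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for a balanced 200/200 split (an assumption, see above).
m = binary_metrics(tp=191, fn=9, tn=187, fp=13)
```

With these assumed counts, accuracy is 0.945 and MCC is about 0.89. On a balanced set, accuracy is simply the mean of sensitivity and specificity, which is a quick consistency check for reported values.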

Fig. 3

Comparison of different feature extraction methods. A-B The performance of 40 handcrafted feature descriptors was evaluated based on accuracy (ACC) and Matthews correlation coefficient (MCC) (Additional file 1: Table S3). C Structural features of peptides were extracted and used to construct predictive models with five different machine learning algorithms. Model performance was comprehensively assessed using metrics such as sensitivity, specificity, accuracy, and MCC (Additional file 1: Table S4). D High-dimensional features were extracted using the ESM-1b model (650 million parameters), which captures evolutionary information, structural properties, and contextual semantics from peptide sequences. These features were then used for model construction and performance evaluation (Additional file 1: Table S4)

Comparative analysis of structure-based feature descriptors using different classifiers

In this study, we employed seven structure-based feature extraction methods, including RSA, ASA, Q3, Q8, phi, psi, and disorder, to capture the structural features of peptides [26,27,28,29]. These features collectively provide a comprehensive assessment of peptide conformation, secondary structure, and residue-level interactions. The extracted structural information was subsequently used for predictive model construction. During data preprocessing, the structural features of each peptide were integrated into a 16-dimensional vector representing a single peptide sequence. Five classical machine learning algorithms were then applied to construct prediction models, including Support Vector Machine (SVM) [30], Random Forest (RF) [31], Logistic Regression (LR) [32], Multi-Layer Perceptron (MLP) [33], and XGBoost [34]. As shown by the experimental results (Fig. 3C), the XGBoost model outperformed the others in terms of Matthews correlation coefficient (MCC), achieving the highest value of 0.838, while the MLP model yielded the lowest MCC of 0.564. Interestingly, the LR model achieved the highest accuracy of 0.895, surpassing XGBoost (0.868), MLP (0.705), RF (0.858), and SVM (0.716). However, considering other key performance indicators such as specificity and sensitivity, the XGBoost model demonstrated more balanced and superior overall performance than the LR model. It is important to note that some anti-aging peptides in our dataset consisted of fewer than 10 amino acids, which hindered accurate prediction of their three-dimensional structures and, consequently, compromised the reliability of the derived structural features [28]. To maintain data quality, we excluded these unreliable structural samples and proceeded with model training using only the remaining high-confidence data.
This decision introduced several limitations: 1) the reduced dataset increased the risk of model overfitting; 2) the structural features, derived from a partial dataset, could not be aligned with the full set of sequence-based features; and, more critically, 3) the performance of the structure-based models was consistently and significantly inferior to that of the models based on sequence features across all evaluation metrics. Therefore, in subsequent analyses, we discontinued the exploration of structure-based features and focused exclusively on models constructed from sequence-derived features.

The analysis of feature descriptors based on ESM model

Evolutionary Scale Modeling (ESM) is an artificial intelligence-based protein modeling framework designed to learn and generate protein sequences, structures, and functions [35]. Numerous studies have demonstrated the superior performance of ESM in protein feature extraction. In this study, we employed the ESM-1b model to extract peptide features from the AAP400 dataset. The resulting feature representations were subsequently used as input for predictive model construction based on the Support Vector Machine (SVM) algorithm. As shown in our results (Fig. 3D), the SVM model achieved an accuracy of 0.835 and a Matthews correlation coefficient (MCC) of 0.676, which were lower than those obtained using other feature descriptors, such as the DDE descriptor, which yielded an accuracy of 0.900 and an MCC of 0.810. We further compared the performance of five different machine learning algorithms using the ESM-derived features. As illustrated in Fig. S1, the differences in predictive performance among these models were relatively minor.

Feature selection analysis

We employed a total of 41 feature extraction methods, comprising 40 handcrafted descriptors and one representation derived from the ESM model, to independently construct predictive models. Each model produced a binary classification label (1 for anti-aging, 0 for non-anti-aging) when applied to the AAP400 dataset. The outputs of these 41 models were concatenated to form a new 41-dimensional feature vector for each sample. This vector served as a novel representation of the peptide and was subsequently used to construct a final classification model. This process effectively created a meta-feature space based on prediction outputs, capturing the diverse decision-making behaviors of different feature descriptors with respect to the classification task. To further optimize model performance, we applied the Recursive Feature Elimination (RFE) algorithm, a greedy search method, to select an optimal subset of features from the meta-representation [36]. At this stage, we developed five widely used machine learning models, including Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), Logistic Regression (LR), and XGBoost. In the LR model, the optimal feature subset was determined to include four descriptors from our proposed feature representation learning system: CKSAAP (k = 2), CTriad (k = 4), DDE, and AAC + TPC (Fig. 4A). For the RF model, the optimal number of selected features was k = 31 (Fig. 4B). The optimal feature subset dimensionalities for the SVM (Fig. 4C), XGBoost (Fig. 4D), and KNN (Fig. 4E) models were determined to be k = 10, k = 8, and k = 5, respectively (Fig. 4F).
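The two steps described above, stacking per-descriptor predictions into a meta-feature vector and then pruning it by recursive elimination, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the three base-model outputs are invented, and `centroid_weights` is a deliberately simple stand-in for the coefficients or importances that a fitted LR, SVM, or tree model would supply in real RFE.

```python
from typing import Callable, Sequence

Matrix = list[list[int]]
WeightFn = Callable[[Matrix, Sequence[int]], list[float]]


def stack_predictions(per_model_labels: Sequence[Sequence[int]]) -> Matrix:
    """Turn M lists of N predicted labels into N rows of M meta-features."""
    return [list(row) for row in zip(*per_model_labels)]


def centroid_weights(X: Matrix, y: Sequence[int]) -> list[float]:
    """Stand-in for model coefficients: class-mean difference per column."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(len(X[0]))
    ]


def rfe(X: Matrix, y: Sequence[int], n_keep: int,
        fit_weights: WeightFn = centroid_weights) -> list[int]:
    """Recursive feature elimination: refit, drop the weakest column, repeat."""
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        sub = [[row[j] for j in kept] for row in X]
        weights = fit_weights(sub, y)
        weakest = min(range(len(kept)), key=lambda i: abs(weights[i]))
        del kept[weakest]
    return kept  # indices of the surviving meta-features


# Hypothetical outputs of three descriptor-specific models on six peptides.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0]  # agrees with every label
model_b = [0, 0, 0, 0, 0, 0]  # uninformative
model_c = [1, 1, 0, 0, 0, 1]  # partly informative
X_meta = stack_predictions([model_a, model_b, model_c])
selected = rfe(X_meta, y_true, n_keep=2)
```

In practice the subset size would be chosen by cross-validated performance, as in Fig. 4, rather than fixed in advance. To avoid label leakage, the base-model predictions used as meta-features should come from held-out folds.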

Fig. 4

Feature representation learning strategy and feature selection (Additional file 1: Table S5). A-E The optimal feature subsets for each model were determined using the recursive feature elimination (RFE) algorithm. F Visualization of the dimensionality of the optimal feature subsets selected for the predictive models

Classifier comparative analysis based on the optimal features

We compared the performance of machine learning models constructed using their respective optimal feature subsets. As shown in Fig. 5A, the XGBoost model achieved an accuracy of 0.963, while the LR, SVM, RF, and KNN models all attained an accuracy of 0.975. Another key performance metric, the area under the ROC curve (AUC), indicated that the XGBoost model reached a score of 0.98, whereas the KNN model had the lowest AUC at 0.96. The remaining models each achieved an AUC of 0.97. Taking into account the dimensionality of the optimal feature subsets in each model, we determined that the Logistic Regression (LR) model exhibited the best overall performance.
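As a reminder of what the AUC values compared above measure, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The function name and the example scores are hypothetical.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented scores for four peptides (two positives, two negatives).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Because AUC depends only on the ranking of scores, two models with the same accuracy at a fixed threshold can still differ in AUC, which is why both metrics are reported.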

Fig. 5

Performance of the optimal feature descriptor (Additional file 1: Table S5). A Comparison of the predictive performance of the optimal feature vector across five different machine learning algorithms. B In the logistic regression (LR) model, the optimal feature subset was derived from four descriptors within the feature representation learning framework: DDE, AAC + TPC, CTriad (k = 4), and CKSAAP (k = 2). The importance scores of these selected features were also summarized. C Summary of the optimal feature vectors identified in the five predictive models. D Overlapping features shared among the optimal feature subsets of the five models

The optimal feature subset in the LR model was composed of four descriptors derived from the proposed feature representation learning system: DDE, AAC + TPC, CTriad (k = 4), and CKSAAP (k = 2). Feature importance analysis revealed that the features derived from the DDE descriptor contributed more significantly to the model than those from the other descriptors (Fig. 5B). Notably, analysis of the optimal feature subsets across the five machine learning models indicated that CKSAAP (k = 2) was included in the selected subsets of four models, namely LR, KNN, RF, and XGBoost (Fig. 5C). Additionally, the DDE descriptor was found to be a common component of the optimal feature subset in all classification models (Fig. 5D), highlighting its critical role in the identification of anti-aging peptides. The DDE descriptor characterizes sequence-level preferences by quantifying the deviation between observed and expected frequencies of dipeptides in peptide sequences, based on a reference background database. This allows for the identification of potential functional associations within the peptide sequence.
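The deviation that DDE quantifies can be made explicit with a short sketch. It follows the widely used formulation of dipeptide deviation from expected mean, in which the expected dipeptide frequency is derived from the number of codons encoding each residue. Whether the authors used exactly this background model is an assumption, and the function name is hypothetical.

```python
import math
from itertools import product

# Number of codons per amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2,
    "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4,
    "W": 1, "Y": 2,
}
N_SENSE = sum(CODONS.values())  # 61 sense codons


def dde(sequence: str) -> dict[str, float]:
    """Standardized deviation of observed from expected dipeptide frequency."""
    seq = sequence.upper()
    n = len(seq) - 1  # number of dipeptide positions
    if n < 1:
        raise ValueError("sequence must contain at least two residues")
    counts: dict[str, int] = {}
    for i in range(n):
        dipeptide = seq[i:i + 2]
        counts[dipeptide] = counts.get(dipeptide, 0) + 1
    result = {}
    for a, b in product(CODONS, repeat=2):
        observed = counts.get(a + b, 0) / n
        expected = (CODONS[a] / N_SENSE) * (CODONS[b] / N_SENSE)
        variance = expected * (1 - expected) / n
        result[a + b] = (observed - expected) / math.sqrt(variance)
    return result


scores = dde("GRGPK")
```

Dipeptides that occur more often than the codon-based expectation receive positive scores, and absent dipeptides receive small negative scores, so the 400-dimensional vector highlights over-represented residue pairs.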

Comparison of data augmentation results with different methods and proportions

Two data augmentation strategies were proposed to expand the dataset and mitigate the risk of model overfitting due to limited training data. The first strategy involved the use of a WGAN-GP model, a variant of the Generative Adversarial Network (GAN), which incorporates a gradient penalty to improve training stability and model performance [37, 38] (Fig. 6A). During training, after approximately 10,000 iterations, three key indicators, including Discriminator Loss, Generator Loss, and Wasserstein Distance, converged to near-zero values (negative for the discriminator, positive for the generator and Wasserstein distance), indicating that the model had reached a stable state (Fig. 6B). The augmented data were subsequently used to construct predictive models. The results showed that the Logistic Regression (LR), XGBoost, Multi-Layer Perceptron (MLP), and Random Forest (RF) models all achieved accuracies above 0.9, specifically 0.929, 0.946, 0.946, and 0.937, respectively (Fig. 6C). In terms of MCC, the values were 0.862 for LR, 0.898 for XGBoost, 0.895 for MLP, and 0.877 for RF. Compared to models trained without data augmentation, the average accuracy and MCC across the five models improved by 0.085 and 0.166, respectively. Among all models, XGBoost exhibited the most favorable overall performance based on a comprehensive comparison of sensitivity, specificity, accuracy, MCC, recall, and F1-score (Fig. 6D).
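For readers unfamiliar with WGAN-GP, the three monitored quantities correspond to the standard objective written out below. The notation is an assumption introduced here for illustration: D is the critic (discriminator), P_r and P_g are the real and generated distributions, and λ is the penalty weight, whose value in this study is not stated in this section.

```latex
% Critic (discriminator) loss with gradient penalty
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]
    - \mathbb{E}_{x \sim P_r}\left[D(x)\right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]

% Generator loss
L_G = -\mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]

% Interpolated samples on which the penalty is evaluated
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \qquad \epsilon \sim U[0, 1]

% Critic's estimate of the Wasserstein distance
W \approx \mathbb{E}_{x \sim P_r}\left[D(x)\right]
        - \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right]
```

Under this convention the critic loss equals the negative of the Wasserstein estimate plus the penalty term, which is consistent with a critic loss that approaches zero from below while the Wasserstein estimate approaches zero from above.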

Fig. 6

Performance comparison of predictive models based on GAN-augmented data. A The overall architecture and training workflow of the Wasserstein GAN with Gradient Penalty (WGAN-GP). B Training loss curves of WGAN-GP, including the discriminator loss, generator loss, and the Wasserstein distance, which quantifies the divergence between the distribution of generated samples and that of real data. C Predictive models were constructed using GAN-augmented datasets. The performance of five machine learning algorithms was evaluated and compared based on metrics including sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC) (Additional file 1: Table S5). D Visualization of the XGBoost model, which achieved the best overall performance among the evaluated models (Additional file 1: Table S6)

In accordance with the strategies described in References [39,40,41], we also implemented a second data augmentation approach based on conservative amino acid substitutions. In this method, amino acids were grouped by physicochemical properties, such as acidity and polarity, enabling biologically equivalent replacements in peptide sequences (Fig. 7A). Unlike other augmentation strategies, this method avoids insertions or deletions and preserves the overall biochemical characteristics of the sequence [39]. Additionally, it significantly altered the latent distribution of the data during deep learning modeling, as illustrated in Fig. S2. Subsequently, we constructed predictive models using both the handcrafted feature descriptors and ESM-based representations at various augmentation levels (Fig. 7B). When using the 40 handcrafted descriptors, model performance improved most noticeably at the 3 × augmentation level. As the dataset size increased further, model performance plateaued, reaching optimal values at the 9 × augmentation level, with average accuracy and MCC reaching 0.973 and 0.947, respectively. In models built using ESM-derived features, the SVM model’s accuracy increased from 0.835 to 0.914, and its MCC from 0.676 to 0.830, as the size of the augmented dataset increased. Other metrics also showed notable improvements: sensitivity rose from 0.900 to 0.944, specificity from 0.770 to 0.884, and F1-score from 0.849 to 0.916. We further compared the performance of five machine learning models including SVM, MLP, RF, XGBoost, and LR, across varying levels of data augmentation (Fig. 7B). At the 1 × augmentation level, the LR model demonstrated the best performance, with an accuracy of 0.914 and an MCC of 0.833, outperforming the other four models. However, as the degree of data augmentation increased, XGBoost began to outperform the others, achieving the highest accuracy and MCC of 0.993 and 0.986, respectively, at the 9 × augmentation level. 
Nevertheless, it is important to note that when using the conservatively substituted peptide dataset for traditional machine learning modeling, there remains a risk of model overfitting.
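The substitution-based augmentation can be sketched as follows. The residue groups below are a hypothetical grouping by broad physicochemical class and are not the rules of Fig. 7A. The function names, the number of substitutions per variant, and the fixed seed are likewise assumptions made for illustration.

```python
import random

# Hypothetical conservative groups; residues outside them (G, P, C) are left unchanged.
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE"]
SUBSTITUTES = {aa: [x for x in group if x != aa] for group in GROUPS for aa in group}


def conservative_variant(sequence: str, n_sub: int, rng: random.Random) -> str:
    """Replace up to n_sub residues with another member of the same group."""
    seq = list(sequence.upper())
    candidates = [i for i, aa in enumerate(seq) if aa in SUBSTITUTES]
    for i in rng.sample(candidates, min(n_sub, len(candidates))):
        seq[i] = rng.choice(SUBSTITUTES[seq[i]])
    return "".join(seq)


def augment(sequence: str, n_variants: int, n_sub: int = 1,
            seed: int = 0, max_tries: int = 1000) -> list[str]:
    """Generate up to n_variants distinct same-length variants of one peptide."""
    rng = random.Random(seed)
    original = sequence.upper()
    variants: set[str] = set()
    tries = 0
    while len(variants) < n_variants and tries < max_tries:
        candidate = conservative_variant(original, n_sub, rng)
        if candidate != original:
            variants.add(candidate)
        tries += 1
    return sorted(variants)


variants = augment("GRGPKLD", n_variants=3)
```

Because each variant differs from its parent by only a few residues, variants of the same peptide should be kept in the same cross-validation fold. Otherwise near-duplicates in the training and validation sets can inflate the measured performance, which is one way the overfitting risk noted above can arise.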

Fig. 7

Performance comparison of predictive models based on amino acid conservative substitution data augmentation. A Rules of amino acid conservative substitution. New peptide sequences were generated by substituting amino acids in the original sequences according to conservation rules, thereby achieving data augmentation. B The performance of predictive models constructed under different levels of augmentation was compared. Additionally, models using ESM-based feature extraction were compared with those using handcrafted features derived from 40 traditional descriptors (Additional file 1: Table S7)

Performance analysis of classifier constructed based on deep learning

Deep learning models typically require large-scale datasets for effective training [42, 43]. Therefore, we utilized the 10 × augmented dataset as input to train a deep learning model. The model architecture consisted of an embedding layer based on ESM-derived sequence features, followed by two convolutional neural network (CNN) layers and a multi-head attention mechanism [44]. During model training, the number of epochs was set to 5000, and dropout as well as L2 regularization were applied to mitigate overfitting. According to the results of fivefold cross-validation, the accuracy curve exhibited a rapid increase in the early training stages and began to stabilize after approximately 2000 epochs. Eventually, the validation accuracy converged to above 90%, indicating successful model convergence (Fig. 8A-E). The final trained model achieved an AUC of 0.96, demonstrating strong predictive performance (Fig. 8F). Additionally, we explored an alternative embedding strategy by integrating features derived from DDE, AAC_TPC, CKSAAP (k = 2), and CTriad (k = 4). A deep learning model was constructed based on this fused feature embedding. However, the experimental results indicated that the final model’s accuracy remained below 80%, with an AUC of 0.90 (Fig. S3), which was inferior to the performance of the model using ESM-based representations.

Fig. 8

Construction of deep learning models based on data augmentation. A-E Validation accuracy across training epochs for each fold during cross-validation. The accuracy tends to stabilize after approximately 2000 epochs, ultimately converging to a validation accuracy above 90% in all folds. F The overall receiver operating characteristic (ROC) curve, which evaluates the model’s classification performance across various thresholds. The area under the curve (AUC) is 0.96, indicating excellent overall discriminative ability of the model

Prediction performance comparison with other models

In this study, we compared the performance of the optimal feature vector-based model with four individual feature descriptor-based models, as well as its predictive performance against the ESM_9x, ESM_GAN, and CNN_9x models on an independent test dataset. The results of fivefold cross-validation on both the AAP400 dataset and the independent test set are summarized in Table S8. As shown (Fig. 9A), the optimal feature vector model outperformed all four descriptor-based models across all performance metrics, including accuracy (ACC), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), recall, and F1-score. Specifically, the optimal model achieved the highest ACC and MCC values of 99.7% and 0.995, respectively, which were significantly higher than those of CKSAAP (k = 2) (ACC = 88%, MCC = 0.77), CTriad (k = 4) (ACC = 84%, MCC = 0.68), AAC + TPC (ACC = 86%, MCC = 0.72), and DDE (ACC = 90%, MCC = 0.81). As illustrated in Fig. 9A, the area under the ROC curve (AUC), a commonly used metric for evaluating overall classification performance, was also higher for the optimal feature vector model, indicating superior predictive ability. This suggests that the optimal features provide more accurate discrimination between true anti-aging and non-anti-aging peptides than their original descriptors. This improvement may be attributed to the feature representation learning framework, which integrates the outputs of multiple heterogeneous models to reconstruct features, thus implicitly achieving feature fusion and enhancement. Similarly, results on the independent test set confirmed that the optimal feature descriptor yielded better performance than the individual descriptors, achieving an AUC of 0.99 (Fig. 9B). Importantly, the dimensionality of the optimal feature vector was only 4, significantly lower than that of the original descriptors (Fig. 9C). 
Additionally, as shown in Table S10, the learned 4-dimensional feature vector exhibited superior performance across five out of six major evaluation metrics, including SE, MCC, recall, F1-score, and ACC, compared to the original descriptors, with SP being the only exception.

Fig. 9

Performance and comparison of prediction models on AAP400 and independent test datasets (Additional file 1: Table S8). A Comparing the performance of the optimal feature vector with the original four descriptors using fivefold cross-validation on the AAP400 dataset. B Comparing the performance of the optimal feature vector with the original four descriptors using fivefold cross-validation on the test dataset. C The dimensions of the optimal feature vector and the original four descriptors. D The area under the receiver operating characteristic curve (AUC) was employed as the primary evaluation metric to systematically compare the performance of four models, including Antiaging-FL, ESM_9x, ESM_GAN, and CNN_9x, in the task of anti-aging peptide identification. A higher AUC value indicates a stronger discriminative ability of the model in distinguishing between positive and negative samples

We further compared the performance of the optimal feature vector-based model with that of the ESM_9x, ESM_GAN, and CNN_9x models on the independent test dataset. The ESM_9x model was built using a 9 × augmented dataset generated by the conservative amino acid substitution method, with ESM features used for peptide representation. The ESM_GAN model was constructed from a GAN-augmented dataset (2 × the original size), also using ESM-based features. The CNN_9x model employed the same 9 × amino acid substitution-based augmented dataset, using ESM features as input to a deep learning model with an embedding layer. The results showed that the Antiaging-FL model achieved the highest AUC score of 0.99, outperforming ESM_9x (0.95), ESM_GAN (0.95), and CNN_9x (0.94) (Fig. 9D). These findings indicate that the Antiaging-FL model possesses superior predictive capability compared to other constructed models. However, it is worth noting that the Antiaging-FL model was trained on a relatively small dataset, which may lead to overfitting. Therefore, we retained both the ESM_GAN and CNN_9x models as additional reference models for users.

Prediction of potential anti-aging peptide using public data

Peptides intervene in the aging process through various mechanisms, including anti-cancer, antimicrobial, anti-inflammatory, antioxidant, and immunomodulatory pathways [45,46,47]. Therefore, we aimed to investigate whether peptides with functions such as anticancer and antimicrobial properties also possess anti-aging capabilities. We collected peptides with corresponding functions, including antimicrobial peptides, anticancer peptides, IL-13 inducing peptides, antimalarial peptides, antioxidant peptides, and anti-inflammatory peptides, to analyze their potential anti-aging function. The antimicrobial peptides were primarily sourced from the Antimicrobial Peptide Database, which includes 3146 natural antimicrobial peptides [48]. We analyzed them using the constructed prediction model, revealing that 23% (729/3167) of the antimicrobial peptides were predicted to be potential anti-aging peptides (Fig. 10A). This is understandable, as antimicrobial peptides possess broad-spectrum antibacterial activity, capable of combating pathogenic microorganisms by not only directly killing bacteria but also modulating the host’s immune response. As aging progresses, the immune system’s functionality declines, increasing infection likelihood. Antimicrobial peptides may exert anti-aging effects by enhancing the immune defense system. Additionally, some antimicrobial peptides are multifunctional, possessing anti-inflammatory and antioxidant properties, thereby potentially slowing aging, ameliorating various age-related diseases. IL-13 inducing peptides generally function within the immune system. We downloaded IL-13 inducing peptide sequences from the iIL13Pred web server [49] and input them into the model for prediction, showing that 32.9% (103/313) of the sequences were classified as anti-aging peptides (Fig. 10B). Similarly, we obtained anticancer peptides from the ACPred-FL web server [50], antimalarial peptides, antioxidant peptides, and anti-inflammator
