Introduction
CRISPR has become well established as a key genome editing tool, with Cas9 as the most widely used CRISPR enzyme. Classical CRISPR-Cas9-based gene editing relies on the introduction of double-stranded DNA breaks (DSBs) guided by the guide RNA (gRNA). Repair of the CRISPR-induced DSBs in cells leads to a variety of possible insertions or deletions (indels) in the DNA. This lack of control over editing outcomes has been addressed by engineering next-generation CRISPR editors such as base editors (BE)1 and prime editors (PE)2. A BE consists of a recombinant Cas9 nickase and an attached deaminase that enables the introduction of single-nucleotide mutations without double-stranded DNA cleavage3,4. Similar to standard Cas9, BEs make use of a gRNA whose 20-nucleotide (nt) spacer must be complementary to the DNA target site (protospacer). Here, we focus on adenine base editors (ABEs), converting A•T into G•C base pairs, and cytosine base editors (CBEs), converting C•G into T•A base pairs. Both allow for conversion of a single base pair, but “bystander” bases in an approximately 8nt window may be edited as well5. The bystander edits result in different outcomes for a given gRNA. Hence, for base editing to be widely usable, design tools are required that accurately predict both the frequencies of all outcomes and the gRNA efficiency. Thus, the prediction task extends the common task of predicting gRNA efficiency alone.
Within the last few years, tens of thousands of ABE and CBE gRNA efficiencies and corresponding outcomes have been evaluated in cells6,7,8,9,10,11,12,13. However, the availability of datasets for building machine learning prediction models is still limited, and the diverse properties of deaminases6,10,14,15 and of libraries generated by various sources pose a challenge for the use of current datasets. In addition, these datasets are challenged by the inclusion of many low-efficiency gRNAs. Thus, generating more datasets is crucial for the development of better prediction models. Furthermore, since the datasets are not directly compatible, combining them is not straightforward, and all existing methods6,7,8,9,10,11,12,13 are each based on a single dataset with its given limitations.
Here, we apply our previously established lentiviral gRNA-target pair library technology (SURRO-seq)16,17 to measure base editing efficiency on a massive scale for two base editors, ABE (ABE7.10) and CBE (BE4-Gam), in HEK293T cells. For each base editor, we assessed editing efficiency and outcomes of approximately 11,500 gRNAs using SURRO-seq. Upon integration with published datasets covering ABE7.10, ABE8e and BE4, totaling 17,941 gRNAs for ABE and 19,010 gRNAs for CBE, we developed two integrated deep-learning models, CRISPRon-ABE and CRISPRon-CBE, for predicting gRNA editing efficiency and outcome frequency for ABE and CBE, respectively. The deep learning models are trained simultaneously on all data in a procedure that keeps track of the source of each data point. In this way, prediction for a specific base editor (dataset) benefits from the others, and predictions can be based on a weighted combination of the datasets used for training. Our models simultaneously predict gRNA efficiency and outcome frequency, and their performance is evaluated by the two-dimensional Pearson and Spearman rank correlation coefficients R² (ref. 18) and ρ² (see “Methods” for details). Benchmarking our models on independent test sets shows superior performance compared to existing tools. Additionally, feature analysis reveals that predicted CRISPR-Cas9 efficiency plays an important role in base editor efficiency prediction. A webserver based on CRISPRon-ABE/CBE for predicting gRNA efficiency and outcome frequencies is available via https://rth.dk/resources/crispr/.
Results and discussion
Massive quantification of ABE and CBE editing efficiency and outcomes in cells
We first sought to generate more homogeneous data on ABE and CBE editing efficiency and outcomes in cells using the gRNA-target pair library technology developed previously16,17. We transduced HEK293T-ABE (ABE7.10) and HEK293T-CBE (BE4-Gam) cells with a 12 K gRNA-target pair library at a multiplicity of infection (MOI) of 0.3, with an estimated coverage of approximately 1000 (Fig. 1A). We harvested the transduced cells after eight days of growth in selective medium: puromycin (for selecting cells with integration of the gRNA-target pair construct) and doxycycline (for induction of ABE and CBE expression). Deep amplicon sequencing was performed using surrogate target site-specific PCR products (coverage of ~2000 reads per gRNA) to capture the editing efficiencies and outcomes (Fig. 1A). After removing low-quality gRNAs (<100 reads), we obtained 11,484 gRNAs covered in the ABE set and 11,406 gRNAs covered in the CBE set (Fig. 1B and Supplementary Data 1). We also investigated the strictness and editing window (positions 3-10 in the protospacer) of ABE7.10 and BE4 using our large amount of in-cell measured data. As shown in Fig. 1C, ABE7.10 exhibited a highly strict adenine (A) to guanine (G) transition: 97% of all edited sites were A-to-G transitions. Unlike ABE7.10, BE4 exhibited a less strict cytosine (C) to thymine (T) transition (92%), and a low frequency (3.5%) of C-to-G transversions was observed for BE4. Although a Cas9 nickase was used for both ABE7.10 and BE4, a low indel frequency (2.06% for ABE7.10; 2.96% for BE4) was also observed. When analyzing the editing window, the narrower core editing window (positions 4-8 in the protospacer) exhibited high editing efficiency for both ABE7.10 and BE4 (Fig. 1D). We compared the ABE7.10 and BE4 editing efficiencies with the Streptococcus pyogenes Cas9 (SpCas9) indel frequencies obtained for the same sites. Our results showed a positive correlation between base editing efficiency and SpCas9-induced indel frequency (R = 0.40 to 0.54 for ABE7.10; R = 0.29 to 0.49 for BE4) (Supplementary Fig. 1). We also performed sequence motif analysis similar to that of Arbab et al.6 to explore how nucleotides flanking the edited sites affect base editing efficiency, using the three nucleotides flanking the editing sites within the core editing window. ABE7.10 and BE4 exhibit high editing activity at 5’TAC (the A is the edited site) and 5’TC (the C is the edited site), whereas ABE7.10 and BE4 fail to edit sites with 5’AAA and 5’GC, respectively (Fig. 1E). In conclusion, we generated experimental data for over 11,000 gRNAs in cells, comprising SpCas9 indel frequency, ABE (ABE7.10) base editing activity, and CBE (BE4) base editing activity.
Fig. 1: High throughput generation of ABE and CBE editing data in cells.
A Illustration of the experimental setup. HEK293T-ABE7.10 and HEK293T-BE4 refer to doxycycline (Dox)-inducible ABE (ABE7.10) and CBE (BE4) human embryonic kidney 293T (HEK293T) cells, respectively. The lentiviral 12 K gRNA-target pair library contains 12,000 gRNAs and was generated previously in our CRISPRon study17. B Histogram of the distribution of base editing efficiency (% edited sites out of the total sites in the target site) for ABE7.10 (top) and BE4 (bottom). C Quantification of strictness for all editing events detected in the surrogate target sites for ABE7.10 and BE4. D Bar plot of substitution fractions at the edited site (A for ABE7.10 and C for BE4) for all target sites at each position, including the flanking bases, protospacer, and PAM (protospacer adjacent motif). The distal base pair is defined as position 1 (N1). E Sequence motif for the three flanking sites of the edited base for ABE7.10 (left) and BE4 (right). Logistic regression weights are shown as positively or negatively correlating with base editing efficiency. Source data are provided as a Source Data file.
Dataset aware deep learning model improves prediction of base editor efficiency and outcome frequency
In contrast to existing models, which predict only outcome frequency and from that infer the gRNA efficiency8, we developed deep learning models that predict both simultaneously from the 30-nucleotide (nt) input DNA target sequence (Supplementary Fig. 2 and Supplementary Note 1). As inputs, we included the 30nt DNA sequence (20nt protospacer + 3nt PAM + 4nt and 3nt flanking sequences at the 5’ and 3’ ends, respectively), the gRNA-DNA binding energy ∆GB19 previously adopted in CRISPRon, a labeling of the target nucleotide editing positions for the different outcomes, and the predicted Cas9 efficiency computed by a CRISPRon version trained on gRNAs not overlapping the test set17 (Supplementary Note 2). We only consider editing outcomes within the 8nt editing window for both ABE and CBE in the construction of the prediction model (Methods). Since there is a correlation between Cas9-induced indel frequency and base conversion efficiency (Supplementary Fig. 1), making use of the predicted Cas9 efficiency is beneficial for predicting base editing gRNA efficiency and outcome frequency. This is further supported by a SHAP analysis20 (Supplementary Note 3 and Supplementary Fig. 3) and an ablation analysis in which features were sequentially removed before training the model (Supplementary Fig. 4).
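To make the input representation concrete, the sketch below assembles one training example from the features just described (one-hot 30nt sequence, 8nt outcome labeling, ∆GB, predicted Cas9 efficiency, and the dataset labeling vector introduced later). The function and variable names are ours and hypothetical, not taken from the released CRISPRon-ABE/CBE code, and the numeric values in the example are placeholders.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence into a (len, 4) matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, nt in enumerate(seq.upper()):
        mat[i, BASES[nt]] = 1.0
    return mat

def encode_example(target30: str, edited_positions: list[int],
                   delta_gb: float, cas9_eff: float,
                   dataset_onehot: list[float]) -> np.ndarray:
    """Assemble one flattened input vector for a (gRNA, outcome) pair.

    target30        : 30nt target (4nt 5'-flank + 20nt protospacer + 3nt PAM + 3nt 3'-flank)
    edited_positions: protospacer positions (3-10) edited in this outcome
    delta_gb        : gRNA-DNA binding energy (from the CRISPRoff pipeline)
    cas9_eff        : predicted SpCas9 indel frequency (from CRISPRon)
    dataset_onehot  : dataset-origin labeling vector (length 5 for ABE, 3 for CBE)
    """
    assert len(target30) == 30
    seq = one_hot(target30).ravel()                 # 120 values
    window = np.zeros(8, dtype=np.float32)          # the 8nt editing window
    for pos in edited_positions:                    # positions 3..10 -> indices 0..7
        window[pos - 3] = 1.0
    scalars = np.array([delta_gb, cas9_eff], dtype=np.float32)
    return np.concatenate([seq, window, scalars,
                           np.asarray(dataset_onehot, np.float32)])

# Example: an ABE outcome with protospacer positions 5 and 7 edited,
# attributed 100% to the SURRO-seq dataset ("10000"); values are placeholders.
x = encode_example("ACGT" + "A" * 20 + "AGG" + "TTT", [5, 7], -18.2, 65.0,
                   [1, 0, 0, 0, 0])
print(x.shape)  # (135,)
```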
Before constructing prediction models that can make use of multiple datasets simultaneously, we trained our model on individual datasets using this architecture (Supplementary Fig. 2). We collected datasets for two base editors (ABE7.10 and BE4) for training: the SURRO-seq data from this study, the Song Train and Test datasets (Song datasets)7, and the Arbab dataset6 for ABE and CBE, respectively. Moreover, we included two ABE datasets (“SpCas9-ABEmax” and “SpCas9-ABE8e” from Kissling et al.12, here referred to as “Kissling ABE7.10” and “Kissling ABE8e”) (Methods). We trained our model using a 5-fold cross-validation strategy, with a 6th partition reserved for independent testing; see the detailed training process and dataset separation in the Methods and Supplementary Data 2. Since we have two outputs, gRNA efficiency and outcome frequency, we used the two-dimensional Pearson and Spearman rank correlation coefficients R² (ref. 18) and ρ² introduced here (Supplementary Note 4) for the combined evaluation of gRNA editing efficiency and outcome frequency predictions. Evaluation on independent test sets illustrates that our deep learning architecture is effective for the base editor efficiency prediction problem (Supplementary Fig. 5). We also note that when training our model solely on the Song Train data, we obtained higher performance on the test set (Song Test) than their model DeepABE/CBE (Supplementary Fig. 5).
Upon inspecting the datasets (Fig. 2, Supplementary Fig. 6, Supplementary Note 5), we noticed that the Kissling ABE8e dataset predominantly exhibits high editing efficiencies, between 60% and 80%. In contrast, the Kissling ABE7.10 dataset targets the same gRNA sequences but shows lower efficiencies, from 30% to 50%. ABE8e was engineered from ABE7.10 by phage-assisted evolution; it contains 8 extra mutations in the TadA-7.10 domain and exhibits 3-fold to 11-fold increases in editing levels15.
Fig. 2: Variety of datasets from diverse sources.
A, B Distributions of gRNA editing efficiency for ABE (A) and CBE (B); C Summary of diverse datasets from different studies. The column “#Bystander edits” shows the number of edited nucleotides within the editing window from a single outcome observed in experiments. Source data are provided as a Source Data file.
To train prediction models on multiple datasets simultaneously (Fig. 3A, Supplementary Fig. 7), we use the output values as provided by the respective datasets but label them individually by including a vector whose size equals the number of datasets being trained on. One position in the vector has the value 1, indicating the dataset origin of a specific input-output example, while the other positions are 0. For the ABE models the order is: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For the CBE models, the order is: SURRO-seq, Song Train, and Arbab. When using the models for prediction, we have the option of using the positions in the vector as weights (see Supplementary Table 1 for the detailed model tuning strategy). Users can assign full weight (100%) to a specific training set by setting a value of 1 at the corresponding data index, or allocate fractional weights across multiple datasets, with the total sum constrained to 1. For example, the encoding “10000” (for ABE) or “100” (for CBE) corresponds to models tuned exclusively to the SURRO-seq dataset. The encoding “22000” refers to an ABE model with equal weight (50%, or 1/2) on the SURRO-seq and Song datasets, and no weight on the others. Note that the digit at a given position of the labeling vector corresponds to the denominator of the fraction representing the weight. Evaluating the performance on several independent test sets reveals consistent results for fused models tuned on different datasets (Fig. 3B, C). Compared with established base editor efficiency prediction methods, our models demonstrate superior performance on independent test sets (Fig. 3B, C, Supplementary Figs. 5, 8–12 and Supplementary Data 3). Notably, all methods yield higher performance on the Song Test set, showing that this set for some reason is “easier” than the other test sets.
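As a minimal sketch of this digit encoding (our own helper, not part of the released software), the decoding into a weight vector looks as follows:

```python
def decode_weights(encoding: str) -> list[float]:
    """Decode a dataset-weight encoding such as "22000" or "333".

    Each nonzero digit d at position i assigns weight 1/d to dataset i;
    a zero assigns no weight. The nonzero digits are expected to share
    the same value so that the weights sum to 1.
    """
    weights = [0.0 if d == "0" else 1.0 / int(d) for d in encoding]
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must sum to 1"
    return weights

print(decode_weights("10000"))  # [1.0, 0, 0, 0, 0] -> 100% SURRO-seq (ABE)
print(decode_weights("22000"))  # [0.5, 0.5, 0, 0, 0] -> 50% SURRO-seq, 50% Song Train
print(decode_weights("333"))    # ~[1/3, 1/3, 1/3] -> equal weight on all CBE sets
```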
Fig. 3: Schematic representation of the CRISPRon-ABE and CRISPRon-CBE prediction algorithms.
A The inputs of the models are the one-hot encoded 30nt target DNA sequence, the labeling of the edited nucleotide positions within the editing window of the outcome sequence, the binding energy (ΔGB), the predicted Cas9 efficiency, and the labeling of the dataset source. For ABE models, the dataset labels follow the order: SURRO-seq, Song Train, Arbab, Kissling ABE7.10, and Kissling ABE8e. For CBE models, the order is: SURRO-seq, Song Train, and Arbab. B, C Performance comparison of CRISPRon-ABE and CRISPRon-CBE together with other existing models on independent test sets. The fused model makes use of a numerical encoding corresponding to the weight given to each dataset. Specifically, “1” represents giving 100% weight to the corresponding dataset (SURRO-seq, Song Train, Arbab, Kissling ABE7.10 or Kissling ABE8e), while “2”, “3” and “4” represent giving 50% (1/2), 33% (1/3) or 25% (1/4) weight to the corresponding dataset. For example, the encodings “10000” and “100” refer to models tuned to SURRO-seq for ABE and CBE, respectively. The encoding “22000” refers to a model that gives equal weight (50%) to the SURRO-seq and Song datasets and no weight (0%) to the other ABE datasets. Similarly, the encoding “333” refers to equal weight (33%) on all three CBE datasets. N.a. means not available due to lack of explicit train-test separation. The two-sided Steiger’s test p-value was used to compare correlation coefficients between two models. * represents p-value < 0.05, ** represents p-value < 0.01, *** represents p-value < 0.001. The p-values for ABE7.10 evaluated by R² (from top to bottom) are 0, 0, 0, and 1.73e-6, respectively. The p-values for ABE7.10 evaluated by ρ² (from top to bottom) are 0, 0, 0, and 6.66e-16, respectively. The p-values for ABE8e evaluated by R² and ρ² are 0 and 1.06e-13, respectively. The p-values for BE4 evaluated by R² (from top to bottom) are 0, 2.34e-6, and 0, respectively. The p-values for BE4 evaluated by ρ² (from top to bottom) are 0, 7.88e-3, and 0.42, respectively.
Due to the data heterogeneity and the differences in base editor efficiency, our results suggest that weighting decisions should depend on the base editor and the technology used for the actual editing. For instance, if the gRNA is designed for ABE8e, we suggest giving 100% weight to the Kissling ABE8e dataset with CRISPRon-ABE (00001). If the gRNA is designed for ABE7.10 or BE4 and the editing is performed on the SURRO-seq, Song, Arbab or Kissling platform, we suggest giving 100% weight to the corresponding dataset. If the gRNA is designed for ABE7.10 or BE4 but the platform for base editing is unclear, we suggest considering the scaffold sequence. If the scaffold is the same as the one used for the Arbab and Kissling ABE7.10 datasets, with an extended stem and a modified nucleotide21, we suggest giving 50% weight to each of the Arbab and Kissling ABE7.10 datasets for ABE (CRISPRon-ABE(00220)) or 100% weight to the Arbab dataset for CBE (CRISPRon-CBE(001)). Otherwise, we suggest giving 50% weight to each of the SURRO-seq and Song datasets (CRISPRon-ABE(22000) and CRISPRon-CBE(220)).
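These recommendations can be condensed into a small helper function. The sketch below simply mirrors the rules stated above; the function and argument names are hypothetical and not part of the released software.

```python
def suggest_encoding(editor: str, platform: str | None = None,
                     arbab_kissling_scaffold: bool | None = None) -> str:
    """Return a suggested dataset-weight encoding following the text.

    ABE encodings have 5 digits (SURRO-seq, Song, Arbab, Kissling ABE7.10,
    Kissling ABE8e); CBE encodings have 3 digits (SURRO-seq, Song, Arbab).
    """
    if editor == "ABE8e":
        return "00001"                       # 100% Kissling ABE8e
    is_abe = editor == "ABE7.10"
    if platform is not None:                 # editing platform is known
        abe_index = {"SURRO-seq": 0, "Song": 1, "Arbab": 2, "Kissling": 3}
        cbe_index = {"SURRO-seq": 0, "Song": 1, "Arbab": 2}
        idx = (abe_index if is_abe else cbe_index)[platform]
        enc = ["0"] * (5 if is_abe else 3)
        enc[idx] = "1"                       # 100% to the matching dataset
        return "".join(enc)
    if arbab_kissling_scaffold:              # extended-stem, modified scaffold21
        return "00220" if is_abe else "001"
    return "22000" if is_abe else "220"      # default: 50/50 SURRO-seq + Song

print(suggest_encoding("ABE7.10"))                            # 22000
print(suggest_encoding("BE4", arbab_kissling_scaffold=True))  # 001
```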
We observed that the model performance on the independent test sets declined by approximately 10% when dataset labeling was omitted during training (Supplementary Fig. 4). This observation is consistent with the datasets being diverse and supports the importance of dataset awareness when developing models using combined and diverse datasets.
The limited variety of base editors included in CRISPRon-ABE/CBE also limits the ability to construct predictors that apply broadly across the base editor field. So far, the models have been developed using the ABE7.10, ABE8e and BE4 base editors. Considering the distinct sequence motif preferences of different deaminases6,10,22, accurate prediction of editing outcomes for other base editors using CRISPRon-ABE/CBE remains a challenge. For example, newly developed base editors like ABE8.20m23 and CBE6b14 have not been incorporated into our deep-learning models, or even into the benchmarking analyses, due to the limited publicly available datasets. This emphasizes the need for, and promise of, generating large-scale editing datasets using, e.g., SURRO-seq for these newly evolved base editing enzymes. In addition, our current model is only trained on data harvested in the HEK293T cell line. Systematic benchmarking indicates that CRISPRon-ABE/CBE achieves comparable performance across the cell lines in the Arbab dataset (Supplementary Fig. 12). Given the more diverse properties across different platforms for data generation (Fig. 2 and Supplementary Fig. 6), the diversity of base editing across cell lines is not accounted for in our current model. This highlights the need for more data.
We conclude that our fused models (CRISPRon-ABE and CRISPRon-CBE) consistently demonstrate significantly improved prediction. In a design context, the labeling of the datasets used during training can be replaced by weights, allowing emphasis on a specific dataset, e.g., one that uses a BE closer to the one of interest to the user. Our analysis showed that both more data and data integration in the prediction model are important for further improving the performance. In addition, our dataset integration strategy not only proves effective for combining datasets from multiple base editors but also holds potential for broader applications in other CRISPR technologies. For instance, integrating cell-line information when training models on prime editing datasets with the same edit types, such as short indels, but measured in different cell lines may enhance predictive accuracy. The CRISPRon-ABE/CBE models are available as stand-alone software and as a webserver via https://rth.dk/resources/crispr/.
Methods
Cell culture and doxycycline-inducible ABE/CBE stable cell lines generation
Wildtype HEK293T cells from ATCC (catalog number CRL-3216) were cultured in DMEM supplemented with 10% FBS and 1% penicillin-streptomycin (D10 medium) at 37 °C with 5% CO2. Routine mycoplasma testing by PCR (cat. no. PM008) yielded negative results for the cells used in this study. TRE-ABE7.10 and TRE-CBE (BE4) stable cell lines were generated using the PiggyBac transposon system. HEK293T cells were transfected with the pPB-ABE7.10-hygromycin (modified from Addgene #102919) or pPB-CBE-hygromycin (modified from Addgene #100806) vector together with pCMV-hybase at a 9:1 ratio. HEK293T-ABE7.10 and HEK293T-BE4 cells were cultured in selection medium containing 50 μg/mL hygromycin.
Lentivirus transduction with the 12 K library
HEK293T-ABE7.10 and -BE4 cells were cultured in D10 medium supplemented with 50 μg/mL hygromycin throughout the experiment. The 12 K library was generated previously17. For the lentiviral 12 K CRISPR gRNA-target pair library transduction, the following steps were carried out: 1) Day −1: 2.5 × 10^6 cells per 10 cm dish were seeded in 12 dishes. One dish per group was allocated for cell number determination pre-transduction, one for drug-resistance (puromycin) control, and the remaining 10 dishes for lentivirus library transduction; 2) Day 0: Cells were counted to estimate the number per dish, which determined the volume of lentivirus needed for transduction at an MOI of 0.3. The low MOI (0.3) ensured that most infected cells received only one copy of the lentivirus. Infected cells were cultured in D10 medium at 37 °C; 3) Day 1: Transduced cells were split into three dishes per original dish; 4) Day 2: Sub-group 1 (10 dishes) was harvested and labeled as Day 2 post-12 K library transduction, with cells pooled for genomic DNA extraction. Sub-group 2 (10 dishes) received fresh D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin (Dox-free). Sub-group 3 (10 dishes) received D10 medium with 50 μg/mL hygromycin + 1 μg/mL puromycin + 1 μg/mL doxycycline (Dox induction of base editor expression). For the WT HEK293T cell (Group 3) screening, hygromycin but not puromycin was used in the culture medium; 5) Transduced cells were split every 2–3 days, when confluence reached 90%; 6) The transduced cells were harvested for genomic DNA extraction and targeted deep sequencing after eight days of growth in the Dox selection medium.
Targeted deep sequencing
Genomic DNA from transduced cells was extracted using the phenol-chloroform method and subsequently purified. PCR amplification of the surrogate target sites was carried out using the TRAP-NGS-F1 and TRAP-NGS-R1 primers (sequences listed in Supplementary Table 2). Each PCR reaction contained 5 μg of genomic DNA as template, corresponding to approximately 7.6 × 10^5 copies of the gRNA-target pair construct and about 63-fold coverage of the 12 K library. To achieve comprehensive coverage, 32 parallel PCR reactions were conducted, resulting in approximately 2016-fold coverage of each construct. The PCR products were then purified from a 1.5% agarose gel and pooled in equal amounts. The pooled amplicon libraries were subjected to paired-end 150 bp deep sequencing on the MGISEQ-2000 platform (MGI, BGI, China).
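As a quick sanity check (a back-of-the-envelope sketch, not part of the protocol), the stated coverage numbers follow directly from the template amounts:

```python
# Coverage arithmetic for the targeted PCR, using the values stated above.
copies_per_reaction = 7.6e5   # gRNA-target construct copies in 5 ug of gDNA
library_size = 12_000         # gRNAs in the 12 K library
reactions = 32                # parallel PCR reactions

per_reaction = copies_per_reaction / library_size   # ~63.3-fold coverage
total = int(per_reaction) * reactions               # 63 * 32 = 2016-fold
print(f"{per_reaction:.0f}-fold per reaction, {total}-fold in total")
```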
Raw data processing
The raw data were generated on the MGISEQ-2000 (G400) platform as 150 bp paired-end (PE150) reads. FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, version 0.11.3) and fastp (https://github.com/OpenGene/fastp, version 0.19.6)24 were used for data quality control and filtering with default parameters. The paired-end reads were assembled using FLASH (http://www.cbcb.umd.edu/software/flash, version 1.2.11)25. BWA-MEM (version 0.7.17)26 was used to map the assembled reads to the designed oligo sequences to preliminarily assign the data of each synthetic gRNA-target pair surrogate site. The oligo sequences were reported in the previous CRISPRon study and are available in Supplementary Data 1 (ref. 17).
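A minimal sketch of this processing chain, driven from Python (the file names are placeholders; each tool is invoked essentially with default parameters, as described above):

```python
import subprocess

def run(cmd: list[str]) -> None:
    """Run one pipeline step, raising on failure."""
    subprocess.run(cmd, check=True)

# 1) Quality control and filtering with fastp (default parameters).
run(["fastp", "-i", "R1.fq.gz", "-I", "R2.fq.gz",
     "-o", "R1.clean.fq.gz", "-O", "R2.clean.fq.gz"])

# 2) Merge overlapping read pairs into full amplicons with FLASH.
run(["flash", "R1.clean.fq.gz", "R2.clean.fq.gz", "-o", "merged"])

# 3) Map merged reads to the designed oligo reference (Supplementary Data 1).
run(["bwa", "index", "oligos.fa"])
with open("mapped.sam", "w") as sam:
    subprocess.run(["bwa", "mem", "oligos.fa", "merged.extendedFrags.fastq"],
                   stdout=sam, check=True)
```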
Data filtering
The pysam module of Python 3.8 was used to split the aligned data according to the site number on the chip, yielding the reads for each site. We then applied three strictly parameterized filtering steps to the data of each site. First, according to the structure of the chip, the constant regions g + gRNA (20 bp) + scaffold (82 bp) + GTTT were required to remain unchanged at the beginning and end of each site. Second, to remove chip synthesis errors, pseudo-editing sequences found in WT HEK293T cells transduced with the 12 K SURRO-seq library were removed from both the ABE and CBE groups. These filtering steps were implemented in Julia 1.5.3.
Calculation of base editing efficiency
To improve suitability for machine/deep learning, the SURRO-seq dataset was further processed, and base editing efficiency and outcome editing frequencies were calculated separately. Base editing efficiency refers to the fraction of edited reads at a site: for each site, the frequency of mutation (number of mutated reads/total number of reads) was calculated, including the occurrence of indels, although these appear at a lower percentage. To mitigate the impact of enrichment and depletion, we kept only the subset of the dataset with a total number of reads greater than or equal to 100, resulting in 11,484 gRNAs for ABE and 11,406 gRNAs for CBE. The outcomes for all edited results were calculated by masking indel loci as wild type. The frequency (number of mutations per outcome/total number of reads) and proportion (number of mutations per outcome/number of reads in which mutations occurred) were then calculated separately.
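In code, these per-site quantities reduce to simple ratios. The sketch below assumes per-outcome read counts are available; the function and key names are ours, and the counts in the example are invented for illustration.

```python
def site_statistics(outcome_reads: dict[str, int], total_reads: int) -> dict:
    """Compute per-site editing statistics from read counts.

    outcome_reads: reads supporting each distinct edited outcome
    total_reads  : all reads covering the site (edited + unedited)
    """
    edited_reads = sum(outcome_reads.values())
    efficiency = edited_reads / total_reads              # fraction of mutated reads
    frequency = {o: n / total_reads for o, n in outcome_reads.items()}
    proportion = {o: n / edited_reads for o, n in outcome_reads.items()}
    return {"efficiency": efficiency, "frequency": frequency,
            "proportion": proportion}

# Example: 2000 reads and two outcomes observed after masking indel loci
# (outcome labels are illustrative only).
stats = site_statistics({"A5>G": 300, "A5,A7>G": 100}, total_reads=2000)
print(stats["efficiency"])            # 0.2
print(stats["frequency"]["A5>G"])     # 0.15
print(stats["proportion"]["A5>G"])    # 0.75
```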
Sequence context analysis for base editing efficiency
To explore the impact of proximal sites on the editing efficiency of base editors, we constructed sequence motifs for both ABE and CBE using our SURRO-seq dataset. We employed a logistic regression model to predict gRNA editing efficiency and generated sequence logos based on the obtained weights. First, we selected sequences containing the target nucleotides (“A” for ABE and “C” for CBE) along with their flanking nucleotides (±3nt). These sequences were one-hot encoded for efficiency prediction. The gRNA editing efficiency was coded as 0 or 1 to enable binary classification: if the editing efficiency exceeded the mean value at a given position, the target value was 1; otherwise, it was 0. We then fitted the logistic regression model with 5-fold cross-validation, keeping highly similar target sequences (≤4 mismatches over a 30nt target sequence) within a single fold. Model performance was assessed on a 6th partition of equal size reserved for independent testing, using the Pearson correlation coefficient between actual and predicted efficiency values. Additionally, we extracted the weights from the logistic regression model to visualize the effect of sequence context on base editor efficiency. Given the high efficiency observed from positions 4 to 8 of the spacer, we focused the motif analysis on target sequences with target nucleotides located within these positions.
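A minimal sketch of this motif analysis on toy data (the real analysis used 5-fold, similarity-aware cross-validation over the full SURRO-seq dataset; the sequences and efficiencies below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

BASES = "ACGT"

def one_hot_context(seq7: str) -> np.ndarray:
    """One-hot encode a 7nt context (edited base +/- 3nt) into 28 features."""
    x = np.zeros(len(seq7) * 4)
    for i, nt in enumerate(seq7):
        x[i * 4 + BASES.index(nt)] = 1.0
    return x

# Hypothetical toy inputs: 7nt contexts and their measured efficiencies.
contexts = ["CTTACGG", "AGAAAGT", "TTTACAG", "CGAAACT"]
efficiency = np.array([0.62, 0.05, 0.55, 0.08])

X = np.vstack([one_hot_context(c) for c in contexts])
y = (efficiency > efficiency.mean()).astype(int)   # binarize at the mean

model = LogisticRegression().fit(X, y)

# Reshape weights to (position, base) for sequence-logo visualization.
weights = model.coef_.reshape(7, 4)
print(weights.round(2))
```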
Data collection and pre-processing for machine learning
In addition to our SURRO-seq dataset, we collected datasets for both ABE and CBE: Song Train/Test from Song et al.7, the Arbab dataset from Arbab et al.6 (although there is no information about which part of the data was used for their training and testing), and Marquart Test from Marquart et al.8. We also collected two ABE datasets (Kissling ABE7.10 and ABE8e) from Kissling et al.12; note that no detailed information was provided regarding which data were used for training and testing in the original study. In total, three base editors are involved in this study: ABE7.10 (SURRO-seq, Song, Arbab, Marquart, and Kissling ABE7.10 datasets), ABE8e (Kissling ABE8e dataset) and BE4 (SURRO-seq, Song, Arbab, and Marquart datasets). In addition, we collected not only the data evaluated in HEK293T cells but also the Arbab datasets from other cell lines, including mESC (Arbab mESC Test) and U2OS (Arbab U2OS Test), to explore model performance across cell lines. Further details about the datasets are given in Supplementary Table 3.
The datasets were processed as follows: 1) Removal of gRNAs without an “NGG” PAM at the DNA target site; 2) Removal of gRNAs whose corresponding target sites are supported by fewer than 100 total reads; 3) Removal of gRNAs with over 100,000 total reads, as their total read counts are much higher than those of other gRNAs; 4) Removal of gRNAs not containing the nucleotide A (ABE) or C (CBE) within the editing window (positions 3 to 10); 5) Exclusive retention of outcomes with intended target nucleotide transitions (A > G for ABE, C > T for CBE) within the editing window, considering that only few reads supported transitions to unintended nucleotides, such as A > Y for ABE and C > R for CBE, and that transitions occur mainly within the editing window (Fig. 1C, D). The outcomes from one target DNA sequence were further grouped according to the positions of the transitioned nucleotides in the editing window, and the frequencies within each group were summed; 6) Generation of outcomes at target sites with the potential to be edited but without any supporting reads from the experiments, by setting their outcome frequency to 0; 7) Removal of gRNAs whose corresponding target sites are supported by fewer than 100 edited reads, as in previous studies6,7. As we observed in the Song datasets that the summed read counts of the outcomes edited by a specific gRNA can exceed the total number of reads reported for that same gRNA, we recalculated the outcome frequencies accordingly. See Supplementary Table 4 for more details about dataset processing and Supplementary Table 5 for information on the general composition of these datasets.
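As an illustration, the read-count and sequence filters (steps 1–4 and 7) could be expressed as follows; the DataFrame column names are hypothetical, not from our pipeline:

```python
import pandas as pd

def filter_grnas(df: pd.DataFrame, editor: str) -> pd.DataFrame:
    """Apply the read-count and sequence filters from steps 1-4 and 7.

    Assumes (hypothetical column names) a 30nt 'target' column laid out as
    4nt flank + 20nt protospacer + 3nt PAM + 3nt flank, plus per-gRNA
    'total_reads' and 'edited_reads' counts.
    """
    base = "A" if editor == "ABE" else "C"
    pam = df["target"].str[25:27] == "GG"               # NGG PAM (step 1)
    reads_ok = df["total_reads"].between(100, 100_000)  # steps 2 and 3
    # Editing window: protospacer positions 3-10 = target offsets 6..13.
    window_has_base = df["target"].str[6:14].str.contains(base)  # step 4
    edited_ok = df["edited_reads"] >= 100               # step 7
    return df[pam & reads_ok & window_has_base & edited_ok]
```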
Each target sequence is 30nt long, encompassing 4nt upstream, the 20nt target DNA protospacer (corresponding to the spacer), the 3nt PAM and 3nt downstream of the PAM. To develop our model, we combined our SURRO-seq dataset, Song Train, the Arbab dataset, and the Kissling ABE7.10 and Kissling ABE8e datasets (train and/or test set), while Song Test, Marquart Test and the Arbab mESC and U2OS Tests were only used for model evaluation (test set). The combined dataset consists of 17,941 unique 30nt target DNA sequences for ABE and 19,010 unique sequences for CBE. The overlap of identical 30nt sequences between our SURRO-seq dataset and Song Train was modest (20 gRNAs for ABE and 22 gRNAs for CBE), and these were retained in both dataset sources. While the gRNAs in the two Kissling datasets come from the same library, 932 gRNAs remained in both datasets after the above filtering steps, and all of these gRNAs were retained. See Supplementary Fig. 13 for details.
Generation of target DNA and outcome features
We processed the 30nt target DNA sequences and their corresponding edited outcome sequences in the following steps: 1) The 30nt target DNA sequences were one-hot encoded; 2) Outcome sequences in the editing window were one-hot encoded/labeled (edited or not) based on the positions of the target nucleotide transitions in the editing window. Each outcome sequence was therefore represented by one vector with length equal to that of the editing window, with the value set to 1 if the nucleotide was edited and 0 otherwise; 3) The spacer RNA-target DNA binding energy ΔGB was calculated using the CRISPRoff pipeline 1.1.2 (ref. 19) supplied with RNAfold 2.5.1 (ref. 27); 4) CRISPR/Cas9 indel frequencies were predicted by CRISPRon17 (see details in Supplementary Note 2); 5) The origin of each training set was encoded in a “labeling” vector of size 5 for ABE (SURRO-seq, Song Train, Arbab, Kissling ABE7.10, Kissling ABE8e) and size 3 for CBE (SURRO-seq, Song Train, Arbab), as detailed in Supplementary Table 1.
Generation of dataset partitions
The dataset was split into six partitions of approximately equal size (within a count of one gRNA) for model development, as previously done for CRISPRon17, resulting in five partitions for cross-validation and a 6th partition for testing. The procedure took the sequence similarity between 30nt target DNA sequences into account by assigning highly similar sequences to the same partition, while the remaining, dissimilar ones were randomly assigned to partitions. In addition, we considered that part of the targets were included in the training of CRISPRon17, which is applied for Cas9-gRNA efficiency prediction, and that none of the data points in the 6th partition (test set) should belong to the CRISPRon training set, to prevent overestimation (Supplementary Note 2). This was done as follows (see the detailed workflow in Supplementary Fig. 14): 1) The pairwise Hamming distances between all 30nt target DNA sequences were calculated using the scikit-learn pairwise_distances function28, and highly similar sequences (≤4 mismatches over a 30nt target sequence) were extracted, as for CRISPRon17; 2) The 30nt target DNA sequences employed in CRISPRon model development across six validation sets were assigned to the same partitions.
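A sketch of the similarity calculation in step 1, using the scikit-learn function named above; a threshold of ≤4 mismatches over 30nt corresponds to a normalized Hamming distance of ≤4/30 (the helper name and toy sequences are ours):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def similar_pairs(seqs: list[str], max_mismatches: int = 4) -> list[tuple[int, int]]:
    """Return index pairs of 30nt sequences within max_mismatches of each other."""
    # Encode each sequence as integers so the 'hamming' metric applies per position.
    X = np.array([["ACGT".index(nt) for nt in s] for s in seqs])
    # sklearn's hamming metric returns the *fraction* of differing positions.
    D = pairwise_distances(X, metric="hamming") * X.shape[1]
    i, j = np.where(np.triu(D <= max_mismatches, k=1))
    return list(zip(i.tolist(), j.tolist()))

seqs = ["ACGT" + "A" * 26, "ACGT" + "A" * 24 + "GG", "T" * 30]
print(similar_pairs(seqs))  # [(0, 1)] -- the first two differ at only 2 positions
```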