Introduction
Sequence alignment is a fundamental starting point for genomic research and clinical diagnostics, serving as the critical bridge between raw sequencing data and meaningful biological insights. Accurate alignment directly influences the quality of downstream analyses, such as variant detection, genome assembly, comparative genomics, and personalized medicine. In particular, the detection of structural variations (SVs)—genomic rearrangements such as inversions, duplications, translocations, and complex clustered events—relies heavily on precise alignment1. However, the inability of existing linear aligners to adequately represent complex SVs remains a barrier to progress in genomic science. Misaligned or misrepresented SVs hinder downstream analyses, leading to gaps in our understanding of genetic variation and its impact on human health and disease.
SVs, defined as genomic alterations of 50 base pairs or larger, are among the most impactful sources of genetic variation1,2,3,4. They affect more nucleotides in the genome than smaller variants, such as single-nucleotide polymorphisms (SNPs) or small insertions and deletions (indels)3,5,6,7. Consequently, their influence is increasingly recognized in evolutionary processes and human health, where they contribute to both Mendelian and complex diseases as well as cancer development8,9,10,11. Despite their importance, our understanding of SVs remains limited12. This is partly due to the intrinsic complexity of complex SVs, but also to the inability of current state-of-the-art aligners to represent them. Research has predominantly focused on simpler SV classes, such as deletions and insertions, while more intricate variations (duplications, inversions, and complex clustered events) are often misrepresented and thus missed. This gap in our analytical approach hinders a comprehensive understanding of these genomic phenomena13,14, even though several studies have already highlighted their importance7,8,9,10,11,14,15,16.
These studies have mainly been driven by long-read sequencing, which enables the characterization of tandem repeats and thus of the regions where SVs are predominantly observed17. Despite these advances, long reads are challenging to align because of their length and generally higher sequencing error rates. Previous work has highlighted that specialized aligners are required to align them accurately. These aligners predominantly use linear alignments to the reference, in which multiple linear subalignments represent a potential SV. The subalignments are identified through a process commonly known as the seed-chain-extend algorithm, tailored for long-read mapping. In this approach, seeds (exact matches, such as k-mers) are identified, and a co-linear subset of these seeds is selected to form chains, which are then extended bidirectionally until significant differences between the read and reference sequences are encountered. For reads without SVs, a single chain (alignment) can represent the entire read. However, the presence of SVs requires multiple chains (subalignments) to represent the read fully. For example, a read spanning an inversion typically requires three subalignments to capture the inversion and its flanking regions. Due to the complexity of SVs and their tendency to occur in repetitive regions, the seed-chain-extend approach often generates a pool of redundant subalignments, necessitating an additional step to determine the optimal subset of subalignments. For instance, minimap2 employs a co-linear chaining algorithm to identify the set of all possible co-linear chains. It then uses a greedy strategy during primary chain selection18, which determines the subset of subalignments used to represent the read.
This process initializes an empty set Q and iteratively processes subalignments from the highest to the lowest chaining score: if a subalignment overlaps with a chain in Q by 50% or more of the shorter subalignment’s length, it is marked as secondary; otherwise, it is added to Q. The read is ultimately represented by the subalignments in Q18. In contrast, YAHA adopts a graph-based approach, leveraging the Optimal Query Coverage algorithm to find the optimal set of subalignments that cover the length of the query19. NGMLR builds on a similar strategy, enhancing it with a refined scoring function to identify the combination of subalignments with the highest joint score20. Despite these advancements, post-alignment processes for determining the ideal set of subalignments are often inadequate because of the complexity of the allele and the underlying repetitive regions. They fail especially for duplications, inversions, and translocations, which are often missed or falsely identified by linear alignment algorithms. Duplications, for example, are often misaligned as insertions because linear alignment algorithms prefer a single continuous alignment and treat a duplication, which would require splitting the read, as an insertion. Similarly, inversions and translocations are often misaligned because splitting reads is penalized and thus avoided.
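The greedy primary-chain selection described above can be illustrated with a minimal Python sketch. It follows only the textual description (highest score first, 50% overlap of the shorter subalignment) and is not minimap2's actual code; the `Chain` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    """Hypothetical subalignment: read interval [q_start, q_end) and chaining score."""
    q_start: int
    q_end: int
    score: float


def overlap_fraction(a: Chain, b: Chain) -> float:
    """Overlap on the read, as a fraction of the shorter chain's length."""
    overlap = min(a.q_end, b.q_end) - max(a.q_start, b.q_start)
    shorter = min(a.q_end - a.q_start, b.q_end - b.q_start)
    return max(overlap, 0) / shorter if shorter > 0 else 0.0


def select_primary(chains, min_overlap=0.5):
    """Greedy selection: returns (Q, secondary) as two lists of chains."""
    primary, secondary = [], []
    # Process chains from the highest to the lowest chaining score.
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        if any(overlap_fraction(chain, p) >= min_overlap for p in primary):
            secondary.append(chain)   # overlaps a chain already in Q
        else:
            primary.append(chain)     # added to Q
    return primary, secondary


# Toy example with made-up coordinates and scores.
c1, c2, c3 = Chain(0, 1000, 900.0), Chain(100, 600, 300.0), Chain(1000, 1500, 400.0)
primary, secondary = select_primary([c1, c2, c3])
```

In this toy input, the second chain lies entirely inside the first and is therefore marked as secondary, while the third chain covers a different part of the read and joins Q.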
To overcome the challenges posed by existing alignment methods, we introduce VACmap, a long-read mapping tool developed to improve the representation of all types of SVs. VACmap uses a non-linear alignment algorithm that captures an entire read as a unified, non-linear alignment. This approach streamlines the traditional alignment process by eliminating the need for splitting reads and selecting from multiple linear alignments. We demonstrate that this approach improves the representation of complex alleles, providing a more accurate and comprehensive view of SVs.
Results
The workflow of the non-linear alignment algorithm
Figure 1 gives an overview of VACmap’s non-linear mapping approach. The key difference between VACmap and other approaches arises after initial matches between the reference and read sequences have been identified. At this stage, existing methods try to conserve the order of all subalignments by heavily penalizing splits when searching for chains of matches. The linear alignment approach can efficiently model genomic alterations such as deletions, insertions, and substitutions, since these do not break the co-linearity of a chain. However, it penalizes the detection of complex SVs such as duplications, inversions, translocations, or combinations of SVs. In VACmap, we propose a hybrid alignment algorithm that combines both linear and non-linear linkage approaches in a chain. In detail, VACmap represents matches as quadruples called ‘anchors’, which include the start positions in the long read and reference sequences, the strand of the match, and the anchor’s length. Anchors are ordered by their end positions in the long read. VACmap’s non-linear chaining algorithm then promotes the extension of the chain to subsequent anchors that preserve a strictly linear relationship with the preceding anchor, rewarding this connection with a positive score. Conversely, it penalizes extensions to anchors that disrupt this linearity by assigning negative scores to such connections. The optimal non-linear alignment of the entire sequence is then the chain with the highest aggregate score (the longest path). The linear subalignments can be extracted by dividing the non-linear alignment at its non-linear junctions, eliminating the traditional need for additional post-alignment steps that reconstruct genomic rearrangements from a pool of error-prone independent subalignments (see “Methods” for details).
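To make the chaining idea concrete, the following Python sketch chains anchors with a reward for linear links and a penalty for non-linear ones, then splits the best chain at its non-linear junctions. It is a simplified illustration under stated assumptions, not VACmap's implementation: the `bonus`, `penalty`, and `max_gap` values, the definition of "linear" via a diagonal tolerance, and all function names are hypothetical, and the quadratic loop ignores the optimizations a real mapper needs.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    read_start: int
    ref_start: int   # forward-strand coordinate of the leftmost matched base
    strand: int      # +1 forward, -1 reverse
    length: int


def is_linear(a: Anchor, b: Anchor, max_gap: int = 50) -> bool:
    """True if b continues a on the same strand along roughly the same (anti-)diagonal."""
    if a.strand != b.strand or b.read_start < a.read_start + a.length:
        return False
    if a.strand == 1:
        if b.ref_start < a.ref_start + a.length:
            return False
        shift = (b.ref_start - b.read_start) - (a.ref_start - a.read_start)
    else:
        # On the reverse strand the reference coordinate decreases along the read.
        if b.ref_start + b.length > a.ref_start:
            return False
        shift = (b.ref_start + b.read_start + b.length) - (
            a.ref_start + a.read_start + a.length)
    return abs(shift) <= max_gap


def chain_anchors(anchors, bonus=5.0, penalty=-20.0):
    """Highest-scoring chain over anchors sorted by their end position in the read."""
    order = sorted(anchors, key=lambda a: a.read_start + a.length)
    if not order:
        return []
    best = [float(a.length) for a in order]
    prev = [-1] * len(order)
    for i, b in enumerate(order):
        for j in range(i):
            a = order[j]
            if a.read_start + a.length > b.read_start:
                continue  # anchors in a chain must not overlap on the read
            link = bonus if is_linear(a, b) else penalty
            candidate = best[j] + link + b.length
            if candidate > best[i]:
                best[i], prev[i] = candidate, j
    i = max(range(len(order)), key=best.__getitem__)
    path = []
    while i != -1:
        path.append(order[i])
        i = prev[i]
    return path[::-1]


def split_at_junctions(path):
    """Divide a non-linear chain into linear subalignments (lists of anchors)."""
    segments = []
    for anchor in path:
        if segments and is_linear(segments[-1][-1], anchor):
            segments[-1].append(anchor)
        else:
            segments.append([anchor])
    return segments


# Toy read spanning an inversion: forward flank, reverse-strand middle, forward flank.
toy = [Anchor(0, 1000, 1, 20), Anchor(30, 1030, 1, 20),
       Anchor(60, 1130, -1, 20), Anchor(90, 1100, -1, 20), Anchor(120, 1070, -1, 20),
       Anchor(150, 1150, 1, 20), Anchor(180, 1180, 1, 20)]
segments = split_at_junctions(chain_anchors(toy))
```

With these toy parameters the chain through the three reverse-strand anchors outscores the purely linear chain that skips them, so the read is split into a forward, a reverse, and a forward subalignment, mirroring the three subalignments expected for a simple inversion.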
Fig. 1: The workflow of VACmap non-linear alignment algorithm.
VACmap begins by identifying matching k-mers between the long read and the reference genome (blue indicates the forward strand, orange indicates the reverse strand). Next, it computes non-linear alignments and selects the one with the highest score. Finally, VACmap divides the highest-scoring non-linear alignment into multiple linear subalignments, enabling straightforward interpretation of SV.
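The first step in the caption, finding matching k-mers on both strands, can be sketched as follows. This is a naive illustration that indexes every k-mer of a short reference string. Real mappers typically use sampled seeds such as minimizers, and the function names, the default `k`, and the output tuple layout (read start, reference start, strand, length) are assumptions chosen to mirror the anchor quadruple described in the text.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_kmers(ref: str, k: int):
    """Map each k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def find_seeds(read: str, ref: str, k: int = 4):
    """Exact k-mer matches as (read_start, ref_start, strand, k) tuples."""
    index = index_kmers(ref, k)
    seeds = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        for pos in index.get(kmer, ()):            # forward-strand match
            seeds.append((i, pos, +1, k))
        for pos in index.get(revcomp(kmer), ()):   # reverse-strand match
            seeds.append((i, pos, -1, k))
    return seeds
```

Note that a k-mer equal to its own reverse complement is reported on both strands by this sketch.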
Assessing the impact of VACmap on detecting complex SVs in synthetic data
To assess the impact of VACmap on variant detection in downstream applications, we conducted a series of tests using synthetic datasets. We generated synthetic long-read datasets containing a wide range of SVs, both simple and complex, using a custom tool we developed called VACsim, addressing the absence of simulation tools for complex SVs. VACsim introduced 30,000 SVs, each composed of 1 to 20 basic SV events, including deletions, insertions, duplications, inversions, and translocations.
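As an illustration of how a complex SV can be composed from basic events, the sketch below applies deletions, insertions, tandem duplications, and inversions to a sequence one after another. It is not VACsim's implementation: the function names and the event encoding are hypothetical, and translocations are omitted because they require a second sequence.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def delete(seq, start, end):
    """Remove seq[start:end]."""
    return seq[:start] + seq[end:]


def insert(seq, pos, new):
    """Insert the string `new` before position pos."""
    return seq[:pos] + new + seq[pos:]


def tandem_duplicate(seq, start, end, copies=2):
    """Replace seq[start:end] with `copies` consecutive copies of itself."""
    return seq[:start] + seq[start:end] * copies + seq[end:]


def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    return seq[:start] + revcomp(seq[start:end]) + seq[end:]


OPERATIONS = {
    "deletion": delete,
    "insertion": insert,
    "tandem_duplication": tandem_duplicate,
    "inversion": invert,
}


def apply_events(seq, events):
    """Apply (name, args) events in order; coordinates refer to the current sequence."""
    for name, args in events:
        seq = OPERATIONS[name](seq, *args)
    return seq


# A toy "complex SV" made of two basic events on a made-up sequence.
mutated = apply_events("AAACCCGT", [("inversion", (3, 6)), ("tandem_duplication", (0, 3, 2))])
```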
As illustrated in Fig. 2a, VACmap’s alignment data enhanced SVIM’s20 ability to detect complex SVs in simulated data from PacBio CLR, PacBio HiFi, and ONT, with F1 score improvements ranging between 29.5 and 73.2 percent (refer to Supplementary Table 1). For complex SVs located within repetitive sequences, the use of VACmap-produced alignments provided gains in precision and recall, performing better than other methods by about 35.2 to 64.6 percent in F1 score (see Supplementary Table 1). Figure 2b shows the recall rates of SV detection across SV complexities and sequencing technologies. NGMLR, minimap2, Winnowmap2, and LRA21 were shown to be adequate only for identifying complex SVs comprising up to two simple SV events. Beyond this complexity, the recall rates of SVIM decreased when using alignments from these tools. Conversely, SVIM with VACmap alignments consistently displayed sensitive and reliable SV detection across the full spectrum of SV complexities.
Fig. 2: Comparison of five mapping methods in downstream complex SV detection using SVIM on synthetic data.
a The precision, recall, and F1 scores (dashed line) of SVIM’s complex SV detection performance in all chromosomes and the repetitive region (repeats) across different read depths and sequence technologies. b The recall rates of SVIM’s complex SV detection using alignments produced by five mapping methods under varying SV complexity and sequence technologies. The shaded color represents results in the repetitive region. c Box plots of the estimated tandem duplication copy numbers for each mapping method using 40-fold coverage ONT simulated data. Panels are arranged in two rows: the top row for all regions; the bottom row for repetitive regions. The green dashed line indicates the ideal (true) copy number. The orange dotted line represents the median estimated copy number across data points for each target. Notably, VACmap demonstrates improved performance in the downstream SV detection task across different SV complexities. For box plots, data are presented as median values (centre, horizontal line within each box) with bounds of the box representing the interquartile range (IQR; lower bound: first quartile or 25th percentile [Q1]; upper bound: third quartile or 75th percentile [Q3]). Whiskers extend from the box bounds to the minima (smallest value within 1.5 × IQR of Q1) and maxima (largest value within 1.5 × IQR of Q3); outliers beyond whiskers are not shown (filtered). No error bars are displayed. For all regions: VACmap n = 9557; minimap2 n = 7227; NGMLR n = 8626; Winnowmap2 n = 5674; LRA n = 0. For repetitive regions: VACmap n = 1061; minimap2 n = 387; NGMLR n = 324; Winnowmap2 n = 224; LRA n = 0. Source data are provided as a Source Data file.
For precise gene copy number quantification, particularly in tandem duplications that might influence protein levels, accurate mapping is essential. To investigate the performance of copy number estimations with alignments produced by different aligners, we generated 10,000 tandem duplications on chromosome 1 using VACsim. These duplications had repeat unit sizes ranging from 100 to 500 base pairs and repeat counts between 1 and 20. SVIM was employed to estimate the copy number for each tandem duplication from the various aligners’ alignments. According to the results depicted in Fig. 2c and Supplementary Table 2, alignments from current mapping methods led to a bias in copy number estimation. There was a decline in the linear correlation between the actual and estimated copy numbers as the repeat count grew, especially within repetitive areas. In contrast, alignments from VACmap resulted in more precise copy number estimates across diverse copy number intervals and within repetitive regions. This underscores VACmap’s capability in accurately ascertaining the copy number of tandem duplications, indicating its effectiveness and accuracy in dealing with complex genomic structures.
Evaluation using genome in a bottle benchmark
We evaluated the SV detection performance of VACmap, NGMLR, Winnowmap2, minimap2, and LRA alignments using SVIM and cuteSV with the GIAB benchmark set4,20,21,22,23,24,25. Truvari26 was used to assess precision, recall, and F1 scores. Before evaluation, SVIM’s and cuteSV’s tandem duplication calls were relabeled as insertions to allow comparison with the GIAB assembly-derived benchmark. As expected, all five alignment approaches demonstrated similar performance in detecting deletions and insertions in both GIAB tier 1 and CMRG regions (Fig. 3a, b). VACmap runs faster than NGMLR and at a speed comparable to Winnowmap2 and LRA, but slower than minimap2. However, VACmap uses less memory than the other aligners (Supplementary Table 3). It should be noted that NGMLR is no longer actively maintained, which may contribute to its performance limitations compared to more actively developed tools.
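For reference, the three metrics follow the standard definitions in terms of true positives, false positives, and false negatives. The short sketch below computes them from raw counts. The counts in the usage line are made up, and the function is a generic illustration, not Truvari's code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1 from benchmark comparison counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```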
Fig. 3: Comparison of five mapping methods in downstream SV detection using HG002.
a–d Performance assessment of SVIM and cuteSV using five aligners’ alignments on GIAB Tier 1 and CMRG benchmarks. e Distribution of SV types (deletions [DEL], duplications [DUP], insertions [INS], and inversions [INV]) and their size ranges detected by SVIM using alignments produced by VACmap and minimap2. VACmap alignments revealed a broader and more balanced distribution of SV types and sizes compared to minimap2, which exhibited biases toward specific SV categories and sizes. These results highlight the advantages of VACmap in comprehensive SV detection. f Venn diagram showing the overlap of inversions detected by SVIM using alignments from VACmap, NGMLR, minimap2, Winnowmap2, and LRA on PacBio HiFi data. VACmap enabled the detection of the highest number of unique inversions compared to the other aligners. Source data are provided as a Source Data file.
To evaluate SVIM’s and cuteSV’s sensitivity in detecting duplications using alignments from different tools, we isolated tandem duplication calls within the GIAB benchmark set using the REPTYPE annotation. The results (Fig. 3c, d) showed that SVIM, using VACmap-produced alignments, exhibited high sensitivity for duplication detection, identifying approximately 70% to 80% more duplications compared to other alignment approaches in the GIAB tier 1 and CMRG regions, respectively. This matters for interpreting the impact of SVs. Additionally, the SV distribution detected with VACmap alignments showed notable differences compared to other aligners (Fig. 3e and Supplementary Fig. 1). VACmap indicated that more than 67% of the sequence gain was due to duplications, consistent with previous findings14. In contrast, minimap2 attributed only 1% of the total sequence gain to duplications. This discrepancy in SV classification is critical for interpreting the biological impact of SVs, underscoring the importance of accurate SV detection.
VACmap’s ability to accurately map duplicated segments also enabled us to characterize a previously reported de novo variation27 (Fig. 4a–c and Supplementary Fig. 2). This variation, located within a Tandem Repeat (TR) region at chr14:23,280,711 (GRCh38), was originally labeled as a de novo insertion, as different insertion sizes were observed in the child (HG002: 537 bp) and the parents (HG003: 214 bp and HG004: 15 bp). However, with VACmap’s alignment, what was initially thought to be an insertion was revealed to be a 109-bp Variable Number Tandem Repeat (VNTR), with varying repeat counts in the child (five repeats) and the paternal parent (two repeats). TR regions are known to be variable in the number of repeats, often changing between generations due to mechanisms like replication slippage and unequal crossing over during meiosis. These processes can lead to differences in repeat counts, which explains the variation observed between the child and the father in this case.
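The consistency between the reported insertion sizes and the 109-bp repeat unit can be checked with simple arithmetic: 537 / 109 ≈ 4.9 and 214 / 109 ≈ 2.0, which round to the five and two repeats reported above. The sketch below only restates this arithmetic. It is not how VACmap or SVIM derive repeat counts, and the function name is hypothetical.

```python
def estimate_repeat_copies(insertion_len: int, unit_len: int):
    """Nearest whole number of repeat units and the leftover length in bp."""
    copies = round(insertion_len / unit_len)
    residual = insertion_len - copies * unit_len  # negative if slightly short
    return copies, residual


hg002_copies, hg002_residual = estimate_repeat_copies(537, 109)  # child
hg003_copies, hg003_residual = estimate_repeat_copies(214, 109)  # father
```

The small residuals (a few base pairs) show that both insertion sizes are close to whole multiples of the repeat unit.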
Fig. 4: Enhanced detection of complex variants by VACmap.
a, b VACmap (a) accurately identifies a 109-bp Variable Number Tandem Repeat (VNTR) at chr14:23,280,711 (GRCh38), with five copies in HG002 and two copies in HG003. In contrast, minimap2 (b) misclassifies the same variation as a 537-bp insertion in HG002 and 214-bp insertion in HG003. c Schematic representation of the correct repeat structures identified by VACmap in HG002 and HG003, compared to minimap2’s misinterpretation as insertions. d VACmap detects precise breakpoints for a 16-kb inversion in the SPIDR gene (blue dashed line), while other aligners show more mismatched bases and incorrect breakpoints (red dashed lines). e Alignment score comparison highlighting VACmap’s ability to switch between forward and reverse strands, resulting in more accurate inversion breakpoint detection than minimap2.
This example highlights a limitation of conventional alignment algorithms, which often misinterpret duplications as insertions. Traditional aligners rely on maintaining the relative order of sequences when aligning them. However, duplications disrupt this order, making it difficult for linear aligners to correctly map such regions. As a result, duplications are often misaligned as insertions or entirely ignored. In contrast, VACmap’s non-linear alignment approach accurately handles these complex repeat structures, providing a more precise representation of the true genetic variation.
Enhance the characterization of complex inversions in repetitive regions
We then analyzed the inversion callsets generated by five different SV detection pipelines. The VACmap-SVIM callsets captured nearly all of the inversions (105 out of 116) identified by the combined callsets of minimap2, Winnowmap2, NGMLR and LRA, and additionally uncovered 97 inversions not detected by these approaches (Fig. 3f). When comparing inversions that overlapped with a previously reported callset28, the VACmap-SVIM pipeline identified nearly all the inversions (48 out of 49) detected by the other three pipelines, while also discovering 14 inversions that were missed by the other methods (Supplementary Fig. 3). Upon manual inspection of an inversion missed by VACmap-SVIM, we found a more complex structure—an inversion flanked by an inverted duplication and deletion. While VACmap could resolve this complex structure, SVIM failed to detect it because the intricate structure did not align with SVIM’s predefined rules for identifying inversions (Supplementary Fig. 4).
These results highlight that inversions remain challenging to resolve because they are often flanked by large segmental duplications. To investigate this further, we analyzed the combined call set of 213 inversion regions from the five aligners. Across all inversions, 32% (68/213) overlap with segmental duplications, and more than half of these (39/68) are detectable only through VACmap alignments. For instance, VACmap alignment enables accurate identification of a homozygous 16-kb inversion located in the SPIDR gene (Fig. 4d), a gene involved in DNA repair and associated with gonadal dysgenesis29. In contrast, the alignments of other aligners are less reliable, as they show more mismatched bases (i.e., a signal of incorrectly mapped reads20) and inconsistent breakpoints across different read alignments. The standard deviation of inversion sizes called by SVIM is 291.4 for VACmap alignments and 2066.4 for NGMLR alignments. SVIM treats a higher variance as an unreliable SV prediction and assigns a lower quality score: the SVIM quality score for this inversion is 14 with VACmap alignments and 0 with NGMLR alignments, and the latter call will be discarded.
Figure 4e demonstrates why minimap2 and other linear aligners fail to accurately pinpoint inversion breakpoints. Linear alignment methods, such as minimap2, rely on heuristic strategies like the Z-drop heuristic to infer breakpoints18. These methods monitor the alignment score and split the alignment when the score drops below a predefined threshold (indicated by the red dashed line in the figure). However, this approach often fails to identify the precise breakpoint because after the inversion, the sequence in the read is not significantly divergent from the reference. As shown in the figure, the alignment score continues to increase slowly rather than showing a sharp drop, leading minimap2 to incorrectly place the breakpoint upstream (marked by the red dashed line).
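The behaviour described above can be reproduced with a simplified Z-drop check: the extension stops once the running alignment score falls more than a threshold below the best score seen so far. This sketch is an assumption-laden simplification rather than minimap2's code. It operates on a hypothetical list of cumulative scores and omits the gap-length adjustment of the published heuristic.

```python
def zdrop_break(scores, zdrop):
    """Return (stop_index, best_index) at the first Z-drop event, or None.

    scores: cumulative alignment score at each extension step (hypothetical input).
    zdrop:  allowed drop below the best score seen so far.
    """
    best, best_i = scores[0], 0
    for i, score in enumerate(scores):
        if score > best:
            best, best_i = score, i
        elif best - score > zdrop:
            return i, best_i  # stop here; the alignment is cut back to best_i
    return None


# Sharp divergence: the score collapses after the breakpoint, so a break is found.
sharp = zdrop_break([10, 20, 30, 40, 30, 15, 0], zdrop=20)
# Slow increase past the true breakpoint: no drop, so no break is triggered.
slow = zdrop_break([10, 20, 30, 40, 41, 42, 43], zdrop=20)
```

The second case mirrors the situation in the text: when the score keeps creeping upward past an inversion breakpoint, a drop-based rule has nothing to react to.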
In contrast, VACmap’s non-linear alignment algorithm can simultaneously evaluate both forward and reverse strands (blue and orange curves, respectively) and automatically switch between them to maximize the alignment score. This allows VACmap to correctly identify the true inversion breakpoint, as it can seamlessly align both strands and capture subtle changes in the alignment score. The result is a more accurate alignment and a precise breakpoint, as reflected in the figure, where VACmap’s breakpoint (blue dashed line) aligns with the actual inversion. Supplementary Figs. 5–9 provide further examples of how VACmap performs better than traditional aligners in mapping complex inversions.
Improve identification on SIGLEC11::SIGLEC16 and RHCE::RHD gene conversion
Gene conversion is a challenging form of SV that is difficult to capture accurately using current alignment algorithms and SV detection tools. Figure 5a and Supplementary Fig. 10 illustrate an inversion initially misidentified by SVIM, which was actually a gene conversion event between the SIGLEC11 and SIGLEC16 genes on the maternal haplotype. These two genes share highly similar sequences in the regions encoding their extracellular domains, due to past gene conversion events30. The most recent conversion, which occurred approximately one million years ago, involved regions A in SIGLEC11 and A* in SIGLEC1630 (Fig. 5b). However, VACmap’s alignment revealed a gene conversion event involving different regions, B and B*, in these two genes.
Fig. 5: Comparison of five aligners on gene conversion events.
a The IGV visualization of the SIGLEC11 and SIGLEC16 gene conversion event, the SVIM inversion prediction is shown in the top panel. b Proposed scenario of gene conversions between SIGLEC11 and SIGLEC16 loci. c The IGV visualization of a potential RHD and RHCE gene conversion event. d, e The IGV visualization of SIGLEC11/SIGLEC16 and RHD/RHCE gene conversion events using the HG002 assembly.
Notably, the B* region in SIGLEC16 had previously been flagged by the GIAB consortium due to a cluster of heterozygous small variants23. However, GIAB’s alignment methods, which rely on minimap2, were unable to resolve this gene conversion, resulting in numerous false-positive SNP calls in both the GIAB CMRG benchmark set and the draft release of the GIAB T2T SV benchmark (Fig. 5d). This outcome is not surprising given minimap2’s limitations in handling complex rearrangements, as it struggles to split reads or assemblies appropriately to represent gene conversion events, leading to misalignments and erroneous variant calls.
Additionally, VACmap successfully resolved a homozygous gene conversion event between the RHCE and RHD genes, which had been inaccurately represented by existing aligners (Fig. 5c). This correction reduced over a hundred false-positive SNP and indel calls in the GIAB benchmark sets (Fig. 5e and Supplementary Figs. 11 and 12). This highlights VACmap’s ability to detect and accurately characterize gene conversion events that are typically missed or misclassified by conventional linear alignment methods.
Evaluation using the LPA, GBA1, and STRC genes
We next assessed the LPA gene to highlight a medically important region whose analysis is further improved by VACmap. The complexity of this region arises from its high diversity in the population, with 5–40 copies of the KIV-2 repeat in the LPA gene10. This copy number is inversely correlated with human lipoprotein(a) levels, which are strongly linked to coronary heart disease10. However, accurately quantifying the KIV-2 copy number is challenging because of the repetitiveness, and thus the low mappability, of sequencing reads in the LPA gene region31. We assessed the performance of five mapping methods by aligning PacBio HiFi and ONT sequencing data from human samples (CHM13 and HG002) against the GRCh38 reference genome. IGV visualizations revealed that NGMLR, Winnowmap2, minimap2, and LRA produced alignments with more mismatches and less informative coverage compared to VACmap (Fig. 6a). VACmap accurately represented the KIV-2 repeats, showing clear and distinct coverage boundaries (Supplementary Figs. 13–15).
Fig. 6: Comparison of five aligners on the LPA gene.
a, b The IGV visualization of alignments produced by five aligners using GRCh38 and modified GRCh38 reference in the KIV-2 region. c The illustration of GRCh38 reference modification and the exon structure of the KIV-2 domain. The exon 2 (red) in the type A KIV-2 repeat unit, type B KIV-2 repeat unit, and KIV-1 repeat unit have 100% identical sequences. The exon 1 (purple) in the KIV-3 repeat unit and type B KIV-2 have 100% identical sequences. The light blue and light orange regions indicate the reserved and removed regions, respectively. d The dot plot depicts non-linear alignments generated by the VACmap algorithm of GRCh38, CHM13, and HG002 assembly against the modified GRCh38 reference. e, f The alignment scheme of the type A KIV-2 sequence and type B KIV-2 sequence against the modified GRCh38 reference.
To simplify KIV-2 copy number determination, we modified the GRCh38 reference by removing the second to sixth KIV-2 repeat units and including the first 1000 bp of the following KIV-1 unit (Fig. 6c). We then realigned the PacBio HiFi and ONT data to the modified reference. The IGV visualizations indicated that VACmap-produced alignments (Fig. 6b and Supplementary Figs. 16–18) showed the expected alignment scheme of both type A and type B KIV-2 units (Fig. 6e, f). Other mapping methods struggled to produce correct alignments despite the reduced complexity of the modified reference. Furthermore, the ONT reads facilitated the resolution of all 23 copies of the KIV-2 repeat unit in the CHM13 sample owing to their longer read length compared to PacBio HiFi data (Supplementary Fig. 19).
Then, we aligned the GRCh38 assembly, CHM13 assembly, and HG002 assembly to the modified GRCh38 reference. The non-linear alignment of these three assemblies is shown in Fig. 6d. Consistent with previous findings10, we found the GRCh38 assembly consisted of six copies of KIV-2 repeat units with the pattern “AAABAA” where “A” indicates the type A KIV-2 repeat unit, and “B” indicates the type B KIV-2 repeat unit. In the CHM13 assembly, 23 KIV-2 repeat units were identified, following the pattern: “BBBBBBAABAAAAAAAAAAAAAA”. Similarly, the HG002 paternal assembly contains 24 KIV-2 repeat units with the pattern “BBBBBBAAABAAAAAAAAAAAAAA”, while the HG002 maternal assembly consists of 14 KIV-2 repeat units arranged as “AAAAAAAAAAAAAA”.
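Repeat-unit patterns such as these are easy to summarize programmatically. The sketch below run-length encodes a pattern string and tallies the type A and type B units, shown here on the six-unit GRCh38 pattern. The helper names are hypothetical and the code is only a convenience for reading such patterns, not part of the analysis pipeline.

```python
from collections import Counter
from itertools import groupby


def run_length(pattern: str):
    """Collapse a pattern like 'AAABAA' into (unit_type, run_length) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(pattern)]


def summarize(pattern: str):
    """Total number of repeat units and the count of each unit type."""
    return len(pattern), dict(Counter(pattern))


grch38_runs = run_length("AAABAA")
grch38_total, grch38_counts = summarize("AAABAA")
```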
To further demonstrate the clinical utility of VACmap, we chose GBA1, a major risk factor for Parkinson’s disease32 and a challenging gene to analyze33 because it is prone to structural variants caused by recombination with a nearby, highly homologous pseudogene (GBAP1). Using ONT long reads with adaptive sampling, we previously detected a pathogenic deletion that could not be correctly called after minimap2 or NGMLR alignment34. In contrast, VACmap allowed SVIM and cuteSV to correctly report the breakpoints (Fig. 7a), which is crucial in determining whether a deletion is pathogenic. Similarly, the STRC gene is a known deafness-associated gene causing mild-to-moderate hearing loss34 that is inherited in an autosomal recessive manner. However, variants in STRC are hard to detect because the gene lies in a tandem duplication region and has a highly homologous (>99%) pseudogene (STRCP1)35. By examining the GIAB samples using VACmap-produced alignments (Fig. 7b), we identified a heterozygous deletion in NA1924036 involving the loss of one copy of the CKMT1B-STRC-CATSPER2 gene cluster. In contrast, none of the four other aligners pinpointed the deletion. In addition, Duplomap37, a specialized aligner for remapping reads in tandem duplications, cannot detect the deletion, since it uses minimap2 internally, which is often reluctant to split reads.
Fig. 7: Comparison of five aligners on GBA1/GBAP1 and STRC/STRCP1 gene.
a The IGV visualization of alignments produced by five aligners in the GBA1 / GBAP1 region, with the SVIM, cuteSV deletion call shown in the top panel. b The IGV visualization of alignments produced by five aligners in the STRC/STRCP1 region, with the SVIM, cuteSV deletion call shown in the top panel.
Discussion
Sequence alignment is a fundamental starting point for virtually all genomic research and clinical diagnostics. It serves as the crucial bridge between raw sequencing data and the biological insights necessary for understanding genetic variation, evolutionary biology, and the molecular basis of diseases. Accurate alignment of sequencing reads to a reference genome is essential for a wide range of applications, including variant detection, comparative genomics, and personalized medicine. The precision and reliability of this initial alignment process directly influence the quality of downstream analyses, impacting our ability to identify genetic variations such as SNPs, indels, and critically SVs.
Despite significant advancements in sequencing technologies, particularly with the advent of long-read sequencing platforms, accurately aligning reads that encompass complex genomic rearrangements remains a formidable challenge. Traditional linear alignment algorithms are often inadequate for handling large-scale SVs such as inversions, duplications, translocations, and complex combinations of these events. These limitations create a cascade of analytical failures: when alignment is incorrect, subsequent analyses become unreliable or impossible, regardless of the sophistication of downstream tools. As a consequence, crucial SVs—including those with significant medical relevance—may be misrepresented or entirely missed, impeding our ability to fully understand their biological significance and clinical implications.
Graph-based genome representations offer significant advantages over linear reference genomes by providing a flexible framework to encode SVs such as duplications, inversions, and translocations as graph structures, enabling a more comprehensive representation of genetic diversity across populations38. This approach can facilitate the integration of multiple genomes into a single graph, potentially improving variant calling and haplotype resolution in complex regions. However, we show that existing algorithms, such as minigraph39, encounter substantial difficulties in producing correct genome graphs (Supplementary Note 1). Their reliance on co-linear matching during graph construction and read alignment often results in erroneous topologies, misinterpreting non-linear SVs (e.g., duplications as insertions or inversions as misalignments).
To address these challenges, we present VACmap. VACmap breaks through the long-standing barrier of inaccurately representing complex variants. This is achieved via a non-linear mapping approach, and we demonstrate the need for this method, especially for inversions and for medically important but challenging genes such as LPA, GBA1, and STRC. Indeed, inversions remain challenging to resolve, particularly because they are often surrounded by large segmental duplications28. Furthermore, these regions often harbor events more complex than simple inversions. Neither complex nor simple inversions are routinely detectable with state-of-the-art methods28, despite their clinical importance15. VACmap enables this detection with more precise alignments of read segments than the other methods evaluated here, owing to its non-linear mapping approach. This further improves the characterization of complex duplications, as shown for the KIV-2 region in LPA, and of gene/pseudogene recombination, as shown for GBA1 and STRC. VACmap can more precisely recapitulate the exact breakpoints within the reads, which leads to improved detection and thus provides more insight. These are only a few examples of the many medically important but challenging genes for which VACmap can improve analysis and thus deliver a more precise picture of the variants that are currently often missed by analytical methods23.
Methods
Ethics statement
Ethics approval for the GBA1 carrier was provided by the National Research Ethics Service London—Hampstead Ethics Committee as part of the RAPSODI study (www.rapsodi.com)40. Informed consent was provided.
Variant-aware chaining algorithm
Algorithm overview
Tradition