Main
A comprehensive understanding of the genetic architecture that underlies phenotypic diversity requires looking beyond single-nucleotide polymorphisms (SNPs) to encompass the full spectrum of genetic variation4,5,6,7,8. Although genome-wide association studies (GWASs) have uncovered thousands of loci linked to complex traits, they have historically concentrated on small variants, mainly SNPs, largely owing to constraints in detecting larger, more complex variants at the species level12,13,23. In this context, structural variants (SVs), including insertions, deletions, duplications and rearrangements, remain underexplored despite their potential to exert substantial phenotypic effects in complex traits. Emerging long-read sequencing strategies and pangenome approaches now enable high-resolution detection of SVs at the population level9,14,15,16,17,18,24,25,26, but assembling complete, telomere-to-telomere genomes for large cohorts remains a challenge.
On the phenotypic front, integrating molecular traits such as transcript, protein, and metabolite levels with organismal phenotypes provides a more detailed view of trait architecture20,27,28,29,30,31,32,33. Multilayered phenotypic data across large, natural populations are still uncommon. The budding yeast S. cerevisiae represents a unique opportunity in this context. With over 1,000 natural isolates spanning diverse ecological and geographic origins19, and rich datasets capturing both organismal and molecular phenotypes19,20,21,22, it is ideally suited for dissecting the contribution of complex variants to trait diversity. However, the lack of a population-scale SV catalogue has thus far limited our ability to fully resolve how different variant types shape phenotypic variation.
Here, we assembled near telomere-to-telomere genomes for 1,086 natural S. cerevisiae isolates using long-read sequencing, enabling a comprehensive catalogue of SVs and gene content diversity at the species level. By integrating this genomic resource with 8,391 molecular and organismal traits, we reveal that SVs are more frequently associated with phenotypic variation and exhibit greater pleiotropy than SNPs and small (less than 50 bp) insertion–deletion mutations (indels), particularly for organismal traits. A graph-based pangenome uncovered 2.5 Mb of non-reference sequence, underscoring the extent of uncharted genomic diversity. This study addresses a critical gap in our understanding of how different types of genetic variation contribute to phenotypic diversity.
High-quality assemblies of 1,482 genomes
To comprehensively capture species-wide diversity, we sequenced 989 natural S. cerevisiae isolates using Oxford Nanopore technology (ONT)19,34 (Fig. 1a), achieving an average depth of 95× and an N50 of 19.1 kb (Fig. 1b, Supplementary Fig. 1 and Supplementary Table 1). We supplemented this with ONT data from 14 beer isolates35 and 24 Taiwanese isolates36, for a total of 1,027 isolates. A hybrid assembly pipeline was utilized to maximize contiguity and completeness (Supplementary Fig. 2 and Methods), yielding chromosome-scale assemblies for 1,015 isolates. We also included 71 assemblies from the S. cerevisiae reference assembly panel7, resulting in 1,086 isolates overall.
Fig. 1: General framework and genome assembly for 1,086 isolates.
a, Schematics of the pangenome and association analyses. eQTL, expression QTL; pQTL, protein QTL; gCNVs, genomic CNVs. b, Long-read sequencing depth and reads N50 per isolate, for 989 newly sequenced isolates. The middle bar of the box plots corresponds to the median; the upper and lower bounds correspond to the third and first quartiles, respectively. The whiskers correspond to the upper and lower bounds 1.5 times the interquartile range (IQR). c, Haplotype resolution of genome assemblies for 1,086 isolates. d, Assembly contiguity represented as Nx value (length of the shortest contig in the group of the longest contigs that represent x% of the assembly length) for 1,482 assemblies. The white line represents the reference genome assembly, and the blue dashed line corresponds to the mean value for all assemblies.
These isolates vary in ploidy and zygosity, with 75% being diploid, of which 55.2% are heterozygous (Fig. 1c and Supplementary Table 2). Haplotype-resolved assemblies were generated for 396 of the 456 non-polyploid heterozygous isolates, and the remaining 60 were assembled in collapsed form (Fig. 1c and Supplementary Note 1). Altogether, we generated 1,482 high-quality assemblies across the 1,086 isolates (Supplementary Table 2).
Assembly quality was assessed across several metrics. Contiguity matched the reference genome (Fig. 1d), with a median of 1.06 contigs per chromosome and 97.2% of the chromosomes assembled into a single contig (Extended Data Fig. 1a, Supplementary Table 2 and Supplementary Note 2). Assembly sizes ranged from 11.17 Mb to 12.95 Mb (mean = 11.90 ± 0.17 Mb) (Extended Data Fig. 1b). Accuracy, based on Illumina data and Merqury estimates, had an average Merqury quality value of 41.5 (Extended Data Fig. 1c), and completeness averaged 99.1% by BUSCO, closely matching the reference score37 of 99.4% (Extended Data Fig. 1d). These results confirm that, although they do not always encompass the entire telomere-to-telomere sequence (Supplementary Note 2), the assemblies exhibit contiguity and completeness close to those of the reference genome, reaching near telomere-to-telomere status.
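The Nx contiguity metric used in Fig. 1d can be illustrated with a minimal sketch. It follows the definition given in the figure legend (the length of the shortest contig among the longest contigs that together represent x% of the assembly length). The function name `nx` and the toy contig lengths are illustrative assumptions, not part of the study's pipeline.

```python
from typing import Sequence


def nx(contig_lengths: Sequence[int], x: float = 50.0) -> int:
    """Return Nx: the length of the shortest contig among the longest
    contigs that together cover at least x% of the assembly length."""
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    target = sum(contig_lengths) * x / 100.0
    running = 0
    # Walk contigs from longest to shortest until the target is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # Only reachable through floating-point rounding at x = 100.
    return min(contig_lengths)


# Hypothetical contig lengths (arbitrary units), not data from the study.
toy_contigs = [100, 80, 60, 40, 20]
n50 = nx(toy_contigs, 50)  # 100 + 80 = 180 covers half of 300
n90 = nx(toy_contigs, 90)  # 100 + 80 + 60 + 40 = 280 covers 270
print(n50, n90)  # prints "80 40"
```

Evaluating `nx` over a range of x values for each assembly would give one curve per assembly, which is the kind of representation the legend of Fig. 1d describes.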
Species-wide SV spectrum
This comprehensive set of genome assemblies enabled accurate detection of SVs (variants larger than 50 bp) in a highly diverse population. Through pairwise alignment of the assemblies with the S288c reference genome (Fig. 1a), we identified a total of 262,629 redundant SVs across 1,086 isolates, corresponding to 6,587 unique events. Systematic validation of 500 SVs based on the mapping of short-read sequencing data confirmed sequence disruption in 95% of the SV calls (Supplementary Note 3). SVs were classified into four categories: presence–absence variations (PAVs, 4,755 events), segmental copy-number variations (CNVs) (1,207), inversions (231) and translocations (394) (Fig. 2a). Together, these SVs span a total of 27.3 Mb of sequences, excluding translocations. Transposable elements (TEs), particularly Ty elements, are major contributors to SVs in S. cerevisiae. Ty elements are found spanning over 50% of the SV sequence in 39% of PAVs (1,834 events), 20% of inversions (46 events) and 9% of CNVs (104 events) (Supplementary Table 3).
Fig. 2: SV landscape.
a, Number of non-redundant SV events per type and frequency in the population. Frequency categories are rare (MAF < 1%), low-frequency (1% ≤ MAF < 5%) and common (MAF ≥ 5%). INV, inversions; TRA, translocations. b, Rarefaction curves and extrapolation for each type of SV. c, MAF of TE-related and non-TE-related SVs. P value was calculated using a two-sided Mann–Whitney–Wilcoxon test (****P = 5.2 × 10−39). The middle bar of the box plots corresponds to the median; the upper and lower bounds correspond to the third and first quartiles, respectively. The whiskers correspond to the upper and lower bounds 1.5 times the IQR. d, Enrichment of SNPs, indels and SVs in subtelomeric regions. P values were computed using two-sided Fisher’s exact tests with FDR correction (****P = 0 for SNPs-indels, ****P = 3.5 × 10−234 for SNPs-SVs and ****P = 2.9 × 10−92 for indels-SVs). e, Structural diversity along chromosomes, represented by outer blue rectangles, for each type of SV. Blue points correspond to regions that are outliers in structural diversity. The inner plot represents a map of translocations, coloured according to their MAF. f, Proportion of the SV types in the SV signature of wine, beer, Asian fermentation (AF) and wild isolates. Total represents all the SVs involved in any clade signature. P values were computed using Pearson’s chi-squared test with FDR correction (****P = 7.2 × 10−5). g, Correlation between the number of SNPs and SVs across 970 non-polyploid isolates using a Spearman correlation test (P = 4.4 × 10−211). Larger points correspond to the average value per clade. Coloured points indicate deviation from the correlation using Pearson’s chi-squared test with Bonferroni correction (P = 9.4 × 10−4 (AU wine 2), P = 9.8 × 10−5 (Alpechín), P = 3.2 × 10−4 (Georgian wine), P = 0.024 (Belgium beer 1), P = 0.049 (French dairy), P = 3.8 × 10−16 (Chinese wild)).
The large size of our population enabled precise quantification of SV diversity in S. cerevisiae. A median of 289 SVs was observed between isolates. We calculated a structural diversity of 2.0 × 10−5, defined as the average number of SVs per site between isolate pairs, two orders of magnitude lower than nucleotide diversity in the same population19 (Methods). Extrapolating SV accumulation with increasing sample size, we estimated a total of 7,237 SVs, indicating that our dataset captures more than 90% of all SV events in the species (Fig. 2b). Capture rates varied slightly by SV type, from 92.4% of inversions to 83.7% of translocations (Supplementary Table 4). We also estimated species coverage, the proportion of redundant SVs recovered, which reached 99.5%, suggesting nearly complete representation of shared SVs (Extended Data Fig. 2 and Supplementary Table 4).
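To make the capture-rate reasoning concrete, the sketch below estimates total SV richness from per-isolate SV sets and divides the observed count by the estimate. The extrapolation procedure actually used is described in the paper's Methods, not in this section; the Chao2 incidence-based estimator shown here is a stand-in chosen for illustration, and the toy SV identifiers are hypothetical. The last lines only redo the arithmetic on the figures reported above (6,587 observed events, 7,237 estimated).

```python
from collections import Counter
from typing import Iterable, Set


def chao2(samples: Iterable[Set[str]]) -> float:
    """Chao2 richness estimate from per-sample presence sets.

    Stand-in estimator (an assumption of this sketch): it uses the number
    of variants seen in exactly one sample (q1) and exactly two samples (q2).
    """
    sample_list = list(samples)
    m = len(sample_list)
    if m == 0:
        raise ValueError("at least one sample is required")
    incidence = Counter(v for s in sample_list for v in s)
    s_obs = len(incidence)
    q1 = sum(1 for c in incidence.values() if c == 1)
    q2 = sum(1 for c in incidence.values() if c == 2)
    correction = (m - 1) / m
    if q2 > 0:
        return s_obs + correction * q1 * q1 / (2 * q2)
    # Bias-corrected form, used when no variant is seen in exactly two samples.
    return s_obs + correction * q1 * (q1 - 1) / 2


# Hypothetical SV calls for four isolates (not data from the study).
toy_isolates = [
    {"sv1", "sv2", "sv3"},
    {"sv1", "sv2", "sv4"},
    {"sv1", "sv3"},
    {"sv1", "sv5"},
]
toy_estimate = chao2(toy_isolates)   # 5 observed, 2 singletons, 2 doubletons
toy_capture = 5 / toy_estimate

# Arithmetic on the values reported in the text.
reported_capture = 6587 / 7237
print(f"{reported_capture:.1%}")  # prints "91.0%"
```

The reported ratio of about 91% is consistent with the statement that the dataset captures more than 90% of SV events in the species.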
The large sample size also enabled accurate estimation of SV allele frequencies in the population. Similar to SNPs, SVs are skewed towards low frequencies: 69% are rare (minor allele frequency (MAF) < 1%), 20% are low-frequency (1% ≤ MAF < 5%) and only 11% are common (MAF ≥ 5%) (Fig. 2a). Frequency patterns varied by SV type: translocations and inversions were rarer than PAVs and CNVs (Extended Data Fig. 3a). Their site frequency spectra resembled those of nonsense mutations19, suggesting strong deleterious effects (Extended Data Fig. 3b). Additionally, Ty-related SVs were more frequently shared across isolates than non-Ty-related ones (Fig. 2c and Extended Data Fig. 3c).
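The frequency categories above can be expressed as a short sketch. The thresholds (1% and 5%) come from the text; the function names and the toy allele counts are illustrative assumptions, and the sketch treats each call as a simple allele count without modelling ploidy or zygosity.

```python
def minor_allele_frequency(alt_count: int, total_alleles: int) -> float:
    """Return the frequency of the less common allele at a biallelic site."""
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("counts must satisfy 0 <= alt_count <= total_alleles")
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def frequency_class(maf: float) -> str:
    """Bin a MAF into the categories used in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf < 0.01:
        return "rare"
    if maf < 0.05:
        return "low-frequency"
    return "common"


# Hypothetical alternate-allele counts out of 1,000 alleles (not study data).
toy_counts = {"svA": 5, "svB": 30, "svC": 400, "svD": 980}
toy_classes = {
    name: frequency_class(minor_allele_frequency(count, 1000))
    for name, count in toy_counts.items()
}
print(toy_classes)
```

Note that `svD` is the major allele at its site, so its MAF is 2% and it falls in the low-frequency bin.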
Finally, we used haplotype-resolved assemblies from 396 heterozygous isolates to assess structural heterozygosity in S. cerevisiae. The proportion of heterozygous SVs per isolate ranged from 11% to 94% (Extended Data Fig. 4a,b) and was strongly correlated with SNP heterozygosity (Spearman R = 0.79, P < 2.2 × 10−16; Extended Data Fig. 4c). SVs in subtelomeric regions showed higher heterozygosity (Extended Data Fig. 4d), consistent with the known structural variability of these regions. Inversions and translocations exhibited higher heterozygosity than PAVs and CNVs (Extended Data Fig. 4e). SV length was correlated with heterozygosity (Spearman R = 0.55, P = 2.6 × 10−8; Extended Data Fig. 4f), with SVs greater than 30 kb in size showing a marked shift towards heterozygosity (78% heterozygous) compared with smaller SVs (less than 30 kb, 45% heterozygous). This size effect helps explain the increased heterozygosity observed for typically larger SV classes such as inversions and translocations (Extended Data Fig. 4g).
Genomic distribution of SVs
We analysed the genomic distribution of SVs and found it to be highly uneven, with significant enrichment in subtelomeric regions (two-sided Fisher exact test, P = 1.1 × 10−309). Although SNPs and indels are also enriched in these regions, the enrichment is much stronger for SVs (Fig. 2d). By computing structural diversity along the genome, we identified bursts of diversity that are often specific to a single SV type (Fig. 2e) and defined 46 SV hotspots (Supplementary Table 5), including 21 translocation hotspots, nearly all in subtelomeric regions. One notable exception is a well-described reciprocal translocation between chromosomes 8 and 16, which confers sulfite resistance through the overexpression of the SSU1 gene located near the breakpoint38 (Fig. 2e). Hotspots for PAVs, CNVs and inversions often mapped to TE-rich regions or SV-prone genes, such as FLO39,40, CUP1 (ref. 41), and genes with tandem repeats, such as HPF1, SPA2 and NUM1 (refs. 42,43). An inversion hotspot on chromosome 14 overlaps a 24-kb region flanked by inverted repeats, probably driven by recombination44.
These findings indicate that SV hotspots arise either from genome fragility, linked to Ty elements or repetitive sequences, or from adaptive pressures targeting specific genes. To distinguish these mechanisms, we examined the distribution of hotspot SVs across species clades with known ecological origins45. Out of the 46 hotspots, 23 were evenly distributed across clades and localized to fragile regions (Supplementary Table 5), whereas the remaining 23 were clade-enriched, reflecting either population bottlenecks or local adaptation—for example, SVs associated with sulfite and copper resistance were enriched in wine isolates (Supplementary Table 5).
SV diversity and population structure
To explore the relationship between population structure and SV diversity, we built separate phylogenies using SNP and SV genotypes (Supplementary Fig. 3). Despite minor differences, the overall tree topology was conserved, with clades clustering consistently, indicating distinct SV signatures per clade. Using allelic enrichment of non-singleton SVs, we identified 1,933 SV alleles that were significantly over-represented in at least one clade, resulting in 3,559 clade–SV associations (Supplementary Table 6). The types of SVs involved varied by clade—for example, translocations were enriched in wine clades (Pearson’s chi-squared test, P = 7.2 × 10−5; Fig. 2f and Extended Data Fig. 5).
We also assessed whether SV and SNP diversity scaled similarly across clades. Although SV and SNP counts per isolate were generally correlated (R = 0.69, P < 2.2 × 10−16; Fig. 2g), deviations were observed. Wild clades, particularly the Chinese wild group, the species’ ancestral population19, deviate from this correlation, with fewer SVs than expected based on the number of SNPs, or more SNPs than expected based on the number of SVs (Fig. 2g). A similar trend in the Alpechín clade probably reflects SNP inflation due to introgression. By contrast, domesticated clades (French dairy, beer and wine) showed an excess of SVs, suggesting that SVs contributed to rapid adaptation during domestication.
Overall, our assemblies enabled a near-complete view of SV diversity in S. cerevisiae, revealing a landscape shaped by both population structure and adaptive processes.
A complete gene-based pangenome
The comprehensive analysis of high-quality genome assemblies enabled a complete reconstruction of the S. cerevisiae gene-based pangenome that delineates the exhaustive catalogue of genes that are present in the species. Across the population, we identified 8,541 gene families (hereafter referred to as genes), including 2,199 absent from the reference genome (Methods). Gene counts per isolate ranged from 6,438 to 6,814 (average of 6,651; Supplementary Fig. 4). The pangenome consists of 5,047 core genes shared by all isolates and 3,494 accessory genes with variable presence. Accessory genes were further classified into soft core (1,263 genes present in >90% of isolates), dispensable (2,102 genes in 0.001–90%) and private (129 genes unique to one isolate) categories (Fig. 3a). The high proportion of core and soft core genes (73.9%) indicates moderate gene content variation, consistent with a closed pangenome typical of many eukaryotes14,18,25,46,47,48. The genes captured by our population represent 99.5% of the species estimate (Fig. 3b and Supplementary Table 7), demonstrating the high completeness of our defined gene-based pangenome.
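The frequency categories used for the gene-based pangenome can be sketched as a simple classifier. The thresholds follow the text (core: all isolates; soft core: more than 90% of isolates; private: a single isolate; dispensable: everything else). The function name and the toy presence counts are assumptions for illustration; the final lines only check the arithmetic of the reported category counts.

```python
def classify_gene(n_present: int, n_isolates: int) -> str:
    """Assign a gene family to a pangenome frequency category."""
    if n_isolates <= 0 or not 1 <= n_present <= n_isolates:
        raise ValueError("need 1 <= n_present <= n_isolates")
    if n_present == n_isolates:
        return "core"
    if n_present == 1:
        return "private"
    if n_present / n_isolates > 0.90:
        return "soft core"
    return "dispensable"


# Hypothetical presence counts across 20 isolates (not study data).
toy_presence = {"geneA": 20, "geneB": 19, "geneC": 7, "geneD": 1}
toy_categories = {g: classify_gene(n, 20) for g, n in toy_presence.items()}

# Consistency check on the counts reported in the text.
reported = {"core": 5047, "soft core": 1263, "dispensable": 2102, "private": 129}
total_genes = sum(reported.values())
core_and_soft_share = (reported["core"] + reported["soft core"]) / total_genes
print(total_genes, f"{core_and_soft_share:.1%}")  # prints "8541 73.9%"
```

The reported categories sum to the 8,541 gene families, and core plus soft core genes account for the stated 73.9%.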
Fig. 3: Gene-based pangenome.
a, Distribution of the frequency of genes in the population. Colours correspond to different frequency categories (core, soft core, dispensable and private), and pie charts represent the number of genes in each category. b, Rarefaction curves of the number of genes for pan, core and accessory genomes. c, Distribution of gene location along chromosomes. Colours represent frequency categories. A large introgression event found in strain CPN produces a private gene signature between 424 and 590 kb on chromosome 7. d, Inferred origin of genes constituting the different frequency categories. e, Distribution of the gene length per origin. Dashed vertical lines represent the median value for each origin. Letters discriminate groups between which a two-sided Mann–Whitney–Wilcoxon test with FDR correction is significant with P < 0.05. P < 2.6 × 10−8 (A versus B), P < 1.8 × 10−31 (A versus C), P < 3.2 × 10−24 (B versus C). f, Number and origin of genes involved in the gene signature of each clade. Stripes indicate candidate origin, whereas the absence of pattern indicates a confident origin (Methods). g, Presence of MEL genes in the population. The gene tree was built from multiple sequence alignment of all genes of the pangenome associated with an alpha-galactosidase activity, in addition to the S. paradoxus and S. mikatae homologous genes. The inner tree represents the 1,086 isolates of our study and was built using a neighbour-joining strategy on SNP markers.
Core and accessory genes display distinct genomic and functional characteristics. Accessory genes are highly enriched in subtelomeric regions (two-sided Fisher’s exact test, odds ratio = 0.03, P < 2.2 × 10−16; Fig. 3c and Supplementary Table 8), further reflecting the genomic variability of these regions. Using previously generated transcriptomic data20, we found that core genes are more highly expressed than accessory genes (Supplementary Fig. 5), consistent with findings in other species14,18,49,50. Functional enrichment analyses confirmed that core genes are involved in essential biological processes (Supplementary Fig. 6 and Supplementary Table 9).
To investigate the origin of non-reference genes, we aligned novel gene sequences to a curated eukaryotic database (Supplementary Table 8 and Methods). Among the 2,199 novel genes, 1,233 (56.1%) showed highest similarity to close Saccharomyces relatives, suggesting introgression, and 358 (16.3%) were most similar to non-Saccharomyces species, which is indicative of horizontal gene transfers (HGTs). Another 516 genes (23.5%) showed low identity to reference homologues but aligned best to S. cerevisiae, suggesting rapid evolution, and were classified as fast-evolving genes. The remaining 92 genes (4.2%) lacked significant similarity and may represent de novo gene birth51. These four categories represent the bulk of the accessory genome (Fig. 3d), all sharing features such as subtelomeric localization and low expression (Extended Data Fig. 6). Gene length varies across categories, with introgressed genes being similar in size to reference genes, while HGTs, fast-evolving, and de novo genes tend to be shorter52 (Fig. 3e).
Gene content variation is structured by population. Clustering isolates by gene presence/absence reveals strong population stratification (Extended Data Fig. 7). Enrichment analyses showed widespread introgression across clades, with notably high levels in Alpechín, Mexican agave, and French Guiana isolates, confirming past hybridization events19,53,54 (Fig. 3f and Supplementary Table 10). HGTs were also identified in wine isolates, consistent with previous reports19,55, and partially shared with the Mixed Origins 1 clade, suggesting post-acquisition intraspecific gene flow.
While overall functional content remains conserved across isolates, some introgressed genes appear to confer novel traits. For example, we identified seven introgressed MEL genes encoding alpha-galactosidase activity, allowing growth on melibiose (Supplementary Table 8). These genes, present in phylogenetically distant clades and closely related to homologues from Saccharomyces paradoxus and Saccharomyces mikatae, are likely to represent parallel acquisitions (Fig. 3g), contributing to convergent functional adaptations.
SVs drive broad trait associations
The genome assemblies of more than 1,000 isolates enabled a comprehensive catalogue of genetic diversity, adding 44,804 SVs to the 1.4 million SNPs and 56,086 indels (<50 bp) identified previously (Methods). This resource complements previous phenotypic data spanning 241 colony growth traits (used here as organismal trait proxies), and 8,150 molecular traits, including transcriptomic and proteomic measurements19,20,21,22 (Fig. 1a). Including SVs and indels alongside SNPs increased trait heritability estimates by 14.3% on average (0.36 versus 0.41; Supplementary Fig. 7 and Supplementary Table 11), in line with earlier reports8,56. More importantly, this dataset enables GWASs at single-variant resolution, allowing for direct analysis of the phenotypic effects of SNPs, indels and SVs.
Using a linear mixed model57, we identified 7,768 significant associations linking 3,717 traits to 4,564 QTL (Fig. 4a and Supplementary Table 12), with 3,471 SNP-QTL, 230 indel-QTL and 863 SV-QTL. This corresponds to 6.5%, 10.5%, and 19.8% of tested SNPs, indels and SVs, respectively (Fig. 4b), revealing a strong enrichment of SV-QTL (two-sided Fisher’s exact tests with false discovery rate (FDR) correction, P = 6.6 × 10−161 and 4.9 × 10−20 versus SNPs and indels). SV-QTL also show greater pleiotropy, affecting 2.82 traits on average compared with 1.45 for SNP-QTL and 1.34 for indel-QTL (Fig. 4c; two-sided Wilcoxon tests with FDR correction, P < 10−14). Pleiotropic QTL, which are associated with more than one trait, account for 48.6% (419) of SV-QTL, compared with only 20.3% (2,766) and 21.3% (49) of SNP-QTL and indel-QTL, respectively (two-sided Fisher’s exact test, P < 10−14).
Fig. 4: A large catalogue of genome-wide associations.
a, Distribution of 4,564 QTL detected along the genome. The type of the leading variant involved is colour-coded. b, Proportion of each type of variant among QTL (inner circle) and the total set of common variants (outer circle). The bar plot indicates the odds ratio of the QTL in reference to the total set of variants. Letters discriminate groups between which a two-sided Fisher’s exact test with FDR correction is significant with P < 0.05. P = 1.9 × 10−12 (A versus B), P = 6.6 × 10−161 (A versus C), P = 4.9 × 10−20 (B versus C). c, Distribution of the number of traits associated per QTL, coloured by variant type. The dashed vertical line indicates the average number of traits. d, Number of traits associated depending on the position of QTL along the genome. QTL hotspots (associated with 20 traits or more) are highlighted. Colour and orientation correspond to the type of QTL.
SV-QTL are enriched in subtelomeric regions (Fig. 4a; two-sided Fisher’s exact test, P = 0 and 6.2 × 10−50 versus SNP-QTL and indel-QTL), in line with their known genomic location (Fig. 2d). Additionally, SVs contribute disproportionately to QTL hotspots: 15 SVs are each associated with at least 20 traits, compared with just 3 SNPs and no indels (Fig. 4d). One major SV-QTL hotspot involves a recombination-driven fusion of the ALD2 and ALD3 genes, associated with 66 expression and 30 growth traits. This SV arose independently multiple times, producing five alternate coding sequences (Extended Data Fig. 8) and is strongly enriched in Beer and French dairy isolates (two-sided Fisher’s exact tests with FDR correction, P = 7.06 × 10−30 and 7.45 × 10−18), underlining a possible positive selection in these specific environments.
Effect size estimates show that indel-QTL have the largest average effect (6.0 × 10−2), followed by SNP-QTL (3.9 × 10−2) and SV-QTL (3.4 × 10−2) (Supplementary Fig. 8). Indels exhibit significantly stronger effects than SNPs (1.5×, two-sided Mann–Whitney–Wilcoxon test, P = 3.4 × 10−12) and SVs (1.8×, two-sided Mann–Whitney–Wilcoxon test, P = 3.9 × 10−17), despite their lower pleiotropy. For molecular traits, QTL were classified as local or distant relative to the affected gene (Extended Data Fig. 9). We identified 2,131 local and 5,208 distant associations. Local QTL exhibit significantly higher effect sizes than distant QTL (6.2 × 10−2 versus 3.0 × 10−2; two-sided Mann–Whitney–Wilcoxon test, P = 3.4 × 10−12; Extended Data Fig. 10a), a pattern that holds across all variant types (Extended Data Fig. 10b). Notably, indels are strongly enriched for local associations, with 54.5% of indel-QTL classified as local, compared to 28.8% for SNPs and 26.2% for SVs (Extended Data Fig. 10c; two-sided Fisher’s exact tests with FDR correction, P < 10−18). This higher proportion of local QTL among indels likely contributes to their overall greater effect size relative to SNPs and SVs.
Distinct phenotypic effects of SV types
The precise characterization of SVs has enabled further investigation into the phenotypic effects of the different types of SVs. We identified 615 CNV-QTL, 192 deletion-QTL, 54 insertion-QTL and 2 translocation-QTL. Of the three common inversions present in our dataset, none was associated with phenotypic variation. The limited number of associated translocations prevents any comparison of their phenotypic effect with other types of SVs. Associated SVs constitute 20.9% of the total deletions, 19.2% of CNVs and 13.5% of insertions. This finding indicates an enrichment of QTL in deletions and CNVs in comparison with insertions (two-sided Fisher’s exact test, P = 9.4 × 10−3 and 0.026, respectively). Deletion-QTL exhibit an average effect size of 4.1 × 10−2, which is 1.2-fold that of CNV-QTL (3.3 × 10−2; two-sided Mann–Whitney–Wilcoxon test, P = 3.5 × 10−4) and 2.2-fold that of insertion-QTL (1.9 × 10−2; two-sided Mann–Whitney–Wilcoxon test, P = 3.3 × 10−9) (Supplementary Fig. 9). In addition, associated deletions have a local effect in 25.0% of the cases, which is analogous to the 25.4% of local associations for CNVs but lower than the 47.7% of local associations for insertions (two-sided Fisher’s exact test, P value = 9.4 × 10−4). Overall, these results reveal the limited phenotypic effect of insertions in comparison with other SVs, as evidenced by a reduced fraction of QTL and a diminished effect size. Insertions appear to be largely restricted to local effects and act less frequently in trans.
We further aimed to assess the difference in phenotypic effect between SVs related and unrelated to TE sequences. Among common SVs, 13.1% of the TE-related SVs were found to be associated with the variation of at least one trait, which is similar to the 13.6% of non-TE-related SVs (two-sided Fisher’s exact test, P value = 0.91). Unlike SV-QTL in general, TE-related SV-QTL are never located within subtelomeric regions, which is expected given the scarcity of TEs in these regions (95.6% of all TE-related SVs are located outside subtelomeric regions). QTL involving TE and non-TE-related SVs exhibit an average effect size of 2.41 × 10−2 and 2.43 × 10−2, respectively, which represents minimal variation (two-sided Mann–Whitney–Wilcoxon test, P value = 0.026). Overall, TE-related SVs exhibit a phenotypic effect similar to that of other SVs.
Complexity differs across trait types
A key strength of this dataset is the inclusion of both molecular and organismal phenotypes within the same population, allowing direct comparison of their genetic architectures. We identified 4,444 QTL for molecular traits and 168 for organismal traits, averaging 0.9 and 1.7 QTL per trait, respectively (Fig. 5a). This suggests that organismal traits are probably genetically more complex, involving a larger number of contributing loci (two-sided Mann–Whitney–Wilcoxon test, P = 1.3 × 10−8). By contrast, QTL for molecular traits showed significantly higher effect sizes on average (3.9 × 10−2 versus 2.7 × 10−2; two-sided Mann–Whitney-Wilcoxon test, P = 8.1 × 10−12) (Fig. 5b).
Fig. 5: Different genetic architectures of molecular and organismal traits.
a, Distribution of the number of QTL identified per trait. The dashed lines represent the average number of QTL associated per trait. The type of trait (growth or molecular) is colour-coded. b, Effect size of QTL associated with variation of molecular and growth traits. The P value was computed using a two-sided Mann–Whitney–Wilcoxon test (****P = 8.1 × 10−12). The middle bar of the box plots corresponds to the median; the upper and lower bounds correspond to the third and first quartiles, respectively. The whiskers correspond to the upper and lower bounds 1.5 times the IQR. n denotes the number of associations. c, Proportion of the different types of variants found within all variants, molecular and growth QTL. d, Graphical representation of the Minigraph pangenome. The path of the linear reference genome is indicated in orange. Segments are coloured according to their nucleotide sequence identity with the reference genome to highlight non-reference sequences.
The type of associated variants also differs. SV-QTL make up 18.6% and 41.1% of the total QTL for molecular and organismal traits, respectively, both enriched relative to the 7.4% frequency of SVs among common variants (two-sided Fisher’s exact test, P = 2.4 × 10−118 and 2.6 × 10−33). However, the enrichment is stronger for organismal traits (5.6-fold) than for molecular traits (2.5-fold), suggesting that SVs have a more prominent role in shaping complex organismal phenotypes (Fig. 5c and Supplementary Fig. 10).
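The fold-enrichment values quoted above follow directly from the reported percentages, as the short worked calculation below shows. The function and constant names are illustrative assumptions; the numeric inputs are the percentages given in the text.

```python
def fold_enrichment(share_among_qtl: float, share_among_variants: float) -> float:
    """Ratio of a variant type's share among QTL to its share among all variants."""
    if share_among_variants <= 0:
        raise ValueError("share_among_variants must be positive")
    return share_among_qtl / share_among_variants


# Percentages reported in the text.
SV_SHARE_COMMON_VARIANTS = 7.4   # SVs among common variants
SV_SHARE_MOLECULAR_QTL = 18.6    # SV-QTL among molecular-trait QTL
SV_SHARE_ORGANISMAL_QTL = 41.1   # SV-QTL among organismal-trait QTL

molecular_fold = fold_enrichment(SV_SHARE_MOLECULAR_QTL, SV_SHARE_COMMON_VARIANTS)
organismal_fold = fold_enrichment(SV_SHARE_ORGANISMAL_QTL, SV_SHARE_COMMON_VARIANTS)
print(f"molecular: {molecular_fold:.1f}-fold, organismal: {organismal_fold:.1f}-fold")
# prints "molecular: 2.5-fold, organismal: 5.6-fold"
```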
These findings highlight distinct genetic architectures: organismal traits tend to involve more, we