Main
Short tandem repeats (STRs) of 1–6 bp of DNA are mutable genomic elements with diverse influences on cellular and organismal phenotypes2. Common STR polymorphisms, which have been characterized in human populations using short-read3 and long-read4 sequencing, influence gene expression5 and complex traits6,7. Rare STR expansions cause more than 60 genetic disorders1. The allelic diversity that underlies these effects is generated by frequent mutation: around 1 million polymorphic STRs in the human genome generate around 50–60 de novo repeat-length mutations per offspring8,9,10. Germline mutation rates of specific STRs vary widely11 and are influenced by repeat motif sequence, interruptions of pure repeats and number of repeat units8,9,10,11,12 as well as genetic variation in DNA-repair genes9.
STRs are also prone to somatic mutation2, and lifelong somatic expansion in at least one STR locus can lead to disease. Recently, genome-wide association studies (GWASs) have provided insights into the molecular mechanisms underlying somatic repeat instability13 by finding common genetic modifiers of the timing or progression of Huntington’s disease (HD)14,15,16,17,18,19, which is caused by inherited alleles in which a CAG repeat in the HTT gene is longer than 35 CAGs; these genetic modifiers were found in many DNA-repair genes that affect the stability of DNA repeats14,15,16,17,18,19. Neurodegeneration in HD was subsequently found to be caused by somatic expansion of this repeat beyond a high threshold of about 150 CAG repeats20. The genetic-modifier studies, so far of up to 16,640 persons with HD, have provided early clues toward a few potential therapeutic targets for slowing or halting somatic expansion of DNA repeats; however, the number of such potential targets is so far modest.
Whole-genome sequencing (WGS) of biobank cohorts offers opportunities to study repeat instability in much larger sample sizes than previously possible. Here we analysed repeat instability at 356,131 polymorphic repeat loci using short-read WGS data from the blood-derived DNA of 490,416 participants in UK Biobank (UKB)21 and 414,830 participants in All of Us (AoU)22. To do so, we developed several computational techniques, overcoming challenges in estimating the length and instability of DNA repeats from large numbers of short WGS reads23. These methods enabled us to characterize allele-specific expansion and contraction rates of common repeats, identify genetic influences on somatic repeat expansion and identify associations of expanded repeats with diseases.
CAG-repeat expansions in the UKB
We began by analysing CAG trinucleotide repeats, which we could efficiently ascertain from biobank sequencing data and which cause many progressive, neurodegenerative repeat-expansion disorders1,2. We identified UKB participants with long CAG-repeat alleles (≥45 repeat units) by analysing WGS data for 151 bp sequencing reads composed entirely or almost entirely of CAG-repeat units (in-repeat reads (IRRs); Extended Data Fig. 1a). Such reads were easily extractable, as nearly all of them had been aligned to the TCF4 CAG-repeat sequence by bwa24 (Supplementary Fig. 1). For each participant with one or more IRRs, we determined the locus or loci from which the IRRs originated by identifying mate sequences that mapped near one of 1,159 commonly polymorphic CAG-repeat loci3.
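As an illustration of what "composed entirely or almost entirely of CAG-repeat units" could mean operationally, the following minimal Python sketch scores a read against a perfect tiling of the motif on either strand. The function names and the 0.9 threshold are assumptions made for this example, not the criteria used in the study, and the sketch ignores insertions and deletions within the read.

```python
# Illustrative sketch only: names and threshold are hypothetical.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (non-ACGT characters are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def repeat_fraction(read: str, motif: str = "CAG") -> float:
    """Best fraction of bases matching a perfect tiling of `motif`, over all phase offsets."""
    read = read.upper()
    if not read:
        return 0.0
    k = len(motif)
    best = 0
    for phase in range(k):
        matches = sum(
            1 for i, base in enumerate(read) if base == motif[(i + phase) % k]
        )
        best = max(best, matches)
    return best / len(read)


def is_in_repeat_read(read: str, motif: str = "CAG", min_fraction: float = 0.9) -> bool:
    """Flag a read as an in-repeat read if either strand is mostly pure repeat."""
    forward = repeat_fraction(read, motif)
    reverse = repeat_fraction(reverse_complement(read), motif)
    return max(forward, reverse) >= min_fraction
```

Checking both strands matters because a CAG repeat sequenced from the opposite strand appears as CTG units in the read.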
The vast majority of CAG-repeat expansions in the UKB occurred at only a few loci: 18 autosomal CAG-repeat sequences in the human genome were expanded to at least 45 repeat units in at least five UKB participants (Extended Data Fig. 1b and Supplementary Table 1). Three repeat loci were expanded in thousands of UKB participants—CA10 (137,673 participants), TCF4 (42,004) and ATXN8OS (7,736)—together accounting for 97% of all observed expansions beyond 45 repeat units. Most of these repeats (15 out of 18) were in transcribed genomic regions, consistent with the idea that transcription contributes to repeat instability25 (Supplementary Table 2). For 9 out of the 18 repeats, expanded alleles are known to be pathogenic1.
To study the mutability of these repeats, we measured the lengths of common, short alleles of each repeat (≤30 repeat units) by analysing sequencing reads that spanned repeat alleles, focusing on 15 repeat loci that passed additional filters (Extended Data Fig. 1 and Supplementary Table 1). These analyses recovered repeat-length distributions consistent with previous analyses26 (Extended Data Fig. 1b).
Germline instability of common CAG repeats
We first analysed germline instability of these repeats, using the large UKB cohort to obtain high-resolution estimates of germline mutability (providing context for analyses of somatic mutability). To estimate allele-specific intergenerational expansion and contraction rates of each repeat, we analysed length discordances among alleles belonging to genomic tracts inherited identical-by-descent (IBD) from shared ancestors, building on IBD-based analyses of single-nucleotide mutations27,28,29 (Fig. 1a). We validated this approach using two complementary methods (Supplementary Fig. 2).
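The core idea of an IBD-based estimate can be sketched as follows, under strong simplifying assumptions that are ours rather than the authors': each pair of haplotypes sharing an IBD tract comes with a known number of separating meioses, at most one mutation has occurred per pair, and no ancestral allele is imputed, so expansions and contractions are not distinguished. The `IbdPair` class and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IbdPair:
    """Two repeat alleles carried on haplotypes that share an IBD tract (hypothetical record)."""
    allele_a: int    # repeat units on the first haplotype
    allele_b: int    # repeat units on the second haplotype
    meioses: float   # assumed number of meioses separating the two haplotypes


def discordance_rate_per_generation(pairs: Iterable[IbdPair]) -> float:
    """Discordant pairs divided by total meioses: a crude per-generation mutation rate."""
    pairs = list(pairs)
    total_meioses = sum(p.meioses for p in pairs)
    if total_meioses <= 0:
        raise ValueError("total number of meioses must be positive")
    discordant = sum(1 for p in pairs if p.allele_a != p.allele_b)
    return discordant / total_meioses
```

For example, one discordant pair among three pairs separated by 20, 20 and 10 meioses gives 1/50 = 0.02 mutations per generation under these assumptions.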
Fig. 1: Germline and somatic instability of common CAG-repeat alleles.
a, Germline mutation rates were estimated by analysing discordance rates among alleles inherited within IBD tracts shared by pairs of UKB participants. Ancestral alleles were imputed from more-distantly shared haplotypes. b, Per-generation rates of germline expansion (+1 repeat unit) and contraction (−1 repeat unit) of GLS and TCF4 repeat alleles, estimated in the UKB. c, The analytical strategy for estimating somatic mutation rates by detecting and filtering out reads that are likely to reflect PCR artifacts introduced during sequencing. During PCR-based bridge amplification on a flow cell, a DNA fragment is clonally amplified into a cluster of colocalized DNA molecules. A PCR stutter error results in a polyclonal cluster containing a mixture of DNA molecules with and without the error. If the molecules containing the error constitute the majority of the cluster, the sequencing read generated from the cluster (reflecting the majority base at each position within the read) will contain the error, but the heterogeneity of the cluster will reduce base qualities at positions within the read that mismatch between molecules with and without the error. d, The rates of somatic expansion of GLS and TCF4 repeat alleles (that is, the fractions of blood cells in which an allele has expanded by +1 repeat unit), stratified by age in AoU. e, Somatic mutation rates in the UKB plotted against germline mutation rates for GLS and TCF4 repeat alleles. The error bars show the 95% confidence intervals (CIs). Sample sizes are provided in Supplementary Table 3.
Across all 15 CAG-repeat loci, intergenerational mutation rates increased with allele length, rising to 0.5–0.9% per generation for single-repeat-unit expansions of the longest common alleles of repeats in GLS, DMPK and ATXN8OS (Extended Data Figs. 1b and 2). The average mutation rate per locus ranged from 8.2 × 10−5 to 9.5 × 10−4 (Supplementary Table 3). These rates are relatively high for trinucleotide repeats8 and exceed the genome-wide average for STRs (around 5 × 10−5 per haplotype per generation)8,9,10. Repeat loci tended to either expand more often than contract (particularly so for ATXN8OS and GLS) or to have similar expansion and contraction rates (Extended Data Figs. 1b and 2). Interruptions of repeat sequences (that is, intrarepeat sequence variants) greatly stabilized alleles: a common 18-repeat TCF4 allele containing an interruption in its ninth repeat unit exhibited a 135-fold (54–336, 95% CI) lower expansion rate compared with the uninterrupted 18-repeat allele, and an interruption in the second-to-last repeat unit of a 19-repeat GLS allele decreased the expansion rate 3.7-fold (1.9–7.2, 95% CI) (Fig. 1b). These results corroborate previous observations that repeat interruptions stabilize the expansion of pathogenic alleles30,31,32 and quantify the strength of such effects in the germline.
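The reported intervals are roughly symmetric on a multiplicative scale, which is what a confidence interval constructed on the log scale produces. The sketch below shows one standard way to obtain such an interval for a ratio of two rates from event counts; the text does not state how the intervals were computed, so the Poisson approximation and the function name are assumptions.

```python
import math


def rate_ratio_ci(events_a, exposure_a, events_b, exposure_b, z=1.96):
    """Rate ratio (a versus b) with a Wald interval built on the log scale.

    Assumes Poisson-distributed event counts, so the standard error of the
    log ratio is approximately sqrt(1/events_a + 1/events_b).
    """
    ratio = (events_a / exposure_a) / (events_b / exposure_b)
    se_log = math.sqrt(1.0 / events_a + 1.0 / events_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper
```

With an interval of this form, the geometric mean of the bounds equals the point estimate; for the values above, the square root of 54 × 336 is about 134.7 and the square root of 1.9 × 7.2 is about 3.7.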
Somatic expansion of common CAG repeats
These high rates of germline instability led us to wonder whether common alleles of some repeats might be sufficiently unstable in blood cells for somatic length-change mutations to be ascertainable in short-read WGS data. Identifying such mutations is challenging because polymerase slippage during PCR amplification can spuriously alter repeat lengths33,34,35. Such ‘PCR stutter’ errors are unavoidable during Illumina sequencing by synthesis, which uses PCR for bridge amplification of DNA fragments36. However, we realized that this PCR error mode tends to produce predictable patterns of reduced base quality scores within sequencing reads, enabling us to detect and exclude most reads with artefactual CAG length mutations (Fig. 1c and Supplementary Fig. 3). We applied this filtering strategy in the UKB to estimate repeat-specific, allele-specific somatic expansion rates, which we quantified as the average fraction of blood cells in which a given repeat allele has expanded by one repeat unit.
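A heavily simplified version of such a base-quality filter might look like the sketch below. It assumes standard Phred+33 quality strings and flags a read when too many bases from the start of the repeat onwards have low quality; the thresholds, the choice of window and the function names are illustrative assumptions, not the filter used in the study.

```python
def decode_phred33(quality_string: str) -> list[int]:
    """Convert a FASTQ/SAM quality string (Phred+33 encoding) to integer scores."""
    return [ord(char) - 33 for char in quality_string]


def looks_like_stutter(
    qualities: list[int],
    repeat_start: int,
    low_quality: int = 20,          # hypothetical cutoff for a "low" base quality
    max_low_fraction: float = 0.2,  # hypothetical tolerated fraction of low-quality bases
) -> bool:
    """Flag a read whose bases from `repeat_start` onwards are often low quality.

    The rationale is that a mixed cluster yields conflicting base signals
    wherever molecules with and without the length change disagree.
    """
    window = qualities[repeat_start:]
    if not window:
        return False
    low = sum(1 for q in window if q < low_quality)
    return low / len(window) > max_low_fraction
```

Reads flagged in this way would be excluded before counting apparent one-repeat-unit expansions, trading some sensitivity for a lower false-positive rate.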
For 4 out of the 15 CAG repeats (in TCF4, GLS, DMPK and ATN1), we detected significant increases in somatic single-repeat-unit expansion rates with age (Extended Data Fig. 3). These findings were replicated in AoU, in which the wider age range of participants (aged 18 to 90+ years) revealed clear increases in fractions of blood cells containing somatic expansions with increasing age and with increasing allele length (Fig. 1d and Extended Data Fig. 4). TCF4 repeats were the most somatically unstable: individuals carrying alleles with 25 or more repeat units typically exhibited somatic expansion in more than 1% of blood cells by the age of 55 years (Fig. 1d). We did not observe age-associated contraction of any of the 15 repeat loci.
Comparing these estimates of somatic one-repeat-unit expansion rates with our estimates of intergenerational mutation rates showed that the relative (blood/germline) rates of CAG-repeat expansion varied severalfold across repeat loci (Fig. 1e). The TCF4 repeat exhibited the greatest somatic instability in blood but was relatively stable in the germline, whereas the GLS repeat displayed the opposite behaviour (Fig. 1e), as did the DMPK repeat (Extended Data Figs. 1b, 2 and 4). These results align with observations that somatic instability of pathogenic repeat expansions is highly tissue-specific, perhaps due to differences in transcription or trans-acting factors25,37,38,39,40,41. Consistent with the former hypothesis, the four repeats for which we detected instability in blood are in genes with significantly higher expression in blood (Wilcoxon rank-sum test, P = 0.034; note that all P values reported in this Article were calculated using two-sided statistical tests; Supplementary Fig. 4).
Somatic expansion of long TCF4 CAG repeats
The high somatic expansion rates of TCF4 repeat alleles—even those of shorter lengths—suggested the possibility that long TCF4 alleles (≥45 repeat units) might be sufficiently unstable in blood to allow individual-level phenotyping of somatic expansion using short-read WGS data. This would provide an opportunity to learn about instability of long repeats from somatic expansions in very many people—potentially enabling the identification of genetic modifiers of repeat instability13,14,15,16,17,18,19,42,43—as long TCF4 alleles are common (42,004 carriers in the UKB; Extended Data Fig. 1b).
A barrier to analysing repeat expansions from short-read WGS data is that alleles exceeding the length of a sequencing read (151 bp) cannot be directly sized. However, short-read WGS data do permit rough estimation of the length of a long allele by counting in-repeat reads44 (Extended Data Fig. 1a). In an individual who is mosaic for somatic expansions that vary across cells, this approach estimates the average length of expanded alleles.
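To make the logic of IRR counting concrete, consider a simple uniform-coverage model (an assumption for illustration, not the estimator of ref. 44). If reads of length r start at a rate of d/r per base per haplotype, where d is the haploid depth, a repeat of length L bp offers L − r + 1 start positions for reads lying wholly within it, so the expected IRR count is (L − r + 1) × d/r. Inverting this gives the sketch below; all names and default values are hypothetical.

```python
from typing import Optional


def estimate_repeat_length_bp(
    n_irr: int, read_len: int = 151, haploid_depth: float = 15.0
) -> Optional[float]:
    """Invert E[n_irr] = (L - read_len + 1) * haploid_depth / read_len for L.

    Returns None when no in-repeat reads were seen, because the allele may
    simply be shorter than one read.
    """
    if haploid_depth <= 0:
        raise ValueError("haploid_depth must be positive")
    if n_irr <= 0:
        return None
    return n_irr * read_len / haploid_depth + read_len - 1


def to_repeat_units(length_bp: float, motif_len: int = 3) -> float:
    """Convert a length in base pairs to a number of repeat units."""
    return length_bp / motif_len
```

Because the IRR count is a small, noisy number at typical WGS depths, estimates of this kind are coarse for any one individual, which is consistent with the description of the estimate as rough.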
We analysed somatic expansion of long TCF4 alleles in the UKB and AoU by applying this approach with two methodological improvements. First, to control for variation in lengths of inherited TCF4 alleles, we used imputation to calibrate each individual’s allele length against measurements from other individuals sharing the same inherited allele (in lieu of longitudinal measurements). Stratifying individuals by imputed allele length showed that somatic expansion accelerates rapidly with TCF4 allele size, reaching around 1 repeat unit per year for 100-repeat alleles (Extended Data Fig. 5a). Second, to reduce noise in estimates of long TCF4 allele lengths, we devised a better-powered metric based on the number of sequenced DNA fragments derived from a highly expanded repeat (Extended Data Figs. 5b and 6a,b). Long-read sequencing of blood-derived DNA from AoU participants (n = 1,027, of whom 28 had long TCF4 alleles) corroborated TCF4 allele-length estimates from short-read WGS and demonstrated extensive mosaicism of expanded alleles41,45 (Extended Data Fig. 5c).
Genetic modifiers of TCF4 repeat expansion
Genome-wide association analysis of an optimized TCF4 somatic-expansion phenotype (Extended Data Fig. 6c) in 48,448 UKB and AoU participants identified seven loci at which common variants modulate TCF4 repeat expansion in blood (P < 5 × 10−8; Fig. 2a and Supplementary Table 4). Four loci—at MSH3 (P = 2.0 × 10−52), FAN1 (P = 8.5 × 10−29), ATAD5 (P = 4.9 × 10−12) and PMS2 (P = 3.0 × 10−8)—overlapped DNA-repair and DNA-damage-response genes that were recently implicated in somatic expansion of the HTT CAG repeat in blood18 (Fig. 2a). The three other modifier loci included GADD45A (P = 2.9 × 10−8), which encodes a growth arrest and DNA damage protein that binds to R-loops46.
Fig. 2: Genetic influences on somatic expansion of TCF4 repeat alleles in blood.
a, Genome-wide associations with somatic instability of long TCF4 repeat alleles in the blood (top, meta-analysed across the UKB (n = 40,231) and AoU (n = 8,217)) compared with genetic associations with somatic instability of pathogenic HTT repeat alleles in the blood (bottom; from ref. 18). The TCF4 locus is shown in grey because these associations could reflect imperfect control for inherited TCF4 allele length. b, Comparison of the effect sizes of variants at MSH3, PMS2, FAN1 and ATAD5 for somatic expansion of HTT repeats in the blood (quantified by the somatic expansion ratio; SER18) versus TCF4 repeats in the blood. c, Analogous comparison for variant effect sizes for hastening of an HD clinical landmark of cognitive decline (symbol digit modalities test, SDMT)18. In each plot, variants within 1 Mb of the lead variant for TCF4 somatic expansion are plotted in black if they reached P < 10−5 for association with at least one of the two phenotypes; for FAN1, a subset of these variants is plotted in red or blue according to linkage disequilibrium with the two low-frequency FAN1 missense variants (r2 > 0.05). Variants with P > 10−5 for both phenotypes are plotted in light grey.
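The text does not specify how the UKB and AoU association statistics were combined. A common choice for meta-analysing two cohorts is fixed-effect inverse-variance weighting, sketched here with a two-sided normal P value; the function name and input format are assumptions for this example.

```python
import math


def inverse_variance_meta(estimates):
    """Fixed-effect meta-analysis of (beta, standard_error) pairs.

    Each cohort is weighted by 1 / se**2; returns the pooled beta, its
    standard error, the z statistic and a two-sided normal P value.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return beta, se, z, p_value
```

Under this weighting, the larger cohort (here the UKB) dominates the pooled estimate whenever its standard error is smaller.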
Comparing genetic modifiers of somatic expansion of the TCF4 and HTT CAG repeats in blood revealed both consistency and heterogeneity of effects (Fig. 2a,b). Common haplotypes at PMS2, FAN1 and ATAD5 were associated with broadly concordant effects on TCF4 and HTT repeat expansion in blood, whereas at MSH3, common haplotypes that decreased expansion of the TCF4 repeat tended to increase expansion of the HTT repeat in blood (Fig. 2b). Moreover, the strongest modifier of HTT expansion in blood—a haplotype containing a missense variant in MSH2 also implicated in germline STR mutation9—appeared not to affect TCF4 expansion (P = 0.96; Fig. 2a and Supplementary Table 5).
Similarly, comparing genetic modifiers of TCF4 repeat expansion in the blood to genetic modifiers of HD age-at-landmark phenotypes18 (which are probably regulating HTT repeat expansion in neurons) showed that, at both MSH3 and FAN1, common haplotypes that decreased expansion of the TCF4 repeat in blood appeared to increase the expansion of the HTT repeat in the brain (Fig. 2c and Supplementary Table 6). By contrast, two missense variants that reduce FAN1 activity47 appeared to increase the expansion of both the TCF4 repeat (in blood) and HTT repeat (in brain) (Fig. 2c). These results suggest that the tissue-specific instability of many trinucleotide repeats37,38,39,40,41 may arise from complex regulation of mismatch repair processes that differs across cell types18 and even across repeat loci, perhaps interacting with locus-specific differences in chromatin structure or other epigenomic properties.
We also compared modifiers of TCF4 repeat expansion in blood to loci that influence risk of Fuchs endothelial corneal dystrophy (FECD), a common age-associated eye disorder that is thought to be caused (in most cases) by expansion of the TCF4 repeat in corneal endothelial cells48,49. Notably, no modifiers of TCF4 repeat expansion in the blood overlapped with FECD risk loci50, and none of our lead variants for TCF4 blood instability (Supplementary Table 4) were associated with FECD (P > 0.15) in a recent well-powered GWAS50,51. Moreover, FECD risk conferred by long TCF4 repeats appeared to plateau for allele lengths beyond around 75 repeat units (Extended Data Fig. 7). Further work will be required to determine whether the instability-modifying genetic effects that we identified are specific to blood (which is conceivable given the very different (more extreme) dynamics of TCF4 somatic expansion in corneal endothelium41) and whether any modifiers of somatic expansion influence age at FECD onset.
Varied genetic effects on instability of 17 STRs
A much larger set of DNA repeats involves other (non-CAG) sequence motifs, and the above results motivated us to investigate their expansion. To this end, we developed a computationally efficient tool for extracting IRRs with any 2–6 bp motif from WGS read alignments and applied it to the UKB WGS data. Mapping these IRRs to 356,131 polymorphic STRs identified 154 STRs for which long repeat alleles (>150 bp) were common (>0.5% carrier frequency; Supplementary Data 1). We constructed somatic-expansion phenotypes from IRR counts for these repeats, controlling for inherited allele lengths (inferred from imputation) as before. To identify STRs with evidence of somatic expansion, we tested these phenotypes for association with age or with the MSH3 and MSH2 haplotypes that were most strongly associated with blood instability of TCF4 and HTT repeats. This approach was motivated by initial GWAS analyses on STRs with somatic-expansion phenotypes that associated with age (Supplementary Methods): in these analyses, MSH2 and MSH3 haplotypes were consistently the lead variants and sometimes associated more strongly than age, suggesting that, for some repeats, the effects of genetic modifiers, which act across an individual’s years of life (mean 56.5 years in the UKB), might be easier to detect than the effects of age differences (s.d., 8 years) on somatic expansion.
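One way to recognize IRRs of arbitrary 2–6 bp motifs is to look for the shortest period at which a read repeats itself and then reduce the repeat unit to a canonical form, so that rotations and the opposite strand (for example CAG, AGC and CTG) are counted together. The sketch below illustrates this idea only; it is not the authors' tool, and the names and the 0.9 threshold are assumptions.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif: str) -> str:
    """Lexicographically smallest rotation of the motif or its reverse complement."""
    motif = motif.upper()
    reverse = motif.translate(_COMPLEMENT)[::-1]
    rotations = [m[i:] + m[:i] for m in (motif, reverse) for i in range(len(m))]
    return min(rotations)


def detect_motif(
    read: str, min_len: int = 2, max_len: int = 6, min_fraction: float = 0.9
) -> Optional[str]:
    """Return the canonical motif of a read that is mostly periodic, else None.

    Tries periods from shortest to longest and accepts the first period k at
    which at least `min_fraction` of bases equal the base k positions earlier.
    The unit is read from the first k bases, so an error there would
    mislabel the motif; a production tool would need to be more robust.
    """
    read = read.upper()
    for k in range(min_len, max_len + 1):
        if len(read) <= k:
            break
        matches = sum(1 for i in range(k, len(read)) if read[i] == read[i - k])
        if matches / (len(read) - k) >= min_fraction:
            unit = read[:k]
            if len(set(unit)) == 1:
                return None  # homopolymer run, outside the 2-6 bp motif range
            return canonical_motif(unit)
    return None
```

Searching periods in increasing order ensures that a tetranucleotide repeat such as AAAG is not reported with a longer, redundant unit.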
These analyses identified 17 STRs for which one or more of the three tests suggested evidence of somatic instability (P < 0.0001; Fig. 3a and Supplementary Data 2). These 17 unstable STRs represented 7 distinct 2–5 bp repeat motifs. Half of these STRs were located in genes that are highly expressed in blood, while six appeared to be in untranscribed regions (Fig. 3a). At some unstable STRs, expanded alleles were very common. Long alleles of an intronic AAAG tetranucleotide repeat in ADGRE2 (carried by 49% of European-ancestry UKB participants) expanded at an average rate of 0.4 repeat units per decade, demonstrating that human genomes commonly contain repeat elements that expand as we age (Fig. 3b).
Fig. 3: Variation in genetic influences on 17 unstable STRs.
a, Genomic context, population frequencies (freq) among 420,522 unrelated European-ancestry UKB participants, associations (assoc.) with MSH2 and MSH3 variants and age, and the relative contributions of genetic modifiers of instability of 17 STRs. Prom., promoter. The relative contributions of five genetic modifier loci were estimated using local heritability analyses for STRs with sufficient signal (specifically, local heritability z-score > 2.5 for at least one of the five modifiers). b, The mean lengths of long alleles among UKB participants heterozygous for a long allele (based on imputation; from left to right, n = 155,291, 81,387, 257,934 and 240,294), stratified by age quintile and by genotype of an instability-modifying haplotype. The two STRs with the strongest age association and the two with the strongest genetic associations are shown. Error bars show the 95% CIs. c, Associations of modifier haplotypes of DNA-repair and DNA-damage-response genes with blood instability of 10 STRs (including HTT18) and with hastening of four HD clinical phenotypes18. The table cells contain z statistics from association analyses; associations with P < 0.05, 0.01 and 0.001 are shaded in green or purple depending on whether the effect size agrees or disagrees with the consensus effect direction (cons. effect dir.) for blood instability. The effect sign in each table cell corresponds to the direction of effect on repeat instability: a positive effect indicates that the alternate allele associated with increased somatic expansion or hastening of an HD clinical landmark. TFC6, a score of 6 on the 13-point total functional capacity scale; TMS, total motor score.
GWAS of somatic-expansion phenotypes for the 17 unstable STRs identified 7 loci at which common variants appear to modulate instability of these repeats in blood cells (P = 1.2 × 10−9 to 1.4 × 10−878; Extended Data Fig. 8 and Supplementary Data 3). Variants in four mismatch repair genes (MLH3, MSH3, MSH2 and PMS2) were each associated with somatic expansion of three or more STRs. The relative contributions of these genes to repeat instability varied across STRs, with MSH2 variation having greater influences on dinucleotide repeats and MSH3 variation having greater influences on STRs with longer motifs (Fig. 3a and Supplementary Data 4), consistent with MutSβ (a heterodimer of MSH2 and MSH3) having higher affinity for longer insertion–deletion loops in DNA52,53. Across a broader set of modifier haplotypes identified by fine-mapping genetic associations with optimized somatic-expansion phenotypes (see below), different modifiers appeared to influence different subsets of STRs (Fig. 3c) but generally with a consistent effect direction, with the exception of several opposite-direction effects on somatic expansion of the TCF4 repeat (Fig. 3c). Multiple modifiers were associated with opposite-direction effects on STR expansion in the blood compared with HTT repeat expansion in the brain (as inferred from the timing of HD phenotypes), consistent with recent findings18 (Fig. 3c).
Genetic determinants of AAAG-repeat expansions
Somatic expansion of AAAG repeats at two loci (at chromosome 2: 232.4 Mb and chromosome 19: 14.8 Mb) was particularly strongly shaped by inherited variation, prompting deeper analyses. Mid-length alleles of these repeats (19–26 repeat units) were sufficiently common and unstable for somatic expansions to often be directly observable from spanning reads, enabling us to construct mid-length-allele somatic-expansion phenotypes for these two STRs in the UKB and AoU. We also optimized the common chromosome 19: 14.8 Mb (ADGRE2) long-allele somatic-expansion phenotype in the UKB to increase GWAS power.
Common and low-frequency variants (minor allele frequency (MAF) > 0.1%) at 26 loci were associated with these AAAG somatic-expansion phenotypes (P = 5 × 10−8 to 2.5 × 10−1,438; Fig. [4a](https://www.nature.com/articles/s41586-025-09886-z#F