Main
In 1964, Anthony Epstein, Yvonne Barr and Burt Achong observed actively replicating viral particles from Burkitt lymphoma, discovering the virus that now bears their names: the Epstein–Barr virus (EBV)5. EBV was subsequently recognized as the first known human oncogenic virus6, the cause of infectious mononucleosis[6](https://www.nature.com/articles/s41586-025-10020-2#ref…
Main
In 1964, Anthony Epstein, Yvonne Barr and Burt Achong observed actively replicating viral particles from Burkitt lymphoma, discovering the virus that now bears their names: the Epstein–Barr virus (EBV)5. EBV was subsequently recognized as the first known human oncogenic virus6, the cause of infectious mononucleosis6, and an agent in developing and exacerbating multiple autoimmune diseases7. Despite these wide-ranging pathogenic roles, EBV infection is nearly ubiquitous, infecting >90% of adults worldwide, with most individuals remaining asymptomatic8. EBV primarily transmits via saliva, infecting oral epithelial cells, spreading to B cells and establishing persistent infections in the human host that can last for a lifetime9. Why clinical outcomes of EBV infection—ranging from asymptomatic infection to severe disease—vary so widely remains poorly understood. The most severe manifestation, EBV-triggered cancers, collectively account for 130,000–200,000 annual deaths worldwide10. By contrast, immunocompetent individuals may harbour latent EBV within peripheral memory B cells, where the virus expresses a minimal gene program4,11. As with other herpesvirus infections, EBV can reactivate sporadically or in response to acute stressors or host immunosuppression, resulting in expanded viral reservoirs and potentially lethal clinical complications11,12. This vast phenotypic spectrum following acute and chronic infection underscores extensive individual variability, which can be partially attributed to host genetic variation1,2,3. However, genetic association studies of common infections with complex phenotypes such as EBV have been underpowered owing to small cohort sizes13, motivating new approaches to study infection, viral persistence and host–phenotype associations.
Beyond its role in human disease, EBV has been instrumental in advancing human population genetics research. EBV can transform primary B lymphocytes from healthy individuals into immortalized lymphoblastoid cell lines (LCLs)14, critical resources that historically enabled long-term storage and large-scale genetic studies15. Consequently, immortalized LCLs were the primary material used in the HapMap16 and 1000 Genomes15 projects to profile genetic variation across the globe. These foundational efforts laid the groundwork for more expansive population-scale cohorts, such as the UK Biobank (UKB)17 and All of Us (AOU)18, which include sequencing and phenotypic data from hundreds of thousands of individuals: a scale that can interrogate the genetic underpinnings of complex phenotypes following infection.
As modern biobanks perform whole genome sequencing (WGS) on peripheral blood rather than on LCLs, we posited that EBV DNA reflecting EBV persistence in circulating cells could be captured and quantified in these libraries. Building on recent work that quantifies viral nucleic acids in petabyte-scale datasets to infer host–virus interactions retrospectively19,20, we sought to develop a scalable computational pipeline to estimate individual-level EBV DNA loads. By leveraging the inclusion of the EBV genome as a contig in the human reference genome, we demonstrate how ordinarily excluded sequencing reads can be reanalysed to create a new molecular feature for genome-wide and phenome-wide association studies at petabase-scale.
Biobank WGS data harbour EBV DNA
To address the high levels of EBV DNA present in the LCL-derived libraries—including those used in foundational efforts such as the 1000 Genomes Project15—the EBV genome (chrEBV, NC_007605) was incorporated into the human reference genome assembly (as of hg38)21. This alternative contig was designated as a sink for viral nucleic acids to improve variant calling and interpretation in the human genome21. We hypothesized that reads mapping to this contig from blood-derived WGS data would reflect persistence of EBV DNA following a primary infection. We thus extracted all sequencing reads from the aligned .cram files that mapped to chrEBV, enabling a quantification of the per-individual, per-base EBV DNA coverage across 490,560 individuals in the UKB (Fig. 1a,b). In addition to regions with low coverage corresponding to poor mappability, we identified two distinct loci with disproportionately high read depths, which corresponded to repetitive sequences (Fig. 1b and Methods). As these regions were covered at levels that were orders of magnitude higher than the median coverage of the viral contig, we reasoned that they would confound EBV DNA quantification. As an orthogonal measure of past infection, we used EBV serostatus ascertained on a subset of 9,687 individuals from the UKB, noting that EBV seropositivity requires sufficient antibody titres for at least two of four EBV antigens. We observed a nominal association between presence of EBV DNA and seropositivity when including these two repetitive regions (Fisher’s exact test odds ratio = 1.2, P = 0.03; Fig. 1c and Methods); however, discarding these two repetitive regions revealed that >40% of the UKB cohort only had aligned reads in these regions, and masking these regions before binarizing individuals resulted in a markedly stronger association (Fisher’s exact test odds ratio = 14.6, P = 1.7 × 10−26; Fig. 1c). The next strongest association of detected EBV DNA with serostatus was for human immunodeficiency virus (HIV) 1 (Fisher’s exact test odds ratio = 4.6, P = 0.0023), consistent with reports of EBV DNA detection in blood following immunosuppression due to HIV 22 (Fig. 1d). Taken together, our sequencing-based approach readily scales to hundreds of thousands of individuals: a more than a 100-fold increase in sample size compared with serology-based association studies13.
Fig. 1: Retrospective quantification of EBV DNA in the UKB.
a, Schematic of the approach. WGS libraries from peripheral blood were aligned to the hg38 reference genome, which contains an EBV reference contig (chrEBV). Reads mapping to chrEBV were extracted for downstream analyses. b, Sum of per-base read coverage of high-confidence EBV-mapping reads. Two repetitive regions with inflated coverage are noted in red and purple (following IUPAC convention: h = A/C/T, y = C/T, m = A/C, r = A/G; subscripts indicate the number of repeats). c, Association summary of individual-level serostatus and EBV DNA quantification with variable region masking. Statistical test: two-sided Fisher’s exact test. Error bars represent 95% confidence intervals for the point effect estimate (centre dot). d, Summary of EBV DNA detection with serostatus of 22 infectious agents. Statistical test: two-sided Fisher’s exact test. HHV-6 was partitioned into strains HHV-6A and HHV-6B. e, Empirical cumulative distribution of detected EBV DNA across the entire cohort (85.7% of individuals had no detectable (n.d.) EBV DNA; 0.3% had EBV DNA at a copy number of 1+ EBV genome per 1,000 human cells). f, Top 100 individuals on the basis of EBV DNA copy number, from the circled population in e. g, Association between EBV seropositivity and EBV DNA detection thresholds at variable levels. Statistical test: two-sided Fisher’s exact test. Sample size of full UKB cohort: n = 490,560. The images in panel a were adapted from ref. 19, Springer Nature Ltd.
To further interpret our metric, we estimated the EBV DNA copy number per 1,000 cells by normalizing read counts between viral and host genome sizes. At the extremes of the distribution, 85.7% of individuals had no detectable bias-corrected EBV DNA, whereas 0.3% exhibited EBV DNA copy numbers of at least one viral genome per 1,000 human cells, including one individual with at least one EBV genome per 100 human cells (Fig. 1e,f). This range—which is derived from predominantly healthy individuals—is consistent with past quantitative polymerase chain reaction (qPCR)-based measurements of EBV DNA copy numbers in healthy populations, which reported upper ranges of one copy per 200 cells23 (Methods). Using serostatus as a ground truth and accounting for standard covariates, a cutoff of 1.2 viral genomes per 104 human cells yielded the strongest concordance with seropositivity (odds ratio = 82.2, P = 2.2 × 10−16), noting that all donors with detectable EBV DNA had at least one positive response against the four tested EBV antigens (Fig. 1g, Extended Data Fig. 1b and Methods). We classified 47,452 (9.7%) individuals with EBV DNAemia (defined as detectable EBV DNA levels >1.2 genomes per 104 cells) for subsequent analyses (Extended Data Fig. 1c). As the proportion of individuals with EBV DNAemia (9.7%) is lower than the seropositivity rate (>90%) in the UKB, we interpret our metric as capturing the subset of individuals with the highest levels of circulating EBV DNA at the time of WGS sampling. Indeed, simulated data from a censored log-normal distribution of per-person EBV DNA levels closely approximated the empirical distribution (Extended Data Fig. 1d–g and Methods).
Next, we sought to better understand the profile of individuals with EBV DNAemia in the UKB cohort (Supplementary Table 1). Annotating each individual by birth location, we observed a higher proportion of EBV DNAemia in individuals born in more northern latitudes in the UK, consistent with previous reports of increased EBV infection further from the equator24 (Extended Data Fig. 1h). We also observed a sex-biased (higher in male) and age-associated increase in EBV DNAemia rates, the latter consistent with EBV serology (Extended Data Fig. 1i). EBV DNAemia rates also differed among genetic ancestries and had a modest increase among individuals taking immunosuppressive medications (Extended Data Fig. 1j,k and Supplementary Table 2). We performed parallel analyses in the AOU cohort, spanning 245,394 individuals with blood-derived WGS (Extended Data Fig. 2a and Methods). Results from the independent analyses of AOU replicated key attributes of the UKB data, including a clear repetitive region that was similarly masked, yielding 11.9% of individuals with EBV DNAemia and consistent associations with age, sex, genetic ancestry and prescription of immunosuppressive drugs (Extended Data Fig. 2b–f).
As primary EBV infection occurs earlier in life25, we hypothesized that donors with EBV DNAemia probably reflected a previous infection that persisted until sampling. Conversely, lytic herpesvirus infection would be concomitant with viral transcription, including in peripheral blood11,19. As the UKB and AOU collected DNA but not RNA-seq data, we reprocessed bulk and single-cell RNA-seq from the OneK1K26 and Genotype-Tissue Expression27 consortia to assess for EBV transcription in peripheral blood cells (Supplementary Note 1 and Supplementary Fig. 1a,f). Across these 1,663 donors, we detected minimal evidence of EBV transcripts, suggesting that the vast majority of blood-derived EBV DNA from our cohorts probably reflects latent infection, which is concordant with the lack of EBV lytic reactivation gene expression detected in peripheral B cells of healthy individuals28 (Supplementary Note 1 and Supplementary Fig. 1c,g). Furthermore, analysed saliva-derived WGS samples for another set of 48,899 AOU participants showed a markedly higher rate of EBV DNAemia (50.9%), reflecting a distinct environmental and cellular reservoir for EBV (Supplementary Note 2 and Supplementary Fig. 2a–d). Together, our findings demonstrate that EBV DNA can be retrospectively quantified from existing large-scale WGS datasets with reproducible signals, including sequences collected from different anatomical sites.
Associations with complex traits
Next, we investigated whether our WGS-enabled measure of EBV DNAemia could serve as a biomarker of complex disease. To assess this, we performed a phenome-wide association study (PheWAS) to map systematic outcomes catalogued via International Classification of Diseases, 10th revision (ICD-10) codes with EBV DNAemia as an exposure (Methods). Using individuals from the UKB of predominantly non-Finnish European (NFE) genetic ancestry (n = 426,563) as a discovery cohort, we tested for the association between EBV DNAemia and 13,290 binary phenotypes as well as 1,931 quantitative phenotypes, following our previously described PheWAS workflow29 (Supplementary Table 3 and Methods). Among binary traits, we observed 271 significant (P < 3.3 × 10−6) ICD-10 codes, including well-established associations with splenic diseases and Hodgkin lymphoma. We also observed significant associations with rheumatoid arthritis11, chronic obstructive pulmonary disease (COPD30) and systemic lupus erythematosus28, each of which has been previously associated with EBV using orthogonal approaches (Fig. 2a). Past case studies have anecdotally reported associations between EBV infection and various conditions in small-scale studies relative to our population-scale cohorts. Our analyses reinforced evidence for these relationships, including chronic ischemic heart disease (odds ratio = 1.19, P = 2.8 × 10−18), acute kidney failure (odds ratio = 1.21, P = 1.4 × 10−16), depressive episodes (odds ratio = 1.19, P = 4.0 × 10−26) and stroke (odds ratio = 1.20, P = 6.1 × 10−13). We emphasize that these associations may also reflect a general state of immunosuppression, and additional work is required to determine which of these associations are causal rather than correlational.
Fig. 2: EBV DNA is a biomarker of complex traits.
a, Summary of associations between EBV DNAemia and binary phenotypic traits in the UKB, with individuals of broadly NFE ancestry. The horizontal dashed line represents the phenome-wide significant P-value threshold (3.3 × 10−6). The y axis is capped at –log10(P) = 50, with those exceeding this threshold plotted at 50. Union phenotypes are plotted to reduce redundancy. Selected traits are highlighted based on biological interest. Statistical test: Wald test from logistic regression model (two-sided). b, Effect sizes for matching ICD-10 codes between the UKB and AOU, with individuals of European ancestry in AOU. Dotted lines at odds ratio = 1 represent null associations. Sample size: n = 426,563 UKB NFE individuals; n = 133,578 for AOU European ancestry individuals. The colours are the same as in a.
Statistically significant quantitative associations (n = 156) included leukocyte count, neutrophil percentage, smoking pack years, telomere length and compositions of omega-3 fatty acids, consistent with previous observations of lipogenesis induction following EBV infection31 (Extended Data Fig. 3a). We also detected an association with malaise and fatigue (odds ratio = 1.27, P = 2.06 × 10−10), noting that EBV has long been hypothesized as a risk factor for myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS)32. We also identified significant associations with decreased levels of phosphatidylcholine (P = 2.9 × 10−9) and total choline (P = 5.9 × 10−9), consistent with metabolic studies in patients with ME/CFS33. Our results reinforce a potential relationship between EBV and ME/CFS that warrants further examination.
We sought to replicate these associations using the AOU cohort (Fig. 2b). As the underlying electronic health record data vary between cohorts, we focused on 141 significantly associated ICD-10 codes in the UKB that had sufficient representation in AOU (minimum n = 24 cases). Of these, 87 (62%) were replicated in AOU (P < 0.05; odds ratio directionally concordant with UKB statistics), resulting in a set of traits that we examined more closely (Methods, Supplementary Table 3 and Supplementary Note 3). These phenotypes included rheumatoid arthritis, COPD and lung neoplasms, as well as less-established phenotypes such as peripheral vascular disease, emphysema and tachycardia, some of which may be attributable to the association between smoking and EBV reactivation. We also considered two traits that were previously linked to EBV but were not significant in either cohort (Methods and Extended Data Fig. 3b). For multiple sclerosis, we observed nominal associations that did not survive multiple testing corrections (UKB, odds ratio = 2.1, P = 0.019; AOU, odds ratio = 0.73, P = 0.0087), consistent with a past report that did not detect a significant association using ICD-based viral exposure measures34. For gammaherpesviral mononucleosis, a primary manifestation of EBV infection, the association was in the expected direction (UKB, odds ratio = 2.55, P = 0.23; AOU, odds ratio = 5.86, P = 1.1 × 10−6) but underpowered owing to low sample sizes (n = 11 in the UKB, n = 42 in AOU), noting infectious mononucleosis primarily affects younger individuals.
In addition to phenotypes that replicated between cohorts, we noted instances of neurological conditions that were nominally associated with EBV DNAemia in the UKB but lacked sufficient case numbers to be assessed in AOU (P < 0.05; Extended Data Fig. 3c). These included all-cause dementia (odds ratio = 1.16, P = 6.0 × 10−5); rarer phenotypes such as neuromyelitis optica, which is a rare autoimmune disease with similar clinical presentation as multiple sclerosis (odds ratio = 6.31, P = 2.7 × 10−3); and acute disseminated demyelination (odds ratio = 6.31, P = 5.3 × 10−3). Although further work is required to implicate the role of EBV in these phenotypes, our scalable approach enables systematic association studies across a broad range of conditions, including rare diseases for which very large cohorts such as the UKB and AOU are essential.
Genetic variation underlies EBV DNAemia
Past studies have established that manifestations of viral infections are a polygenic trait controlled by dozens of loci in the human genome13,35. Hence, we reasoned that genetic variation would similarly influence the variable degree of EBV persistence across the population. We thus conducted a genome-wide association study (GWAS) on individuals of NFE ancestry (~94% of the UKB cohort) to identify loci associated with EBV DNAemia (Methods). Using array-based genotype data followed by imputation from 426,563 NFE individuals in the UKB, we identified 22 independent loci associated with EBV DNAemia that reached genome-wide significance (P < 5 × 10−8; Fig. 3a, Methods and Supplementary Table 4). Overall, the single nucleotide polymorphism (SNP)-based heritability (h**2) determined by LDscore regression (LDSC) was 2.21% (± 0.85%) with limited evidence of genomic inflation (λGC = 1.1; LDSC intercept = 1.03 ± 0.008; Supplementary Note 4). Partitioned heritability analyses showed an enrichment at conserved and non-coding loci marked by enhancer and super-enhancer annotations (Extended Data Fig. 4a), consistent with other complex trait associations36.
Fig. 3: Genetic architecture of EBV DNAemia.
a, Manhattan plot summarizing the genome-wide association statistics for EBV DNAemia for 426,563 individuals of predominantly NFE ancestry in the UKB. Genes proximal to genome-wide significant associations (P < 5 × 10−8) are annotated. Statistical test: likelihood ratio test from logistic regression model (two-sided). b, Summary of protein-altering variants in HLA genes. c, Replication of UKB-associated variants in the AOU European ancestry cohort. The Pearson correlation coefficient of variant effect sizes is noted. d, PCA and projection of EBV summary statistics on complex immune-mediated diseases via cupcake38. An asterisk indicates a significant PC projection score after multiple testing correction. Statistical test: Z-test (two-sided).
The strongest associations emerged near human leukocyte antigen (HLA) genes on chromosome 6 that encode the major histocompatibility complex (MHC) class I and II proteins (Fig. 3a). Major histocompatibility complex molecules are critical in differentiating between self and non-self proteins and have been widely associated with autoimmune traits13,37. We conducted an exome-wide association study (ExWAS), which included protein-coding variants observed at least six times (that is, with a minor allele count of greater than five) in the NFE cohort29, to refine association signals at the MHC and other associated loci (Methods). Associations at alleles assayed by either technology were concordant (Extended Data Fig. 4b). Among the 1,102 variants significantly associated with EBV DNAemia (P < 5 × 10−8, cases > 20), 686 were missense variants spanning 148 genes. These missense variants facilitated the annotation of putative causal variants at 9 of the 22 implicated loci (Fig. 3a, Methods and Supplementary Table 4). Consistent with our GWAS results, the protein-coding variants with the largest effect sizes were near the MHC locus, where 148 MHC class I, 113 MHC class II and 7 non-classical HLA protein-altering variants were significantly associated with EBV DNAemia (Fig. 3b).
We used the AOU cohort to replicate the biological plausibility and pleiotropy of genetic associations in the UKB. Repeating our GWAS framework on n = 131,938 people with European (EUR) ancestry in AOU for 12,099,305 common variants (1% minor allele frequency), we observed concordant associations at implicated loci. Globally, 40,675 variants were genome-wide significant (P < 5 × 10−8) in the UKB and passed quality control filters in AOU, noting that many were from the HLA region. Of these genome-wide significant variants, 91.4% of variants were replicated in the AOU GWAS (nominal P < 0.05; odds ratio concordant; Fig. 3c). Further, 12 of the 19 (63%) assayed index GWAS variants replicated in the AOU GWAS (nominal P < 0.05, odds ratio directionally concordant; Supplementary Table 4). These included loci near well-established immune-regulatory genes, including CTLA4, EOMES, LNPEP, PTPN22 and SLAMF7 (Supplementary Note 4 and Supplementary Fig. 3c–f). Although these analyses primarily focused on individuals of European ancestry, additional meta-analyses from the diverse ancestries of the UKB and AOU revealed an additional 23 loci surpassing genome-wide significance, including variants near BIM, GSDMB, TERT, BCL11A, MYC and* CD160* (Supplementary Note 5 and Supplementary Fig. 4a,b). Together, our results indicate that persistence of EBV DNA is a polygenic trait that can be quantified from multiple population-scale WGS datasets, and loci underlying EBV DNAemia are reproducible across continents.
Given the well-described associations between EBV and immune-mediated phenotypes, we sought to systematically evaluate similarities between the genomic architectures of EBV DNAemia and immune-mediated diseases (IMDs). We used cupcake38, a framework that accounts for the shared components of non-HLA genetic architecture across 13 IMDs using a shrinkage approach to adjust for linkage disequilibrium, allele frequency and differential sample size via principal component analysis (PCA), where principal component (PC)1 captures an IMD genetic axis characterized by autoantibody seropositivity38. Consistent with our PheWAS and past reports of EBV pathogenesis, we observed a cupcake PC1 score that reflects a shared component of genetic architecture between EBV DNAemia and autoimmune diseases such as rheumatoid arthritis, systemic lupus erythematosus and T1D from both the UKB (P = 1.3 × 10−5) and AOU (P = 4.8 × 10−7; Fig. 3d). Noting that initial EBV infections are most prevalent in adolescence25 and generally precede onset of autoimmunity39, our data refine a potential model in which a component of genetic architecture shared by seropositive IMDs may first determine the persistence of EBV after primary infection that, in turn, may trigger complications characteristic of disease.
As a contrast to our blood-derived EBV DNAemia biomarker, we conducted analogous genome-wide analyses of binarized EBV serology (seropositivity) from the UKB40 and saliva-derived EBV DNAemia from AOU18. Serostatus from 8,669 individuals of NFE ancestry resulted in zero genome-wide significant loci (Supplementary Note 5 and Supplementary Fig. 4c). Furthermore, although we observed markedly higher levels of EBV DNA in AOU saliva WGS samples, including 51% DNAemia in 32,745 saliva EUR ancestry donors, the only genome-wide significant association for these individuals under three different candidate models was at the MHC locus (Supplementary Note 2 and Supplementary Fig.2f,g). We attribute the disparity in the number of significant loci to the underlying biology of EBV DNAemia in peripheral blood, distinct from the site of transmission11. Whereas EBV serostatus reflects a history of any past infection, which is largely independent of genetic variation, EBV DNAemia identifies the subset of infected individuals with the highest levels of persistent viral DNA.
Cell type and pathway level analyses
To further evaluate the role of EBV DNAemia-associated immunomodulatory genes, we examined the expression of the 148 genes that harboured at least one significant ExWAS variant as a signature score in a multi-modal dataset of 211,000 human peripheral blood mononuclear cells (PBMCs)41 (Fig. 4a). As expected, the EBV signature score was enriched in B cells, consistent with the known viral tropism of EBV infection and latency8,9 (Fig. 4b,c). This enrichment was corroborated in the non-coding genome, as genome-wide significant variants were enriched in B cell-specific accessible chromatin from fluorescent-activated cell-sorting-isolated populations profiled via the Assay for Transposase Accessible Chromatin using sequencing42 (ATAC-seq; Extended Data Fig. 5a and Methods). We also observed a similar enrichment in subsets of antigen-presenting cells, particularly conventional dendritic cells (Extended Data Fig. 5b,c), although dendritic cells are most likely not directly infected by EBV43. To resolve the potential biological processes linked to this genetic architecture, we performed gene set analyses using the Gene Ontology biological processes and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses. Among the Gene Ontology biological-process-enriched terms, the top pathways involved antigen processing and presentation, MHC protein complex and assembly, and regulation of T cells (Fig. 4d, Extended Data Fig. 5d and Supplementary Table 5). From the KEGG enrichments, we observed disease-associated annotations that included viral myocarditis, rheumatoid arthritis, herpes simplex virus 1 (HSV-1) infection and, reassuringly, EBV infection (Extended Data Fig. 5e). As the strong linkage disequilibrium on chromosome 6 could drive this association, we refined these enrichments by further removing all HLA-associated genes or all genes on chromosome 6 (Methods). Regardless, antigen processing and presentation remained the most enriched term in our Gene Ontology biological processes analyses, underscoring the critical role of this pathway in controlling viral infection and clearance (Fig. 4e,f). Together, these analyses indicate that B cells and antigen-presenting cells are the primary cell types affected by the genetic architecture of EBV DNAemia, with viral antigen processing and presentation predominantly influencing the emergence and persistence of EBV DNA, a characterization consistent with the known roles of these immune cells in regulating herpesvirus infections.
Fig. 4: EBV DNAemia gene associations at cell and pathway resolutions.
a, Uniform manifold approximation and projection (UMAP) embedding of 211,000 PBMCs. The broad cell type label (Azimuth L1; ref. 41) annotates major populations. NK, natural killer; cDC, conventional dendritic cells. b, Module score of EBV ExWAS associations, highlighting populations with the highest enrichment. c, Summary of EBV ExWAS scores in major populations. *** Indicates statistical significance (P < 2.2 × 10−16 relative to held-out cell types; P-value threshold reflects machine precision; one-sided Wilcoxon rank-sum test; n = 211,000 cells). Boxplots: centre line, median; box limits, first and third quartiles; whiskers, 1.5× interquartile range. d, Summary of the top five enriched Gene Ontology biological process terms identified by gene set enrichment analysis of EBV DNAemia-associated genes. e, Same as d except excluding annotated HLA genes. f, Same as d but excluding all genes mapping to chromosome 6.
HLA-EBV peptide binding predictions
Although the HLA locus is pervasively associated with immune-mediated complex traits, these associations are challenging to resolve owing to allelic diversity, heterogeneity between human populations and lack of well-estimated (auto-) antigens that can mediate complex trait manifestation37. In our setting, the EBV proteome defines the set of candidate antigens variably presented by these alleles that would, in turn, variably yield EBV DNAemia. Hence, we reasoned that explicit modelling of HLA variation could refine our understanding of genetic variation underlying viral persistence.
To assess this, we first assembled four-digit HLA alleles across all donors in the UKB and AOU with NFE or EUR ancestry (Extended Data Fig. 6a and Methods). Using these per-donor genotypes and similar covariates to our GWAS, we performed a multivariate regression to assess whether each HLA allele was associated with variable rates of EBV DNAemia (Methods). We identified a total of 42 associated HLA alleles, including 18 class I and 24 class II alleles (nominal P < 0.05 in both cohorts; Extended Data Fig. 6b,c, Supplementary Table 6 and Methods). One of the strongest risk alleles for EBV DNAemia was HLA-A*03:01 (UKB, P = 0.0060; AOU, P = 9.63 × 10−25), previously linked with increased risk of multiple sclerosis[44](https://www.nature.com/articles/s41586-025-10020-2#ref-CR44 “Fogdell-Hahn, A., Ligers, A., Grønning, M., Hillert, J. & Olerup, O. Multiple sclerosis: a modifying influence of HLA class I genes in an HLA class II associated autoimmune disease: MHC class I associations in multiple sclerosis. Tissue Antigens 55,