Introduction
Antimicrobial resistance (AMR) poses a growing threat to global health1. According to estimates from the Institute for Health Metrics and Evaluation, bacterial AMR was directly responsible for over 1.27 million human deaths2. By 2050, AMR will be responsible for 10 million deaths annually and result in a loss of $100 trillion from global gross domestic product[3](https://www.nature.com/art…
Introduction
Antimicrobial resistance (AMR) poses a growing threat to global health1. According to estimates from the Institute for Health Metrics and Evaluation, bacterial AMR was directly responsible for over 1.27 million human deaths2. By 2050, AMR will be responsible for 10 million deaths annually and result in a loss of $100 trillion from global gross domestic product3. Consequently, the AMR pandemic has been, and will continue to be, a critical challenge that beyond its effects on human health4. There is growing evidence that the origins of AMR may extend beyond human habitats5,6. Some antibiotic resistance genes (ARGs) originated from environmental bacteria7 and were transferred to humans via direct or indirect contact, such as through food chains8.
Soil is one of the main sources of microbes in terrestrial ecosystems9 and acts as the cornerstone of One Health framework10. Consequently, increasing attention is now being paid to the biogeography of the soil antibiotic resistome. The global distribution of the abundance of ARGs (via high-throughput quantitative PCR)11, the relative abundance of ARGs (via metagenomic analysis)12, and the diversity of soil ARG-carrying pathogens (via metagenomic analysis)13 have been mapped. Meanwhile, environmental factors14,15 and climatic seasonality11 have been regarded as the main drivers shaping soil antibiotic resistome. While the distribution and drivers of ARGs in soil seem clear, the sources of ARGs in soil and their relationship to clinical antibiotic resistance within the One Health framework remain unclear.
Indeed, it is not surprising that ARGs are present in natural soils as they are ubiquitous16 and can even be found in the Mariana Trench17, undisturbed Alaskan soil18,19, and 30,000-year-old permafrost sediments20. Multidrug efflux pumps are one of the main types of ARGs in soil14. However, it must be recognized that high abundance may not mean high risk and that only human-associated ARGs pose a serious health risk21. The One Health framework for antibiotic resistance focuses on ARGs that can cross ecological boundaries8. Therefore, identifying these “high-risk” ARGs from the presumptive ARGs is crucial for addressing this global issue with cost-effective strategies22. Zhang et al. developed an “omics-based” framework to evaluate ARG risk and identified the list of Rank I ARGs23. These Rank I ARGs were proposed based on host pathogenicity, gene mobility and human-associated enrichment, which are the focus of One Health8. Although Rank I ARGs are present in various habitats24,25, their global soil distribution, attribution, and connectivity between soil and other habitats remain unclear. This knowledge gap limits not only identifying priority areas for combating the soil antibiotic resistome, but also understanding of the relevant health risks.
Herein, we constructed a dataset that includes 3816 metagenomic datasets (including 2391 soil samples and 1425 other habitat samples) from public databases and 149 in-house data to catalog the profile and attribution of the soil antibiotic resistome (i.e., 3965 samples in summary). Meanwhile, we collected 8388 E. coli genomes (the main ARGs carrying pathogen in soil) isolated from soil, livestock, and humans (main source), to reveal the potential exchange network of soil and other habitats Rank I ARGs at the species level. Finally, we constructed global human clinical antibiotic resistance datasets to examine the relationship of antibiotic resistance between the soil and humans. We aimed to address two questions: (i) What are the sources of Rank I ARGs, and how do they change over time? (ii) How does the soil antibiotic resistome relate to the human antibiotic resistome and to human clinical antibiotic resistance?
Results
The attribution and profile for global soil antibiotic resistome
To gain a comprehensive insight into the role of the soil antibiotic resistome within the One Health framework, we compiled a dataset of 3965 metagenomic samples (including 2540 soil samples and 1425 other habitat samples) (Figs. S1 and S2, and Supplementary Data 1–2). We analyzed the ARGs using ARGs-OAP (v 3.2.2) and used the relative abundance of Rank I ARGs to assess the risk of ARGs (genes listed in Supplementary Data 3 and Text S2)26. The profile of Rank I ARGs was obtained based on the previously defined list, characterized by reported host pathogenicity, gene mobility, and enrichment in human-associated environments26. In detail, after obtained the ARGs profile for 3965 metagenomic samples, we compared them to a previously established Rank I ARGs list to identify the presence of specific variants within these samples. The Rank I ARGs profiles represented the observation of these ARGs across the detected metagenomic samples in this study, while the classification of Rank I ARGs is based on the previous reports (Supplementary Data 3)26. The SARG3.0_S database, which excluded sequences associated with transcriptional regulators (including activators and repressors), point mutations, and others, was used for similarity search annotation26. In addition, considering the controversy surrounding multidrug efflux pumps, all genes related to these pumps were excluded to avoid possible mis-annotations of ARGs. The profiles of antibiotic resistome in different habitats and different land use types (including rarefaction curves and composition of ARGs at the type and subtype level) were shown in Figs. S3 and S6. Briefly, the richness (number of subtype) and relative abundance of total ARGs (1739 subtypes, 0.13 copies per cell) and Rank I ARGs (175 gene types, 1.5 copies per 1000 cells) in soil were similar to those in wastewater treatment plant (WWTP) effluent but lower than those in livestock and human feces (Fig. 1a, b). The similar trends could also be observed after the minimum sample size for each habitat is standardized (Fig. S7). Principal Co-ordinates Analysis (PCoA) clearly separated the soil antibiotic resistome from those in other habitats (Adonis analysis, p < 0.01, Fig. 1c, d, Text S3, and Supplementary Data 4 and 5).
Fig. 1: The profile and attribution of global soil antibiotic resistome in metagenomic analysis.
a The relative abundance of total ARGs in various habitats. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, and dots indicate values of individual samples (Soil: 2540, WWTP_water: 303, Natural_sediment: 280, Marine_water: 188, Chicken_feces: 146, Human_feces: 130, Crop: 122, Sewage: 121, Natural_water: 70, Swine_feces: 31, Landfill: 22, Cattle_feces: 12). b The relative abundance of Rank I ARGs in various habitats. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, and dots indicate values of individual samples (Soil: 2540, WWTP_water: 303, Natural_sediment: 280, Marine_water: 188, Chicken_feces: 146, Human_feces: 130, Crop: 122, Sewage: 121, Natural_water: 70, Swine_feces: 31, Landfill: 22, Cattle_feces: 12). The classification of Rank I ARGs was based on a comparison with the previously reported list (Supplementary Data 3)26. c The PCoA with Bray-Curtis dissimilarity and Adonis analysis of total ARGs. A version with multiple viewing angles is available at https://github.com/Yuxiang-Zhao/ARGs/blob/main/PCoA/3D_PCoA_Total.html. The significance between different habitats could be observed in Supplementary Data 4. d The PCoA with Bray-Curtis dissimilarity and Adonis analysis of Rank I ARGs. A version with multiple viewing angles is available at https://github.com/Yuxiang-Zhao/ARGs/blob/main/PCoA/3D_PCoA_RankI.html. The significance between different habitats could be observed in Supplementary Data 5. The classification of Rank I ARGs was based on a comparison with the previously reported list (Supplementary Data 3)26. e The relative abundance of total ARGs over time. Each node represents the mean relative abundance for that year. Gray shading denotes the 95% confidence intervals. Pearson correlation (liner regression) was conducted to compare the relationship between time and relative abundance of total ARGs. f The relative abundance of Rank I ARGs over time. Each node represents the mean relative abundance for that year. Gray shading denotes the 95% confidence intervals. Pearson correlation test (two-sided) was conducted to compare the relationship between time and relative abundance and occurrence frequency of Rank I ARGs (relative abundance: p < 1.8e-05, occurrence frequency: p < 0.00023). The classification of Rank I ARGs was based on a comparison with the previously reported list (Supplementary Data 3)26. g Fast expectation-maximization for microbial source tracking (FEAST) estimating the source contribution of various habitats to the soil. Sample sizes were not normalized across habitats due to the limitations of sample numbers in some habitats. All habitat pairs were analyzed one-to-one, therefore the results indicate the proportion of shared ARGs. All habitat pairs were counted 999 times. h The presence and relative abundance of high-risk ARG subtypes detected in soil and their association with time. The relative abundance of Rank I ARGs was normalized to a range of 0 to 1 in each period. * indicates p < 0.05, R2 is calculated from the linear model, and r is calculated from the Pearson correlation (liner regression). The correlation was statistically tested using both Pearson’s correlation test (two-sided) and a linear model (two-sided). TEM-156: p < 0.041, erm(B): p < 0.026, ANT(2”)-Ia: p < 0.043, AAC(3)-VIa: p < 0.033, lsa(E): p < 0.041, ANT(6)-Ia: p < 0.025, mph(B): p < 0.02, tet(M): p < 0.016, dfrB1: p < 0.034. Period A: 2008–2010, Period B: 2011 –2013, Period C: 2014–2016, Period D: 2017–2019, Period E: 2020–2021.
However, some significant differences were observed between total ARGs and Rank I ARGs in soil. The PCoA results for Rank I ARGs showed less intergroup dispersion compared to total ARGs, implying greater similarity in Rank I ARGs across soil samples (Fig. 1c, d). The trends of total ARGs and Rank I ARGs over time also showed significant differences (Fig. 1e, f). The relative abundance of total ARGs was time-independent (r = 0.08, p > 0.05, Fig. 1e). In contrast, both the relative abundance of Rank I ARGs (r = 0.89, p < 0.001) and their occurrence frequency (presence of Rank I ARGs samples / total number of samples for the corresponding year) (r = 0.83, p < 0.001) showed significant increases over time (Fig. 1f). To eliminate the effects of sample size and collection site across different years on the observed trends, we divided the data into five distinct periods (details in Methods). To further ensure the reliability of the trends, we performed data normalization on both year (Fig. S8) and period (Fig. S9a). The changes in the relative abundance of total ARGs and Rank I ARGs were assessed under consistent data volume (Fig. S9b), consistent continental origin combinations (Fig. S9c), and consistent land use type combinations (Fig. S9d). All these results were consistent with previous findings, indicating that the relative abundance of total ARGs was time-independent, but ARGs risk showed a significant increase over time. Furthermore, after representative ARGs (biomarker ARGs identified by LEfSe, Text S4, and Supplementary Data 6), we found no unique Rank I ARGs in soil. These results suggested that soil Rank I ARGs largely overlapped with ARGs in other habitats (i.e., human-related), and soil might be both the sink for Rank I ARGs.
To confirm this hypothesis, the fast expectation-maximization for microbial source tracking (FEAST) analysis was carried out to reveal the sharing of ARGs between soil and other habitats27 (Fig. 1g). On average, soil shared 60.1% of total ARGs and 50.9% of Rank I ARGs with other habitats. Specifically, human feces (75.4%), chicken feces (68.3%), WWTP effluent (59.1%), and swine feces (53.9%) were attributed the most to the Rank I ARGs in soil. Moreover, regarding other habitats, soil shared 77.6%–91.9% of ARGs with crop surfaces, natural water, natural sediments, and marine water. After differentiating between different Rank I ARGs subtypes, we observed a consistent increase in mph(A), APH(3’)-Ia, AAC(6’)-le-APH(2”)-la, ANT(6)Ia, aadA, APH(6)-Id, aadA10, mef(B), and APH(3”)-Ib, along with the first detection of NMD-19 in soil samples in 2021 (Fig. 1h). Thus, given the growing risk posed by Rank I ARGs in soil and their close relationship with those found in livestock and human feces, soil might be a key node for controlling the spread of ARGs under the One Health framework. Overall, more attention should be paid to Rank I ARGs in soil, as they may act as a sink/source for clinical ARGs in hotspot regions.
Profile of pathogens carrying ARGs in soils
We annotated ARG-carrying contigs (assembled from the 2540 soil samples) and their taxonomy to reveal soilborne pathogens carrying ARGs (pARGs) and Rank I ARGs (pRank I). Only the prokaryotic pathogens listed in two published reference pathogen lists, the PHI-base (www.phi-base.org/) and Catalog of Human Transmitted Pathogenic Microorganisms 2023, were focused (Supplementary Data 7). Before further analysis, ARGs carried by plasmids were excluded using geNomad28, and the ARGs carried by chromosomes were focused, because of the difficulty in identifying the host of plasmids. Firstly, we investigated the distribution of pARGs and pRank I. The results showed that pARGs were present in over 29.08% of the samples, and pRank I was detected in 5.57% of the samples (Fig. 2a). Among the collected samples, higher relative abundance of pARGs and pRank I were observed in the eastern coastal regions of North America and Asia. At the contig level, Rank I ARGs accounted for 0.64% of the total relative abundance of ARGs. A total of 3.55% of the ARGs were pARGs, of which 0.06% were pRank I (Fig. 2b). A total of 103 prokaryotic pathogens were detected, hosting 67 gene types of ARGs across 19 species. These species belonged to Proteobacteria (13 species), Firmicutes (3 species), and Actinobacteriota (3 species) (Fig. 2c). All of these pathogens carrying Rank I ARGs were capable of infecting humans, with catI, catB3, and eptA being the predominant pRank I types (Fig. 2d). Our results suggested that E. coli was the most abundant pathogen carrying ARGs on chromosomes, accounting for 46.0%, followed by a plant pathogen Pectobacterium carotovorum (27.2%) (Fig. 2e). Overall, pARGs, especially pRank I, had a high occurrence frequency in soil and E. coli was the predominant prokaryotic pathogen in soil.
Fig. 2: The profile of pathogens carrying ARGs in soil.
a Distribution of prokaryotic pathogens carrying ARGs (pARGs) and Rank I ARGs (pRank I) across the global. ARGs carried by plasmids were excluded using geNomad28, and only ARGs carried by chromosomes were included. Color represented the relative abundance of pARGs (gray: not detected, orange 0–10 copies per 1000 cells, red >10 copies per 1000 cells). The points for the islands are not marked on the map. The shape represents whether or not this pARGs and pRank I are detected (solid squares: the absence of pARGs and pRank, hollow squares: detection of pARGs only, hollow circle: detection of pRank I). The log transformation was performed on copies / 1000 cells. The outer circle of the pie chart represented the proportion of sample points for which pARGs were detected, and the inner circle represented the proportion of sample points for which pRank I were detected. b The proportion of pARGs and pRank I. The outer circle of the pie chart represented the proportion of Rank I ARGs to total ARGs at the contigs level. The inner circle of the pie chart represented the proportion of pARGs and pRank I to total ARGs at the contigs level. c Overall characteristics of prokaryotic pathogens in soil, including ARG carriers and non-carriers, taxonomy (phylum), level of risk, and pathogenicity. Only ARGs carried by chromosomes were included. d The profile of pARGs. Rank I ARGs were highlighted in red. e The profile of pathogens carrying ARGs on chromosomes.
The profile for E. coli isolation genomes
As E. coli was one of the most important pathogens carrying ARGs in soil, it was selected as a representative of soil prokaryotic pathogens. We collected the genomes of 9700 E. coli strains isolated from soil, human, chicken feces, cattle feces, and swine feces (E. coli from specific habitats were referred to the habitat source E. coli, such as soil source E. coli) from the NCBI database. After filtering, 1312 genomes were excluded due to a lack of isolation information or misclassification (Fig. S10). A total of 8388 genomes, isolated between 1977 and 2023, were used for further study (Figs. S10 and S11 and Supplementary Data 8). Meanwhile, given their importance in soil (Fig. 1), we focused on Rank I ARGs and mobile Rank I ARGs (MRank I ARGs), which have mobile genetic elements (MGEs) located within 5 kb upstream and downstream of the Rank I ARGs22. The same database (SARG3.0_S) used for metagenomic ARGs analysis was conducted, and genes related to multidrug efflux pumps were also excluded. Firstly, we constructed the rarefaction curves for E. coli isolate genomes, which showed that the rarefaction curves reached the plateau for the studied habitats and clearly reflected the diversity of ARGs in these habitats (Fig. S12).
We focused on the temporal trends in the mean copy number of Rank I and MRank I ARGs, the occurrence frequency of MRank I ARGs (number of Rank I / MRank I gene types), and their richness in soil source E. coli (Fig. 3a, b). The results showed a significant increase in all parameters over time (p < 0.001, r = 0.69–0.85), indicating a growing potential ARGs risk associated with soil source E. coli each year. Even after rarefying based on the same sample size, the consistent trend remained (Fig. S13). To better understand ARGs risks across different habitats, we compared Rank I ARGs E. coli from various sources (Fig. 3c, d). We further performed period-based subsampling, aligned with metagenomic groupings, to ensure that each habitat and period had equal data volumes, eliminating the effect of sample size. Results confirmed that the mean copy numbers and richness of Rank I and MRank I ARGs for per soil E. coli genome showed a highly significant increasing trend over time, similar to that of swine and human E. coli (Fig. 3c, d). E. coli from chicken, on the other hand, showed an increasing and then decreasing trend, while those from cattle exhibited no correlation with time. Among all habitats, E. coli genomes from soil showed the greatest variation in both mean copy numbers and richness when comparing post-2020 with pre-2007. Mean copy numbers increased by 2.6 to 6.9 times, while richness rose by 1.9 to 5.5 times. A total of 30 Rank I ARGs (belonging to 30 subtypes and 9 types) were observed in soil E. coli, conferring resistance to the antibiotic classes of aminoglycoside, beta-lactam, tetracycline, and macrolide-lincosamide-streptogramin (MLS) (Fig. 3e). Among all Rank I ARGs and MRank I ARGs, the highest relative abundance were observed for eptA and tet(B), which were detected in all periods. Overall, these results indicated that, in the typical pathogens, the risk of ARGs in soil has also increased year by year, consistent with the metagenomic analysis.
Fig. 3: Rank I ARGs copy number and richness in the genomes of 8388 E. coli isolates.
a Mean of E. coli Rank I ARGs (per genome), mobile Rank I ARGs (MRank I ARGs), and the occurrence frequency of MRank I ARGs (genomes existing MRank I ARGs/ total genomes) over time in soil (1404 genomes). Each point represented the mean of all samples for that year. Gray shading denotes the 95% confidence intervals. Pearson’s correlation (two-sided) with loess regression for data smoothing was used to test the statistical relationship between variables (mean copy number for Rank I ARGs: p < 0.00018, mean copy number for MRank I ARGs: p < 6.7e-07, occurrence frequency of MRank I ARGs: p < 1.9e-07). b Richness of E. coli Rank I ARGs (per genome) and MRank I ARGs over time in soil (1404 genomes). The richness was calculated in gene level. Each point represented the average of all samples for that year. Gray shading denotes the 95% confidence intervals. Pearson correlation (loess regression) was conducted to obtained the p value and r. Richness refers to the number of Rank I and MRank I ARGs gene type. Pearson’s correlation (two-sided) with loess regression for data smoothing was used to test the statistical relationship between variables (diversity for Rank I ARGs: p < 0.00015, diversity for MRank I ARGs: p < 1.3e-07). c Mean copy numbers of Rank I ARGs and MRank I ARGs in different habitats across various periods (using 8,388 genomes collected from various habitats). 999 rounds of random sampling based on the minimum habitat and period counts were conducted to eliminate the impact of sample size differences across different years on the results. 38 genomes per habitat per period were included in the analysis. Each point represented the average of all samples for that year. “Before 2007” represented genomes collected prior to 2007, and the remaining period groupings were consistent with that in metagenomic analysis (i.e., Period A: 2008–2010, Period B: 2011–2013, Period C: 2014–2016, Period D: 2017–2019, Period E: After 2020). In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, dots indicate the average value from each random sampling. Significant comparisons (two-sided t test) between different temperature and different groups were calculated and detailed p value and T stat were provided in the Supplementary Data 9. d Richness of Rank I ARGs and MRank I ARGs in different habitats across various periods (using 8388 genomes collected from various habitats). 999 rounds of random sampling were performed and 38 genomes per habitat per period were included. Significant comparisons (two-sided t test) between different temperature and different groups were calculated and detailed p value and T stat were provided in the Supplementary Data 10. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, dots indicate the average value from each random sampling. Richness refers to the number of Rank I and MRank I ARGs gene type. e The Rank I ARGs copy number, MRank I ARGs copy number, and the occurrence frequency of MRank I ARGs (genomes existing MRank I ARGs/ total genomes) in soil collected from different period. The Rank I ARGs copy number, MRank I ARGs copy number, and the occurrence frequency of MRank I ARGs were normalized to a range of 0 to 1 in each period. R2 is calculated from the linear model, and r is calculated from the Pearson correlation (liner regression). * indicated p < 0.05, ** indicated p < 0.01. The correlation was statistically tested using both Pearson’s correlation test (two-sided) and a linear model (two-sided). Rank I ARGs: catII: p < 0.048, erm(42): p < 0.048, mef(B): p < 0.0074, AAC(3)-IId: p < 0.028, mph(A): p < 0.0086, cmlA5: p < 0.0062, linG: p < 0.0090, aadA5: p < 0.0065, AAC(6’)-Ib7: p < 0.0078, tet(M): p < 0.0074, sul3: p < 0.010, floR: p < 0.0078, aadA2: p < 0.018. MRank I ARGs: AAC(3)-IId: p < 0.013, mph(A): p < 0.0052, cmlA5: p < 0.0096, linG: p < 0.0087, aadA5: p < 0.0053, tet(M): p < 0.047, sul3: p < 0.015, AAC(3)-IIe: p < 0.034, floR: p < 0.0095, aadA2: p < 0.0049, tet(B): p < 0.021. occurrence frequency of MRank I ARGs: aadA8: p < 0.041, lnu(G): p < 0.041, AAC(3)-IId: p < 0.045, erm(B): p < 0.036, tet(M): p < 0.036.
The association of soil Rank I ARGs with other habitats
Metagenomic analysis suggested that soil may be a potential sink for Rank I ARGs in humans and livestock (Fig. 1). To validate this hypothesis, FEAST analyses of the Rank I ARG composition in E. coli isolate genomes were performed to assess the influence of human and livestock (potential sources) on soil (a potential sink) across different periods (Fig. 4a, b)27. Briefly, human source (26.6%) and chicken source (25.3%) contributed the most to soil source, followed by cattle source (23.9%) (Fig. 4a). Remarkably, the relative importance of different sources showed significant variation across different periods (Fig. 4b). The contributions of E. coli from human and chicken sources to soil showed a significant upward trend, while those from cattle and swine sources exhibited a decline. Human contributions surpassed those from chicken in Period B (2011–2013) and have remained the dominant source for soil E. coli through the most recent period (Period E, post-2020) (Fig. 4b).
Fig. 4: Attribution of soil Rank I ARGs and the connectivity of soil with other habitats.
a The attribution of soil Rank I ARGs. Total 8388 E. coli isolates were included in the FEAST analysis. 999 rounds of random sampling were carried out, and 38 genomes per habitat per period were included in each calculation. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, dots indicate the average value from each random sampling. Significant comparisons (two-sided t test) between different temperature and different groups were calculated and detailed p value and T stat were provided in the Supplementary Data 11. b The impact of different habitats on the composition of soil Rank I ARGs. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, and dots indicate the average value from each random sampling. Nine hundred and ninety-nine rounds of random sampling were carried out, and 38 genomes per habitat per period were included in each calculation. Significant comparisons (two-sided t test) between different temperature and different groups were calculated and detailed p value and T stat were provided in the Supplementary Data 11. The correlation was statistically tested using linear model (two-sided). R2 and slope is calculated from the linear model. Cattle feces: p < 0.017, human: p < 0.026, chicken feces: p < 0.034. c Workflow for calculating connectivity. d The connectivity of soil with other habitats over time. 999 rounds of random sampling were carried out, and 30 genomes per habitat per period were included in each calculation. The results used eptA (one gene type of Rank I ARGs) as an example. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, dots indicate the average value from each random sampling. Significant comparisons (two-sided t test) between different temperature and different groups were calculated and detailed p value and T stat were provided in the Supplementary Data 12. The correlation was statistically tested using linear model (two-sided). R2 and slope is calculated from the linear model. Human: p < 0.0034, chicken feces: p < 0.0065, pig feces: p < 0.031.
To further explore the association between Rank I ARGs carried by E. coli from different sources and soil source E. coli, we introduced an evaluation standard to assess their connectivity. Briefly, this method calculated the degree of habitat alternation for identical Rank I ARGs (gene level) on a phylogenetic tree and normalizes it against theoretical extremes to determine the connectivity between other habitats and soil (Fig. 4c). Since eptA is the dominant Rank I ARG gene conferring resistance to polymyxin in E. coli, we used this gene as an example to calculate the connectivity of Rank I ARGs between different habitats and soil. Results showed that the highest level of connectivity could be observed between human and soil (connectivity = 0.28), followed by chicken (connectivity = 0.25), and swine (connectivity = 0.21). Consistent with the FEAST results, we also observed a significant increase in the connectivity between humans and soil over time, exhibiting an almost linear relationship (p < 0.001, R² = 0.91). These results underscored that soil was an important node for the dissemination of Rank I ARGs. Moreover, E. coli resistome for human sources and soil sources were highly associated, both in terms of potential relationship (calculated by FEAST) and connectivity.
Genome pairs with identical ARG sequences across different habitats
In light of the observed connectivity between soil and other habitats, we further examined the proportion of identical ARG sequences between E. coli genomes from soil and other habitats. We screened all 8388 genomes for ARGs genetic blocks (complete open reading frames, ORFs) of 100% identical sequences shared between any pairs of genomes from different habitats. To avoid inflating estimates of the number and frequency of sequence sharing events, we defined these events as the proportion of genome pairs that share at least one similar sequence, rather than using the absolute count of distinct BLAST sites. To ensure the reliability of the results, we randomly selected 100 genomes from each habitat for each calculation and repeated this process 999 times (Fig. 5a and Fig. S14). Only genome pairs that included soil sources E. coli were focused, and a total of 44,905,050 calculations were performed. Results showed that the proportion of sequence sharing events between soil and other habitats reached 8.7% for total ARGs and 0.8% for Rank I ARGs (Fig. 5b).
Fig. 5: Genome pairs of identical ARGs in soil and other habitats.
a Workflow for finding genome pairs with identical ARGs. b Proportion of genome pair with identical ARGs between soil-soil and soil-other habitats. Soil indicated that the genome pairs were both from the soil habitat. c SNPs between various genome pairs. The pie chart represented the proportion of SNPs with counts ≥1000 and <1000. All calculations (44,905,050 pairs) were included. d Percentage of VGT (SNP < 1000) and HGT (SNP ≥ 1000) in various habitats. e Phylogenetic tree of genomes isolated from soil and cattle with SNP ≥ 1000. f Phylogenetic tree of genomes isolated from soil and humans with SNP ≥ 1000. g Phylogenetic tree of genomes isolated from soil and chicken with SNP ≥ 1000. h Phylogenetic tree of genomes isolated from soil and swine with SNP ≥ 1000. i Proportion of genome pair with identical ARGs between soil-soil and soil-other habitats in different periods. Soil indicated the genome pairs were both from the soil habitat. In the boxplots of panels, hinges indicate the 25th, 50th, and 75th percentiles, whiskers indicate 1.5× interquartile ranges, and dots indicate the average value from each random sampling (n = 999 rounds of random sampling). Significant comparisons (two-sided t test) between different habitats and different periods were calculated, and detailed p value and T stat were provided in the Supplementary Data 13. j Changes in identical Rank I ARGs copy numbers between soil and human across different periods. The copy numbers of Rank I ARGs was normalized to a range of 0 to 1 in each period.
The similarity between sequences may result from both vertical gene transfer (VGT) and horizontal gene transfer (HGT). To distinguish them, we compared single-nucleotide polymorphisms (SNPs) in the vertically transmitted, slow-evolving genes of each genome pair29. The DNA sequences corresponding to the 120 ribosomal genes were extracted via GTDB-Tk, the maximum length was 175,318 bp and the average length was 161,396 bp (Fig. S15). A molecular clock of 1 SNP/genome/year and a genome size of 106 bp was assumed30,31, which is a strict standard for E. coli32. The single-copy protein could generate up to 8.3 SNPs over a 47-year period (1977–2023, the years of genome collection) across 175,318 bp. We set a threshold of 1000 SNPs to distinguish whether the sequence sharing events were caused by VGT along bacterial lineages (SNPs < 1000) or HGT (SNPs ≥ 1000). This threshold, significantly higher than the 8 SNPs expected from VGT, ensures the conservativeness of the results. It showed that more than 40.7% of genome pairs exhibited sequence similarity that could be considered indicative of HGT, while the remaining 59.3% of genome pairs were classified as VGT (Fig. 5c). Additionally, when comparing SNP thresholds of 10 and 100, we found that the 10 SNP threshold filtered only 2.9% of bacterial pairs, and the 100 SNP threshold filtered 3.4%, further supporting the stringency of the 1000 SNP threshold (Fig. S16). Indeed, nearly 95.0% of the Rank I ARGs involved in potential HGT events had MGEs located within 5 kb upstream and downstream (Fig. S17).
The proportion of sequence sharing events occurring in genome pairs with over 1000 SNPs was the highest between human-soil and chicken-soil, accounting for 53.3% and 51.2%, respectively (Fig. 5d). To visualize the phylogenetic distances between genomes from different habitats, four phylogenetic trees were constructed, including cattle-soil (Fig. 5e), human-soil (Fig. 5f), chicken-soil (Fig. 5g), and swine-soil (Fig. 5h) pairs. The results revealed distinct habitat-specific characteristics in the phylogenetic distribution of E. coli genomes from human and soil sources (Fig. 5f). Given the highest number of sequence sharing events between human and soil genome pairs, along with distinct habitat-specific features in the genomic phylogenies, we suggested the possibility of cross-habitat horizontal gene transfer (HGT) between humans and soil. We further analyzed the temporal distribution patterns of sequence similarity events between various habitats and soil (Fig. [5g](https:/