Main
In 1752, King Frederik V of Denmark, known for his “generous attitude […] towards natural science and applied art”, commissioned the Flora Danica project, initiating an ‘Opus Incomparibile’ that took 122 years and produced one of the world’s most unique works in natural history6. Over 3,000 botanic engravings and 54 booklets of flowers and plants were completed, with which, “according to the unanimous contention of all connoisseurs”, “the whole world can eventually reap all the fruits that follow the extension of a science which, with regard to the benefit of mankind, is one of the most useful and without which medicine and economics would lack important advantages”6. In 2019, we initiated the Microflora Danica (MFD) project with the aim of cataloguing the microbiome of Denmark, in the hope that the microflora of Denmark can be similarly studied and that its riches can contribute to the extension of science.
The MFD dataset
The MFD dataset comprises 10,683 samples, chosen to capture the diversity and geographical coverage of Danish microorganisms, associated with Illumina shotgun metagenomic DNA sequencing (average 4.5 Gb per sample, total 48.2 Tb) (Fig. 1a). Moreover, the dataset incorporates 14.9 million bacterial (median 4,528 bp) and 13.4 million eukaryotic rRNA operon sequences (median, 4,035 bp), as well as 6.4 million nearly full-length bacterial 16S rRNA gene sequences (median 1,355 bp). These data originate from a subset of samples (450 and 412, respectively) reflecting sample diversity while maintaining geographical coverage of the wider dataset (Fig. 1a). The samples are associated with GPS coordinates (Fig. 1b) and a highly curated five-level ontology (MFDO) (that is, habitat classification system) (Fig. 1c) that can be linked to other ontologies (EMPO3,7, Natura 2000 (ref. 8) and EUNIS9). The habitat ontology comprises sample type, area type and up to three levels of increasingly specific habitat description: MFDO1, MFDO2 and MFDO3 (MFD ontology levels 1, 2 and 3) (Fig. 1a,c). The area type ‘natural’ describes habitats not directly managed or located in urban areas. The Danish landscape is made up mainly of agriculture (63.0%), buildings and infrastructure (13.9%), forest (13.3%) and natural areas (9.2%), as well as streams and lakes (2.8%)10. The MFDO1 habitat ontology level represents 28 different categories (Fig. 1c) and reflects the primary land uses of fields (that is, croplands; 3,003 samples), grassland formations (1,393 samples), forests (1,328 samples) and greenspaces (that is, urban parks; 711 samples). The breadth of sampling is exemplified by our coverage of 27% of the 986 registered lakes in Denmark. Combined, the datasets, ontology, associated metadata and spatial resolution provide an extraordinary resource to investigate research questions related to diversity and function in microbial ecology.
Fig. 1: MFD sampling campaign and ontology.
a, The mean ± s.d. metagenome and rRNA amplicon data sequencing depths. The unit of measurement for depth is reads, except for metagenomes, for which the depth is reported as bp. M, million. b, The MFD samples cover the land of Denmark and its surrounding waters. The map depicts the locations of the samples used for metagenomics, and the colours represent the three different sample types. The top right cutouts show the island of Bornholm, which is east of Copenhagen and south of Sweden. The base map was retrieved from the Eurostat countries portal EuroGeographics for the administrative boundaries, © EuroGeographics 2025. c, Sample counts in the first three levels of the habitat ontology. The MFD habitat ontology accounts for a variable number of samples per category/branch. The Sankey diagram reports the first three levels of the ontology, and the thickness of the branches is proportional to the number of samples in each category. Only classes with n > 20 samples and non-empty MFDO1 classification are reported. Each habitat category is followed by the number of samples for that category in parentheses. The Sankey plot, including all five levels of the ontology, is provided in high resolution at Zenodo (https://doi.org/10.5281/zenodo.17162544).
Establishing Denmark’s microbiome
To facilitate sequence diversity analysis of bacteria, we used nearly full-length 16S rRNA genes extracted from the rRNA operon data, sequenced on the PacBio platform, and nearly full-length 16S rRNA gene amplicon data generated using unique molecular identifiers (UMIs). The UMI approach relies on the use of molecular nucleotide template tagging to achieve high-accuracy single-molecule consensus calling on the Oxford Nanopore sequencing platform. The addition of UMIs to both ends of the template enables the bioinformatic identification and removal of chimeras formed during PCR11. We investigated the bacterial sequence diversity and novelty using the combination of these data (Methods and Extended Data Fig. 1). The combined nearly full-length 16S (V1–V8) rRNA gene dataset included 458 habitat-representative samples and 21.3 million sequences, with 605,861 amplicon sequence variants (ASVs) representing 141,252 bacterial species (98.7% operational taxonomic units (OTUs))12 (Fig. 2a). Comparison of the species-level OTUs (clustered at 98.7% identity) against the SILVA v.138.1 database revealed that 82.5% were from new species (<98.7% identity) (Fig. 2a). However, the discovery rate of novelty quickly decreased at the higher taxonomic levels, with only 1.9% of the OTUs belonging to new families (<86.5% identity) (Fig. 2a). This suggests that while 16S rRNA gene sequences from bacteria originating from temperate northern European habitats are well represented in public databases at the higher taxonomic levels, the species-level diversity remains predominantly uncaptured.
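The novelty calls above reduce to comparing each OTU's best-hit identity against rank-specific cut-offs. The following is a minimal sketch of that comparison, not the authors' pipeline: it uses only the two thresholds quoted in the text (98.7% for species, 86.5% for family), and the function names are hypothetical.

```python
# Identity thresholds quoted in the text (percent identity to the closest
# reference sequence). Intermediate ranks are left out on purpose because
# this section does not state their cut-offs.
SPECIES_THRESHOLD = 98.7
FAMILY_THRESHOLD = 86.5


def novelty_level(best_hit_identity: float) -> str:
    """Label one OTU by the highest rank at which it is novel."""
    if not 0.0 <= best_hit_identity <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if best_hit_identity < FAMILY_THRESHOLD:
        return "new family"
    if best_hit_identity < SPECIES_THRESHOLD:
        return "new species"
    return "known species"


def novelty_fractions(identities):
    """Fraction of OTUs below each threshold (nested, hence cumulative)."""
    n = len(identities)
    if n == 0:
        raise ValueError("no identities supplied")
    return {
        "new species": sum(i < SPECIES_THRESHOLD for i in identities) / n,
        "new family": sum(i < FAMILY_THRESHOLD for i in identities) / n,
    }
```

Because the thresholds are nested, every OTU counted as a new family is also counted as a new species, which is why the reported family-level fraction (1.9%) is a subset of the species-level fraction (82.5%).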
Fig. 2: Novelty, diversity and read classification based on nearly full-length 16S and 18S rRNA gene sequences and the MFG 16S reference database.
a, Sequence novelty of species-level clustered bacterial 16S rRNA gene OTUs (98.7%) against SILVA19 v.138.1 NR99 and eukaryotic 18S rRNA gene OTUs (99.0%) against PR2 (ref. 16) v.5.0.0. Taxonomic thresholds for bacteria were adapted from ref. 12, whereas those for eukaryotes were calculated using a similar approach based on sequences from the PR2 v.5.0.0 database (Supplementary Note 1). Where indicated by an asterisk, thresholds were proposed on the basis of the sequence identity between species-level classified 18S rRNA gene sequences in the PR2 database and their closest relatives within and across ranks; meaningful thresholds above the family level could not be determined. b, Species-level rarefaction curves of UMI-based bacterial 16S rRNA and eukaryotic 18S rRNA gene OTUs from terrestrial samples. Insets: MFDO1 habitat-specific rarefaction curves for habitats represented by at least nine samples. c,d, Database evaluation based on 16S rRNA gene reads extracted from selected MFD metagenomes (c) and V4 OTUs clustered at 99% identity from the GPC23 dataset (d). Classification of metagenomic reads or OTUs was done using the SINTAX63 classifier. The following databases were used in addition to the MFG database created here: GreenGenes2_2022_10 taxonomy backbone22, GTDB_ssu_all_r220 (ref. 34) and SILVA_138.1_SSURef_NR99 (ref. 19). All databases were clustered at 98.7% sequence identity to enable direct comparison.
We used the nearly full-length 16S rRNA UMI dataset to estimate the Danish terrestrial bacterial richness (species count). The dataset encompasses 5.8 million 16S rRNA gene reads and 101,423 species (98.7% OTUs)12 across 309 habitat-representative samples (Fig. 2b). Rarefaction analysis showed underlying variation in the detection of species among MFDO1-level habitats, but approached saturation across the combined dataset, indicating that most species were captured by the sequencing effort (Fig. 2b). To support this, we calculated the habitat and pan-habitat community coverage to estimate how well our dataset captures the total terrestrial diversity of bacterial species13. We found that the community coverage at the MFDO1-level habitat ranged from 0.46 to 0.90, showing a strong correlation with sampling effort (r = 0.95, 7 degrees of freedom, t = 7.96, 95% confidence interval (CI): 0.77–0.99, P = 9.4 × 10⁻⁵), but that the overall terrestrial community coverage amounted to 0.98, again indicating near complete species detection. Hill diversity estimates place the lower bound of the bacterial species count (Hill richness) in terrestrial MFD at a minimum of 114,400 species (95% CI = 113,897–114,902), with 43,447 common (intermediate to high frequency, Hill–Shannon14) and 22,036 dominant (most frequent, Hill–Simpson14) species based on their observation frequency in the dataset13,15. The community coverage estimates and rarefaction analysis indicate that the nearly full-length 16S rRNA gene dataset captures the collective Danish bacterial species pool across the investigated habitats and sets a conservative minimum estimate of the total environmental bacterial richness of terrestrial Denmark to be 114,400 species.
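The three Hill measures named above (richness, Hill–Shannon and Hill–Simpson) are the Hill numbers of order 0, 1 and 2. As a minimal sketch, the observed (plug-in) values can be computed from a vector of counts as shown below. This is an illustration only: the estimates reported in the text rely on rarefaction and extrapolation, which this sketch does not attempt, and the function name is hypothetical.

```python
import math


def hill_number(counts, q):
    """Observed Hill number of order q for a vector of non-negative counts.

    q = 0 gives richness, q = 1 Hill-Shannon, q = 2 Hill-Simpson.
    """
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("at least one count must be positive")
    p = [c / total for c in positive]
    if q == 1:
        # Limit of the general formula as q -> 1: exp(Shannon entropy).
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


# A perfectly even community has the same Hill number at every order,
# whereas a community dominated by one taxon drops quickly as q increases.
even = [10, 10, 10, 10]
skewed = [97, 1, 1, 1]
profile_even = [hill_number(even, q) for q in (0, 1, 2)]
profile_skewed = [hill_number(skewed, q) for q in (0, 1, 2)]
```

Higher orders weight frequent taxa more heavily, which is why the text can describe the Hill–Shannon count as "common" species and the Hill–Simpson count as "dominant" species.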
To investigate the diversity of eukaryotes, we used the eukaryotic rRNA operon sequences. These sequences exhibited a strong phylogenetic signal as they include both the ITS1 and ITS2 regions. However, the absence of a comprehensive rRNA operon reference database prompted us to focus our analysis on extracted nearly full-length (V4–V9) 18S rRNA genes that can be directly compared to the PR2 database16. The 13.4 million eukaryotic nearly full-length 18S rRNA gene sequences resolved into 28,575 ASVs representing 12,447 species (99% OTUs; Supplementary Note 1). Mapping of the species-representative sequences against PR2 revealed that most species (77%) are novel (Fig. 2a). Furthermore, 32% of the sequences had less than 93% similarity to a sequence in PR2, indicating high novelty at approximately the family level (Supplementary Note 1). Eukaryotic diversity varied between habitats but, based on Hill diversity estimates, the eukaryotic species count (Hill richness) is estimated to be a minimum of 19,295 species (Supplementary Note 2 and Extended Data Fig. 2). These findings show that vast microeukaryotic diversity remains undocumented.
MFG 16S rRNA gene database
Confident taxonomic assignment of 16S rRNA gene sequences relies on representative databases with clear taxonomic frameworks that include uncultured taxa17,18. As current universal reference databases lack the specificity we required, we used our extensive nearly full-length 16S rRNA gene dataset to create a comprehensive reference database for taxonomic classification of 16S rRNA gene reads extracted from our metagenomes. To enhance classification accuracy, we supplemented our sequences with high-quality sequences from SILVA v.138.1 SSURef NR99 (ref. 19), EMP500 (ref. 3), AGP70 (ref. 11), MiDAS20 and ref. 21 (Methods). This resulted in a total of 30.2 million sequences, which were processed using Autotax18 to create the Microflora Global (MFG) 16S rRNA gene reference database. The 1,034,840 unique ASVs were clustered at 98.7% nucleotide identity, representing 342,673 bacterial or archaeal species-level OTUs with a complete seven-level taxonomic string.
To evaluate the MFG 16S reference database, we first compared classification of metagenome-derived 16S rRNA gene fragments from a subset (n = 2,348; Methods) of our samples using both the MFG 16S reference database and publicly available databases clustered at the species level (98.7% identity) (Fig. 2c). We classified 46.1% (4.79 million out of 10.40 million) of all extracted 16S rRNA gene reads to genus level using the MFG 16S reference database, compared with 32.2% (3.35 million out of 10.40 million) classified by the second-best-performing database GreenGenes2 (ref. 22) (Fig. 2c). We next evaluated our database’s ability to classify data beyond Denmark’s temperate Northern Hemisphere habitats, using the Global Prokaryotic Census (GPC) V4 OTU dataset23 (Fig. 2d). The MFG 16S reference database was able to classify 47.7% (1.05 million out of 2.20 million) of the GPC OTUs at the genus level, compared with 32.7% (0.72 million out of 2.20 million) classified by GreenGenes2 (ref. 22). The combined results confirm that the MFG 16S reference database greatly improves classification not only for our samples, but for microbial profiling in general.
Diversity for habitat management
The level of microbial diversity in a habitat is often characterized by the alpha diversity (the richness in a single sample or the average sample of a habitat) and by the gamma diversity (the total observed richness across all samples within a habitat)24. In contrast to the aboveground macro biodiversity, disturbed (that is, managed or directly affected by human activities) soils have been shown to have higher richness than undisturbed natural areas, both at continental and global scales24,25. Our detailed habitat ontology and the number of samples in each habitat type enabled us to re-evaluate these observations using both the metagenome-derived 16S rRNA gene fragments and the nearly full-length 16S UMI rRNA gene dataset taxonomically classified against the MFG 16S reference database.
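The distinction between alpha and gamma diversity can be made concrete with a small sketch. It assumes each sample is a mapping from taxon to count and uses observed richness only, so it illustrates the definitions rather than the estimators used in the study; all names are hypothetical.

```python
from statistics import median


def alpha_richness(sample):
    """Observed richness of one sample (mapping taxon -> count)."""
    return sum(1 for count in sample.values() if count > 0)


def gamma_richness(samples):
    """Observed richness of all samples of one habitat pooled together."""
    pooled = set()
    for sample in samples:
        pooled.update(taxon for taxon, count in sample.items() if count > 0)
    return len(pooled)


def summarize_habitat(samples):
    """Median per-sample richness (alpha) and pooled richness (gamma)."""
    alphas = [alpha_richness(s) for s in samples]
    return {"median_alpha": median(alphas), "gamma": gamma_richness(samples)}


# Toy habitats: the first repeats the same taxa in every sample, the second
# has fewer taxa per sample but no overlap between samples.
uniform = [{"a": 4, "b": 2, "c": 1}, {"a": 1, "b": 5, "c": 2}, {"a": 3, "b": 3, "c": 3}]
patchy = [{"a": 1, "b": 1}, {"c": 1, "d": 1}, {"e": 1, "f": 1}]
uniform_summary = summarize_habitat(uniform)
patchy_summary = summarize_habitat(patchy)
```

In this toy case the uniform habitat has the higher alpha diversity (3 versus 2) but the lower gamma diversity (3 versus 6), showing that the two measures can rank habitats in opposite order.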
To ensure that our data enabled valid comparisons between samples, we investigated biases introduced from sample treatment and location. Most of the agricultural samples were treated differently compared with the other soil samples (Methods), but this treatment had no observable effect on alpha and beta diversity and amounted to only around 2% of the community variation (Supplementary Note 3). We accounted for spatial bias resulting from more densely sampled locations by estimating the spatial autocorrelation using distance–decay analysis on the metagenome-derived 16S rRNA gene fragments (Extended Data Fig. 3). On the basis of the results, we identified representative samples of the MFDO1 habitats within the 10 km reference grid of Denmark (2,348 samples; Methods and Extended Data Fig. 1). Hierarchical clustering based on the average between-habitat Bray–Curtis dissimilarity of these samples (beta diversity) largely captured the expected relationships based on similar aboveground characteristics (for example, grass cover, monocultures, exposure) between the habitats. These relationships are exemplified by the clustering of fields, greenspaces and grassland formations (Fig. 3a).
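For readers unfamiliar with the dissimilarity measure used here, the sketch below computes a Bray–Curtis dissimilarity on Hellinger-transformed counts, the combination named in the legend of Fig. 3. It is a plain-Python illustration for two samples with taxa in the same order, not the authors' implementation, and the function names are hypothetical.

```python
import math


def hellinger(counts):
    """Hellinger transform: square root of each relative abundance."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [math.sqrt(c / total) for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must list the same taxa in the same order")
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        raise ValueError("both vectors are empty")
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


# Counts per genus for three toy samples (same genus order in each).
field_a = [50, 30, 20, 0]
field_b = [45, 35, 20, 0]
bog = [0, 5, 15, 80]
within = bray_curtis(hellinger(field_a), hellinger(field_b))
between = bray_curtis(hellinger(field_a), hellinger(bog))
```

The value is 0 for identical profiles and 1 for samples that share no taxa. Because the transform starts from relative abundances, differences in sequencing depth alone do not change the result.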
Fig. 3: Microbial diversity of the Danish terrestrial habitats.
a, Diversity overview of the selected habitats. Each facet addresses a different measure of diversity. The nine MFDO1 habitats are represented in the rows of the multifacet plot. The dendrogram shows the between-group (branches) and within-group (nodes) Hellinger-transformed Bray–Curtis (BC) dissimilarity using the genus-level-classified 16S rRNA gene fragments from the spatially thinned dataset. Bootstrap values were calculated using 100 iterations. The heat map shows the relative abundances of the 20 most abundant phyla. The box plots of alpha diversity and bar charts of gamma diversity are based on the UMI 16S rRNA gene data. The number of biologically independent samples used for the diversity measurements per habitat is indicated. The hinges of the box plots correspond to the 25th, 50th and 75th percentiles of the distributions, and the whiskers extend to 1.5× the distance between the 25th and 75th percentiles. All individual samples are shown as points (with jitter for visualization). Gamma diversity (Hill–Shannon diversity) reflects a single value per habitat (that is, bar) based on rarefaction and extrapolation of n samples, and the error bars report the associated 95% CIs. b, Ordination of the metagenome (MG) dataset. PCoA of the 9,643 metagenome samples, coloured according to MFDO1 habitat description, together with the results from the ANOSIM and PERMANOVA; P values were derived from 999 permutations in both cases. The visualization depicts the first two components. The contour plot was added to show the density of points. c, Subpanels of the individual 18 selected MFDO1 habitats coloured and presented with the results of the contrasts analysis. d, The sample distribution in the ordination space for MFDO1 ‘soil, natural, bogs, mires and fens’, coloured by classifications at the MFDO2 ontology level.
We calculated the alpha diversity from the nearly full-length UMI 16S rRNA gene data. In contrast to previous studies at the continental-Europe25 and global24 scale, which found the highest alpha diversity in samples from disturbed habitats, we found that the median bacterial diversity was highest in bogs, mires and fens (1,705 species) and lowest in temperate heath and scrub (1,274), with the diversity of disturbed habitats ranging in between (Fig. 3a and Supplementary Note 4). We found no significant difference in alpha diversity between fields, forests or grassland formations, contradicting the previous results from continental Europe while agreeing with global findings (Fig. 3a, Extended Data Table 1 and Supplementary Note 4). Additional large studies on other continents will be vital to resolving the effect of human disturbance on alpha diversity.
In contrast to the alpha diversity results, gamma diversity revealed key differences between disturbed and natural habitats (Supplementary Note 4). Fields had the lowest gamma diversity (12,797 common species) and, along with greenspaces (20,336), were considerably less diverse than the more natural grassland formations (26,721). This trend was mirrored by sediments, with urban sediments (21,609) encompassing lower gamma diversity than natural sediments (27,126). Human disturbance reduces ecological breadth by creating more uniform environmental conditions, leading to lower gamma diversity26. This was supported by our comparison between urban and natural environments, in which greater environmental heterogeneity is encompassed by the natural habitats, reflecting greater habitat breadth and, consequently, higher gamma diversity. Overall, these data suggest that there is a gamma diversity gradient impacted by the level of perturbation, from highly disturbed fields to moderately disturbed greenspaces and relatively undisturbed grassland formations. These findings support an apparent homogenization (that is, low gamma diversity considering the high alpha diversity) of species in disturbed habitats, a pattern that was recently identified in other studies27. Habitat species homogenization was also supported by the Bray–Curtis analysis, which revealed low within-habitat dissimilarity of the prokaryotic communities (Fig. 3a).
Low gamma diversity in fields was most pronounced among bacterial communities (Fig. 3a), but also visible in the eukaryotic data (Supplementary Note 2), and reflected the aboveground macro biodiversity. Notably, temperate heath and scrub had similarly low gamma diversity to fields, but also low alpha diversity. However, this habitat is selective, defined by dry, infertile and acidic conditions (EUNIS habitat classification9), in contrast to the irrigated, nutrient- and pH-adjusted agricultural land.
These results show that the same bacterial species are found in the disturbed habitats, and that disturbed habitats are under selective pressures comparable to natural habitats with defined abiotic constraints. This highlights the need to incorporate gamma diversity when assessing microbial diversity. Including this broader perspective is particularly important when monitoring the impacts of land use and climate change, where community homogenization could lead to reduced ecosystem resilience and have implications for ecosystem functions28.
Modelling for habitat classification
After revealing the importance of gamma diversity for biodiversity assessments, we investigated how the microbial community could be used to classify habitats and its potential for tracking future habitat changes. Exploratory principal coordinates analysis (PCoA) performed on the eukaryotic 18S rRNA gene dataset revealed some separation between MFDO1 habitat categories (n = 363, analysis of similarities (ANOSIM), R = 0.46, P = 0.001, permutational analysis of variance (PERMANOVA), R² = 0.07, P = 0.001; Supplementary Note 2). However, for the prokaryotic community, PCoA revealed good separation between MFDO1 habitats based on the metagenome-derived 16S rRNA gene fragment microbial community composition (n = 9,643, ANOSIM, R = 0.69, P = 0.001, PERMANOVA, R² = 0.27, P = 0.001) (Fig. 3b,c). An exception was MFDO1 ‘bogs, mires and fens’, which showed large dispersion in ordination space. At the MFDO2 level, this habitat consists of both calcareous fens and sphagnum acid bogs, which have large differences in pH that impact microbial communities29 (Fig. 3d).
To determine the potential for microbial community DNA to be used in habitat classification, we investigated whether the 16S rRNA gene fragments could predict the habitat ontology (Fig. 4 and Supplementary Note 5). We evaluated habitat classifications using the precision–recall area under the curve (PR-AUC; Fig. 4). Some habitats were difficult to model (that is, they had a lower PR-AUC; Fig. 4), such as the various types of fields, where the level of shared taxa was large. Conversely, other habitats with higher PR-AUC, such as saltwater and wastewater, are associated with more specialized microbiomes. In general, low model scores reflected habitats in which samples would be misclassified to a few other selected habitats; for example, samples from grassland formations, greenspaces and fields were often misclassified as each other (Supplementary Note 5).
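To illustrate the metric, the sketch below computes a per-class PR-AUC as average precision, a common step-wise estimate of the area under the precision–recall curve. How the PR-AUC was computed in the study is not specified in this section, so the estimator, the one-class-versus-rest framing and the function name are all assumptions.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve for one class.

    labels: 1 for samples that belong to the class, 0 otherwise.
    scores: model score for the class (higher means more confident).
    Tied scores are kept in input order; no special tie handling is done.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("the class has no positive samples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            area += hits / rank  # precision at the rank of each true hit
    return area / n_positive


# A class whose samples all receive the top scores is separated perfectly;
# interleaving with other classes lowers the score.
well_separated = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
confused = average_precision([0, 1, 0, 1], [0.9, 0.8, 0.7, 0.6])
```

Unlike accuracy, this score depends on the ranking of the positive samples only, which makes it suitable for habitat classes that contain few samples.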
Fig. 4: Random-forest classification of habitat ontology levels using prokaryotic data.
The genus-level models were used to compile the per-class PR-AUC of every node of the ontology. The metric spans from 0 to 1, where 0 and 1 mean that none and all, respectively, of the samples of the given class were classified correctly. The mean results over iterations (n = 25 independent iterations) are reported in the tree labels and coloured accordingly, with brighter nodes carrying higher values. Moreover, the top 20 genera, according to variable importance (box plot at the bottom, computed using the MFDO3 models), are reported with their median relative abundance for each of the terminal nodes of the ontology. The three hinges of the box plots correspond to the 25th, 50th and 75th percentiles of the distributions, and the whiskers extend to a maximum of 1.5× the distance between the 25th and 75th percentile hinges. All of the individual samples are shown as points (with jitter to improve visualization). The sum of the variable importance across all variables was scaled to 100 for each model. The ranking of the variables indicates which genera have a greater discriminant power in the models. Notably, the models were reliable in classifying samples from agricultural soils (PR-AUC = 0.95) but not in classifying individual crop types.
Considering which prokaryotic genera were the most important in discriminating among habitats (that is, highest variable importance) (Fig. 4), the strongest signal was provided by Paenibacillus, whose species have been found to be associated with crops, promoting plant growth and protection from pathogens, as well as fixation of nitrogen30. Paenibacillus was distributed across soils and sediments with higher counts in field habitats, perhaps functioning as a predictor for sample type and land use. Our findings support low-resolution discrete habitat classification (that is, MFDO1) using microorganisms, but not higher-resolution classifications (that is, MFDO2). This agrees with previous studies proposing the redefinition of habitats using continuous gradients31. We believe microbiome data could provide a scalable solution to future classification efforts, enabling gradients to be compared to measure or monitor changes related to climate, sustainable farming choices or restoration progress. Identifying the core microorganisms belonging to specific habitats, or habitat gradients, may help to simplify the use of microbiome data.
Core genera across Danish habitats
Core microorganisms are abundant and widespread within habitats, potentially reflecting populations with habitat-specific adaptations, functions and ecological importance32. We identified abundant core community genera in the habitats across all five habitat ontology levels (genera with more than 50% habitat-specific prevalence, as well as at least 0.1% relative abundance; Supplementary Data 1, Extended Data Fig. 4 and Supplementary Note 6).
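The core-genus criterion can be sketched as a simple filter. The version below reads the definition as follows: a genus counts as present in a sample when its relative abundance is at least 0.1%, and it is core for a habitat when it is present in more than 50% of that habitat's samples. This is one possible reading of the stated criteria, not the authors' code; the function name, parameter names and toy genus labels are hypothetical.

```python
def core_genera(samples, min_abundance=0.001, min_prevalence=0.5):
    """Return the sorted core genera of one habitat.

    samples: list of dicts mapping genus -> relative abundance (0 to 1).
    A genus is core when it reaches min_abundance in a fraction of samples
    strictly greater than min_prevalence.
    """
    if not samples:
        raise ValueError("habitat has no samples")
    hits = {}
    for sample in samples:
        for genus, abundance in sample.items():
            if abundance >= min_abundance:
                hits[genus] = hits.get(genus, 0) + 1
    n = len(samples)
    return sorted(g for g, k in hits.items() if k / n > min_prevalence)


# Toy habitat with placeholder genus labels.
habitat = [
    {"genus_A": 0.02, "genus_B": 0.0005},
    {"genus_A": 0.01, "genus_B": 0.003},
    {"genus_A": 0.004, "genus_C": 0.2},
    {"genus_B": 0.002, "genus_C": 0.001},
]
core = core_genera(habitat)
```

Here only genus_A qualifies: it passes the abundance cut-off in three of four samples, whereas genus_B and genus_C pass it in exactly half and so fail the strict prevalence threshold.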
Habitat-specific core genera were more numerous in habitats with strong selective environmental gradients (for example, halotolerance), or constrained habitats, such as biogas systems (Supplementary Data 2 and Extended Data Fig. 4). Conversely, we observed fewer habitat-specific core species if no habitat-specific selective pressure was present. For example, the median of core genera unique to the soil MFDO1 habitats was two, showing that many of the genera were shared among two or more MFDO1 habitats (such as fields and greenspaces). Combined with the observed model misclassification of ecologically similar environments, these findings suggest that despite the vast dispersal capabilities of microorganisms, the prokaryotic community follows a continuous gradient of change and is thus more influenced by specific environmental factors as opposed to geographical location, in accordance with the Baas Becking hypothesis: everything is everywhere, but the environment selects33.
The alpha, beta and gamma diversity patterns, and high model score for fields among the terrestrial environments (Fig. 4), showed that land disturbance and management lead to similar microbial communities (Fig. 3a). Land-management practices, such as nutrient amendment and soil structure degradation, probably drive environmental filtering of the prokaryotic communities27. The disturbed soil habitats (fields, roadside and greenspaces) and the natural soil habitats (bogs, mires and fens; coastal; dunes; forests; grassland formations; rocky habitats and caves; sclerophyllous scrub; and temperate heath and scrub), encompassed 107 and 98 core genera (that is, a core genus in at least one of the habitats under the disturbed or natural categories), respectively (Supplementary Note 6). Comparing the natural and disturbed habitats revealed differences in core genera associated with nitrogen cycling (for example, Nitrospira, and genera within the Nitrososphaeraceae and Nitrosomonadaceae; Extended Data Fig. 4 and Supplementary Note 6), leading us to investigate this functional group more closely.
To provide genome-level resolution, recover potential functional group members and improve the representativeness of public genome databases, such as the Genome Taxonomy Database34 (GTDB), we performed de novo assembly of the 10,683 metagenomes (Supplementary Note 7). We recovered 19,253 bacterial and archaeal metagenome assembled genomes (MAGs) of at least medium quality (Methods and Extended Data Fig. 5). These MAGs represented 5,518 species (95% average nucleotide identity clustering) with broad phylogenetic coverage of which 4,604 were novel compared with GTDB34 R220 (Supplementary Note 7). This MFD genome database provides the foundation for functional analysis linked to species identity and habitat distribution and enabled us to examine key participants in the biogeochemical nitrogen cycle, the nitrifiers, across Denmark.
Distribution of Danish nitrifiers
Our investigations into microbial diversity indicated that bacteria and archaea involved in the nitrogen cycle were abundant and form part of the core community differences between disturbed and natural habitats (Supplementary Notes 6 and 8). This microbiome fingerprint reflects the fact that Denmark is one of the most intensively cultivated countries in the world (63% of the land10), with much of its land impacted by management regimes involving fertilization with reactive nitrogen35. As Denmark has a large livestock sector, manure is a major nitrogen source, alongside synthetic fertilizers. Conversion of nitrogen fertilizers by nitrifying microorganisms leads to fertilizer loss, groundwater nitrate contamination, eutrophication of water bodies, and production of the potent ozone-depleting and greenhouse gas nitrous oxide35,36. Consequently, nitrification inhibition with synthetic or biological inhibitors is gaining importance as a way to limit nitrate leaching and nitrous oxide emissions and to increase nitrogen-use efficiency36. The use of two commercial nitrification inhibitors has risen fivefold in the past five years, now covering around 3% (78,129 ha in 2025) of Danish agricultural land37. Notably, the different groups of nitrifiers, comprising ammonia-oxidizing bacteria (AOB) and archaea (AOA), complete ammonia-oxidizing bacteria (CMX) and nitrite-oxidizing bacteria (NOB), vary in their sensitivities to nitrification inhibitors and in their nitrous oxide production rates36,38. To build the knowledge needed to move towards sustainable agriculture, we performed an in-depth analysis of nitrifiers in the MFD datasets. On the basis of an analysis of functional genes (GraftM39), single-copy marker genes (SingleM40) and genome-level quantification (sylph41), we describe the diversity and distribution of Danish nitrifiers and identify new uncharacterized AOAs and NOBs.
Initially, curated gene-based search models of the nitrification marker genes amoA (encoding a subunit of the ammonia monooxygenase of AOB, AOA and CMX) and nxrA (encoding the active-site subunit of nitrite oxidoreductase of NOB and CMX) were created, accompanied by detailed classification of protein phylogeny from the translated genes, to separate nitrifier sequences from homologous sequences in other microorganisms, such as PmoA (particulate methane monooxygenase) and NarG (nitrate reductase)42. Furthermore, we included translated amoA and nxrA sequences from the recovered MFD MAGs in the search mod