A chromosome-level genome assembly of the dried fruit mite Carpoglyphus lactis

Background & Summary

Astigmatid mites represent one of the most diverse and speciose lineages within the subclass Acari, encompassing 77 families and over 6,000 described species [1]. Traditionally, they are thought to have originated from oribatid mites in soil ecosystems [2]. However, in contrast to their ancestral forms, astigmatid mites display exceptional ecological adaptability, allowing them to colonize a wide array of fragmented and transient habitats [3,4]. Furthermore, many species employ a dispersal strategy known as phoresy, wherein they attach to other organisms for transportation [5,6]. Their pronounced ecological versatility and evolutionary success make astigmatid mites an excellent model system for investigating adaptive evolution within the Acari.

The genus Carpoglyphus, belonging to the family Carpoglyphidae, includes six valid species, most of which are recognized as economically important pests associated with stored products [7]. Among them, C. lactis (Linnaeus, 1767), commonly referred to as the dried fruit mite, is a cosmopolitan species reported across a wide variety of food commodities, including dried fruits, beer, dairy products, jams, honey, and wine [8,9]. Under the warm and humid conditions prevalent during summer and autumn, this mite proliferates rapidly, leading to significant economic losses in the food processing and sugar industries. Furthermore, C. lactis can be transmitted to humans via multiple routes, triggering a spectrum of clinical symptoms, such as mite-induced dermatitis, allergic alveolitis, and food mite syndrome [10]. Despite its agricultural and medical relevance, a high-quality reference genome has been lacking, limiting insights into its evolutionary biology and impeding the development of effective management strategies.

In this study, we report a high-quality chromosome-level genome assembly for Carpoglyphus lactis. We employed a hybrid sequencing and assembly approach integrating PacBio long-read sequencing for de novo assembly, Illumina short-read data for error correction and polishing, and Hi-C chromatin interaction data to reconstruct chromosome-scale scaffolds. Gene annotation was performed using a comprehensive pipeline that combined ab initio prediction, homology-based evidence from known protein sequences, and transcriptomic data obtained from both Illumina RNA-seq and Nanopore-based full-length transcript sequencing. This chromosome-level reference genome represents a valuable resource for advancing studies on the evolutionary biology of astigmatid mites and for facilitating the development of targeted strategies for managing this economically significant pest.

Methods

Sample preparation

The C. lactis population used in this study was originally collected from dried fruits maintained in the insect rearing facility of the Institute of Entomology, Guizhou University (Guiyang, Guizhou, China). In the laboratory, mites were fed with yeast and reared in complete darkness at a temperature of 25 ± 2 °C and a relative humidity of 75 ± 5%. To minimize heterozygosity and establish a genetically stable strain, one female and one male were isolated from the original population to initiate a new colony. A five-generation inbreeding program was then carried out, with a single male-female pair founding each successive generation. From the resulting fifth-generation inbred line, adult females were selected as the source material for whole-genome sequencing.

DNA sequencing

For PacBio long-read sequencing, genomic DNA was extracted from 200 adult females. A 20-kb SMRTbell library was constructed using the SMRTbell® Express Template Prep Kit 2.0 (Pacific Biosciences, Cat. #PN 101-853-100, Menlo Park, CA, USA) and sequenced on the PacBio Sequel IIe platform. After quality filtering, approximately 5.5 Gb of high-quality SMRT sequences were retained, providing ~98 × genome coverage. For Illumina short-read sequencing, genomic DNA was also extracted from 200 adult females. A standard whole-genome sequencing library with an insert size of ~350 bp (paired-end 150 bp) was prepared using the TruSeq DNA PCR-Free Library Preparation Kit and sequenced on the Illumina NovaSeq X Plus platform, generating 11.4 Gb of paired-end reads. To anchor contigs into chromosome-scale scaffolds, we employed high-throughput chromosome conformation capture (Hi-C) sequencing technology. Fresh tissue samples from 1,000 adult females were crosslinked with formaldehyde, followed by digestion with the restriction enzyme MboI and proximity ligation to capture three-dimensional chromatin interactions. The resulting libraries were sequenced on the Illumina NovaSeq X Plus platform, producing 8.3 Gb of clean paired-end reads (150 bp in length). All library preparation and sequencing operations were performed by Berry Genomics (Beijing, China). The total amount of raw sequencing data and the corresponding sequencing depth are summarized in Table 1.

RNA sequencing

For RNA short-read sequencing, total RNA was extracted from 1,000 adult females using TRIzol™ Reagent (Thermo Fisher Scientific). Poly(A)-enriched libraries were constructed according to the manufacturer’s instructions using the VAHTS mRNA-seq v2 Library Prep Kit (Vazyme, NR603). Sequencing was performed on the Illumina NovaSeq X Plus platform with a paired-end 150 bp strategy, yielding a total of 7.7 Gb of high-quality clean reads. For long-read RNA sequencing, poly(A)-enriched RNA was isolated using the NEBNext Poly(A) mRNA Magnetic Isolation Module. A Nanopore sequencing library was subsequently prepared using the SQK-PCS109 and SQK-PBK004 kits. Library construction was carried out by BenaGen (Wuhan, China). The purified library was sequenced on an Oxford Nanopore PromethION platform, generating 9.7 Gb of long-read sequences.

Genome survey

First, we performed quality control on the raw Illumina data using BBMap v39.25 [11]. Duplicate reads were removed using the “clumpify.sh” script with default parameters. Subsequently, low-quality reads were filtered out using the “bbduk.sh” script, applying criteria such as base quality (>Q20), minimum read length (>15 bp), poly(A/G/C) trimming (>10 bp), and correction of overlapping paired-end reads.

The genome survey was conducted based on k-mer frequency distribution, with 21-mer frequencies assessed using Jellyfish global 1 [12]. Genome feature analysis was performed with GenomeScope v2.0 [13], setting the maximum k-mer depth threshold to 10,000 and parameters specified as ‘-k 21 -p 2 -m 10000’. The results indicated a predicted genome size of 48.52 Mb and a heterozygosity level of 1.25%. Detailed analysis results are presented in Fig. 1.
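The idea behind a k-mer based size estimate can be illustrated with a much simpler calculation than the mixture model GenomeScope fits. The sketch below is only an illustration, not the procedure used in this study: it assumes a histogram given as a mapping from k-mer depth to the number of distinct k-mers at that depth, and the function name, the `min_depth` cutoff, and the toy numbers are all hypothetical. In a heterozygous genome the tallest peak may be the heterozygous one, which is one reason a model-based tool is preferred in practice.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 5) -> float:
    """Rough genome size estimate from a k-mer histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    min_depth is a hypothetical cutoff below which k-mers are treated as errors.
    """
    # Drop low-depth k-mers, which are dominated by sequencing errors.
    solid = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The depth holding the most distinct k-mers approximates the coverage peak.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences divided by the peak depth approximates genome size.
    total_kmers = sum(depth * count for depth, count in solid.items())
    return total_kmers / peak_depth


# Toy histogram (invented numbers): an error tail plus one peak at depth 20.
toy_histogram = {1: 1000, 2: 300, 19: 10, 20: 100, 21: 10}
toy_size = estimate_genome_size(toy_histogram)
```

With real data the histogram would come from the k-mer counter's two-column depth/count output rather than a hand-written dictionary.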

Fig. 1

GenomeScope genome size estimates for C. lactis.

Genome assembly

High-quality HiFi reads were initially assembled using Hifiasm v0.25.0-r726 [14] with the parameter “-l 3”. To exclude low-depth sequences potentially originating from contamination or sequencing errors, only contigs with a sequencing depth exceeding 9× (approximately 1/10 of the estimated genome coverage) were retained. Subsequently, the assembly was further polished using nextPolish v1.4.1 [15], incorporating both second- and third-generation genomic sequencing data for error correction. To remove redundant sequences from the polished assembly, Purge_dups v1.2.5 [16] was applied. Minimap2 v2.24 [17] was used as the alignment tool to map HiFi reads to the genome (“-cx map-hifi”) and for self-alignment of the genome (“-x asm5 -DP”). Due to the high heterozygosity of the genome, Purge_dups was run with parameters “-2 -a 60”.
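The depth filter described above amounts to a simple threshold on per-contig mean depth. The following is a minimal sketch under stated assumptions: per-contig depths are already available as a dictionary, and the function name and example contig names are hypothetical. Only the 9× cutoff comes from the text.

```python
from typing import Dict


def keep_contigs_by_depth(mean_depths: Dict[str, float], min_depth: float = 9.0) -> Dict[str, float]:
    """Return only the contigs whose mean sequencing depth exceeds min_depth."""
    # "Exceeding 9x" is read here as a strict inequality.
    return {name: depth for name, depth in mean_depths.items() if depth > min_depth}


# Hypothetical per-contig mean depths.
example_depths = {"ctg1": 95.2, "ctg2": 4.1, "ctg3": 9.0}
retained = keep_contigs_by_depth(example_depths)
```

A fixed cutoff tied to the expected coverage is easy to audit, at the cost of possibly discarding genuine low-coverage regions.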

Contigs were scaffolded to the chromosome level based on Hi-C data using Chromap and YaHS. First, Hi-C data were subjected to quality control using Chromap v0.3.0-r509 [18], including read alignment, removal of PCR duplicates, and extraction of Hi-C interaction signals. Subsequently, two rounds of scaffolding were carried out using YaHS v1.2 [19] with default parameters. Following the initial round of scaffolding, the assembly was manually curated using Juicebox v1.11.08 [20], followed by a final round of scaffolding. The sequencing depth of the final genome assembly was assessed using SAMtools v1.16.1 [21], with aligned BAM files generated by minimap2 based on HiFi reads (“-ax map-hifi”) or second-generation whole-genome sequencing data (“-ax sr”). As shown in the Hi-C scaffolding heatmap (Fig. 2), the quality of the chromosome-level scaffolding was excellent, resulting in a total of nine chromosome-level scaffolds.

Fig. 2

Genome-wide chromosomal heatmap of C. lactis, the blue boxes.

Genome completeness was assessed using BUSCO v5.8.3 [22] with the arachnida_odb12 reference database, which contains 1,123 single-copy orthologous genes. To evaluate both the utility of the raw sequencing data and the integrity of the genome assembly, short-read and long-read sequencing data were aligned to the assembled genome using Minimap2, and alignment rates were calculated with SAMtools. Potential contamination in the assembly was screened using MMseqs2 v137 [23] with a sequence identity threshold of 0.8 (--min-seq-id 0.8) against the NCBI nt_prok (bacteria and archaea) database. Additionally, a BLASTN search (BLAST+ v2.12.0 [24]) was performed against the UniVec database and the reference genome of Saccharomyces cerevisiae to specifically detect vector and feed-related contaminants, respectively. Sequences showing over 90% similarity to entries in these databases were considered contaminants and removed from the final assembly. Single-base quality scores (QV values) and k-mer spectra were evaluated using Merqury v1.4 [25]. The final chromosome-level genome assembly of C. lactis is 53.47 Mb, consisting of 12 scaffolds and 121 contigs, with scaffold N50 and contig N50 values of 5.92 Mb and 713.31 kb, respectively. Of these, 118 contigs (accounting for 99.86% of the total length) were anchored onto 9 chromosomes (Fig. 3). The remaining three scaffolds are all short, unplaced nuclear sequences that could not be anchored to a chromosome. Chromosome lengths ranged from 5.21 to 6.71 Mb, with an average GC content of 40.9% and a QV value of 49.4 (Table 2).
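For readers less familiar with the N50 statistic quoted above, the sketch below shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative only and are not taken from the assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the total length is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # Not reached for non-empty input; kept for completeness.


# Toy example with invented lengths.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

Applying the same function separately to contig lengths and scaffold lengths gives the two N50 values usually reported for an assembly.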

Fig. 3

Gene features in the C. lactis genome. The tracks, from innermost to outermost, represent the nine chromosomes (Chr1–Chr9), GC content, GC skew, and gene density.
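The GC content and GC skew tracks in a plot of this kind are typically computed per window along each chromosome. The sketch below uses the standard definitions, GC content = (G + C) / (A + C + G + T) and GC skew = (G − C) / (G + C). The window size and function names are assumptions, since the text does not state how the tracks were generated.

```python
from typing import List, Tuple


def gc_stats(seq: str) -> Tuple[float, float]:
    """Return (GC content, GC skew) for one sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    acgt = a + c + g + t
    gc = g + c
    gc_content = gc / acgt if acgt else 0.0
    gc_skew = (g - c) / gc if gc else 0.0
    return gc_content, gc_skew


def windowed_gc(seq: str, window: int) -> List[Tuple[float, float]]:
    """Compute (GC content, GC skew) in consecutive non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [gc_stats(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence split into two windows of four bases each.
toy_tracks = windowed_gc("GGCCAATT", 4)
```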

Genome annotation

RepeatModeler v2.0.4 [26] was used with the additional LTR search option enabled (-LTRStruct) to construct a species-specific repeat database based on the structural characteristics of repetitive sequences and de novo prediction principles. This database was subsequently combined with the Dfam 3.7 [27] and RepBase-20181026 [28] databases to generate a comprehensive reference database for repeat sequence identification and alignment. Repetitive sequence prediction was carried out using RepeatMasker v4.1.5 [29] with the final constructed repeat database. The results showed that repetitive sequences totaled 5,167,192 bp, accounting for 9.66% of the genome. In the C. lactis genome, the top five categories of repetitive elements were: Unknown (4.46%), LTRs (1.82%), DNA transposons (1.35%), LINEs (0.49%), and SINEs (0.12%). The detailed statistical results are presented in Table 3.
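A repeat fraction such as the one reported above is the number of masked bases divided by the assembly length. As a minimal sketch, and assuming a soft-masked assembly in which repeats appear as lowercase bases (the text does not say whether soft-masking was used), the fraction could be recomputed as follows. The function name and toy sequences are hypothetical.

```python
from typing import Iterable


def masked_fraction(sequences: Iterable[str]) -> float:
    """Fraction of bases that are lowercase (soft-masked) across all sequences."""
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        # Lowercase letters mark repeat-masked positions in a soft-masked FASTA.
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data supplied")
    return masked / total


# Toy input: 6 of 12 bases are soft-masked.
toy_fraction = masked_fraction(["ACGTacgt", "AAaa"])
```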

Two complementary strategies were employed for the annotation of non-coding RNAs (ncRNAs): (1) Sequences were aligned to the Rfam database, a collection of known non-coding RNA families, using Infernal v1.1.5 [30] for the identification and annotation of rRNA, snRNA, and miRNA sequences. (2) tRNA genes were predicted using tRNAscan-SE v2.0.1 [31], which is specifically designed for accurate detection of tRNA sequences in genomic data. The annotation results revealed a total of 214 non-coding RNA genes, including 49 rRNAs, 9 miRNAs, 12 snRNAs, 123 tRNAs, and 3 ribozymes.

Protein-coding gene structure annotation was performed using the MAKER v3.01.04 [32] pipeline, integrating three types of evidence to improve prediction accuracy. The specific workflow included the following steps: (1) Ab initio gene prediction: Gene models were predicted using BRAKER v3.0.3 [33], incorporating both transcriptomic and proteomic evidence to expand the pool of potential coding gene candidates. Transcriptomic evidence was generated by aligning second-generation RNA-Seq reads to the reference genome using HISAT2 v2.2.1 [34], while long-read transcriptome data were aligned using Minimap2. Both alignments produced BAM files as input for downstream analyses. (2) Transcript alignment-based gene structure prediction: StringTie v2.2.1 [35] was used to perform reference-guided assembly of both second- and third-generation transcriptomic data, using the short-read and long-read BAM alignment files generated in step (1) as input. (3) Homology-based gene prediction: Homology comparisons were conducted using known protein sequences from five related Acari species: Panonychus citri (RefSeq: GCF_014898815.1), Tetranychus urticae (RefSeq: GCF_000239435.1), Oppia nitens (RefSeq: GCF_028296485.1), Dermatophagoides farinae (RefSeq: GCF_020809275.1), and D. pteronyssinus (RefSeq: GCF_001901225.1). Protein sequence analysis was carried out using GeMoMa v1.9 [36]. Detailed results of the coding gene predictions are presented in Table 4. A total of 10,330 protein-coding genes were predicted by the MAKER pipeline, with a combined length of 5,902,328 bp and an average gene length of 571.4 amino acids. BUSCO assessment of the predicted protein-coding sequences showed high completeness, with 94.7% of the 1,123 core orthologs identified as complete [including 93.5% single-copy (S) and 1.2% duplicated (D)], while only 2.0% were fragmented (F) and 3.4% missing (M).
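The BUSCO categories quoted above are related by C = S + D, with C, F, and M summing to roughly 100%. The sketch below parses a one-line summary in the compact notation BUSCO commonly prints and checks that relation. The exact summary string is an assumption assembled from the percentages in the text, and the function name is hypothetical.

```python
import re
from typing import Dict


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Parse a compact BUSCO summary such as 'C:..%[S:..%,D:..%],F:..%,M:..%,n:..'."""
    # Capture each single-letter category together with its percentage.
    fields = {key: float(value) for key, value in re.findall(r"([CSDFM]):([\d.]+)%", line)}
    total = re.search(r"n:(\d+)", line)
    if total:
        fields["n"] = float(total.group(1))
    return fields


# Summary line reconstructed from the percentages reported in the text (assumed format).
summary = parse_busco_summary("C:94.7%[S:93.5%,D:1.2%],F:2.0%,M:3.4%,n:1123")
# Complete BUSCOs should equal single-copy plus duplicated, up to rounding.
is_consistent = abs(summary["S"] + summary["D"] - summary["C"]) < 0.05
```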

Gene function annotation was performed by aligning gene sequences to established databases. We used Diamond v2.1.8 [37] to search the UniProtKB database (including SwissProt and TrEMBL) and the NCBI nr database (February 7, 2024) for functional gene information. Subsequently, InterProScan 5.74-105.0 [38] and eggNOG-mapper v2.1.12 [39] were employed to further enrich functional annotations and identify protein domains. The results showed that the five databases annotated 6,989, 9,022, 9,098, 9,621, and 8,116 functional genes, respectively. After removing duplicates, a total of 9,890 genes were functionally annotated, with 6,446 genes being commonly identified across all four databases. Detailed results are presented in Table 4 and visualized in Fig. 4.
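The "after removing duplicates" total and the count of genes shared by every database correspond to a set union and a set intersection over the per-database hit lists. A minimal sketch, using invented gene identifiers and a hypothetical function name:

```python
from typing import Dict, Iterable, Tuple


def summarize_annotations(hits_by_db: Dict[str, Iterable[str]]) -> Tuple[int, int]:
    """Return (genes annotated by at least one database, genes annotated by all)."""
    gene_sets = [set(genes) for genes in hits_by_db.values()]
    if not gene_sets:
        return 0, 0
    # Union: each gene counted once, however many databases hit it.
    annotated_any = set().union(*gene_sets)
    # Intersection: genes with a hit in every database.
    annotated_all = set.intersection(*gene_sets)
    return len(annotated_any), len(annotated_all)


# Toy hit lists with invented gene IDs.
toy_hits = {
    "SwissProt": ["g1", "g2"],
    "nr": ["g1", "g2", "g3"],
    "eggNOG": ["g2", "g4"],
}
toy_any, toy_all = summarize_annotations(toy_hits)
```

The same per-database sets are also what a Venn diagram of the annotation results is drawn from.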

Fig. 4

Venn diagram of functional annotations for C. lactis. Numbers indicate the count of genes in each intersection. The color gradient corresponds to the gene count.

Data Records

The raw sequencing reads and genome assembly have been deposited in the NCBI database under BioProject accession number PRJNA1260190 [40]. The WGS, Hi-C, HiFi, RNA-seq, and ONT data are publicly available under accession number SRP588867 [41]. The complete genome assembly is accessible through NCBI with the accession number GCA_051106175.1 [42]. Additionally, the genome assembly and annotation files have been made publicly available via Figshare [43].

Technical Validation

The quality of the C. lactis genome assembly was assessed using two complementary approaches. First, completeness was evaluated using BUSCO v5.8.3 with the arachnida_odb12 dataset (n = 1,123) (Fig. 5), revealing a total completeness of 94.6% (93.6% single-copy and 1.0% duplicated). Second, base-level accuracy was estimated using Merqury v1.4, yielding a quality value (QV) of 49.9, indicating exceptional sequence fidelity. These results collectively demonstrate that the genome assembly is highly complete and accurate. Additionally, the gene prediction achieved a BUSCO completeness of 94.7% (93.5% single-copy and 1.2% duplicated), further confirming the overall high quality of both the assembly and annotation.
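A QV is a Phred-scaled error probability, QV = −10·log10(P), so a QV near 50 corresponds to roughly one erroneous base per 100,000 bases. The conversion can be sketched as follows; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# Per-base error probability implied by the QV reported in the text.
reported_error_rate = qv_to_error_rate(49.9)
```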

Fig. 5

BUSCO assessment results for the C. lactis genome assembly and annotated protein set.

Data availability

The data presented in this manuscript have not been previously published in any form. The raw sequencing reads and the assembled genome described in this study have been deposited in the National Center for Biotechnology Information (NCBI) databases under BioProject accession number PRJNA1260190. The corresponding genome annotation files are available in the figshare repository (https://doi.org/10.6084/m9.figshare.29815598).

Code availability

No custom scripts were used in this study. All data processing was carried out using standardized pipelines based on the bioinformatics tools described in the Methods section.

References

1. Schatz, H., Behan-Pelletier, V. M., OConnor, B. M. & Norton, R. A. Suborder Oribatida van der Hammen, 1968. In: Zhang, Z.-Q. (Ed.) Animal biodiversity: An outline of higher-level classification and survey of taxonomic richness. Zootaxa. 3148(1), 141–148 (2011).

2. Klimov, P. B. et al. A transitional fossil mite (Astigmata: Levantoglyphidae fam. n.) from the early Cretaceous suggests gradual evolution of phoresy-related metamorphosis. Sci Rep-UK. 11(1), 15113 (2021).

3. OConnor, B. M. Evolutionary ecology of astigmatid mites. Annu Rev Entomol. 27(1), 385–409 (1982).

4. Xiong, Q. et al. Comparative genomics reveals insights into the divergent evolution of astigmatic mites and household pest adaptations. Mol Biol Evol. 39(5), msac097 (2022).

5. Houck, M. A. & OConnor, B. M. Ecological and evolutionary significance of phoresy in the Astigmata. Annu Rev Entomol. 36(1), 611–636 (1991).

6. Seeman, O. D. & Walter, D. E. Phoresy and mites: More than just a free ride. Annu Rev Entomol. 68(1), 69–88 (2023).

7. Aliakbarpour, H. & Fan, Q. H. The genus Carpoglyphus (Acariformes: Carpoglyphidae). Zoosymposia. 22, 231–231 (2022).

8. Hubert, J., Nesvorna, M., Green, S. J. & Klimov, P. B. Microbial communities of stored product mites: variation by species and population. Microb Ecol. 81(2), 506–522 (2021).

9. Hubert, J., Nesvorna, M., Kopecký, J., Ságová‐Marečková, M. & Poltronieri, P. Carpoglyphus lactis (Acari: Astigmata) from various dried fruits differed in associated micro‐organisms. J. Appl. Microbiol. 118(2), 470–484 (2015).

10. Li, C. P., Cui, Y. B., Wang, J., Yang, Q. G. & Tian, Y. Acaroid mite, intestinal and urinary acariasis. World J Gastroentero. 9(4), 874 (2003).

11. Bushnell, B. BBtools. Available online: https://sourceforge.net/projects/bbmap/ (accessed on 5 May 2025) (2014).

12. Lee, S. H. et al. The global spread of jellyfish hazards mirrors the pace of human imprint in the marine environment. Environ Int. 171, 107699 (2023).

13. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 11, 1432 (2020).

14. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods. 18, 170–175 (2021).

15. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics. 36(7), 2253–2255 (2020).

16. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 36, 2896–2898 (2020).

17. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics. 34, 3094–3100 (2018).

18. Zhang, H. et al. Fast alignment and preprocessing of chromatin profiles with Chromap. Nat Commun. 12(1), 6566 (2021).

19. Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: Yet another Hi-C scaffolding tool. Bioinformatics. 39, btac808 (2023).

20. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 3, 95–98 (2016).

21. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience. 10, giab008 (2021).

22. Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol Biol Evol. 38, 4647–4654 (2021).

23. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 35, 1026–1028 (2017).

24. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990).

25. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).

26. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci. 117, 9451–9457 (2020).

27. Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA. 12, 2 (2021).

28. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 6, 11 (2015).

29. Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org (2013/2015).

30. Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).

31. Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol Biol. 1962, 1–14 (2019).

32. Holt, C. & Yandell, M. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 12, 491 (2011).

33. Gabriel, L. et al. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Research 34(5), 769–777 (2024).
