Abstract: DNA has emerged as a promising alternative for long-term data storage due to its high capacity, durability, and low-energy potential. However, storing data in DNA presents several challenges. First, it requires complex and costly biochemical processes, making efficient compression crucial to reducing DNA synthesis time and cost. Second, these processes are prone to errors that must be avoided and/or corrected. In particular, homopolymers (repetitions of the same nucleotide) are a well-known source of errors during the sequencing step. Avoiding such repetitions helps mitigate errors but introduces a constraint that may increase the data compression rate. In this paper, we propose two transcoding methods that address these two key challenges: reducing data rate and minimizing errors. The first method strictly enforces the error-minimization constraint by eliminating homopolymers of a certain length, at the cost of an increased data rate. In contrast, the second method preserves the data rate but accepts a slight increase in homopolymers. We show that both increases remain limited (a 2.14% increase in compression rate for the first method and a 0.39% homopolymer rate for the second). These two approaches demonstrate that it is possible to efficiently constrain transcoding while balancing error minimization and compression performance.
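To make the homopolymer constraint concrete, the following minimal Python sketch locates runs of a repeated nucleotide and reports the share of a sequence they cover. It is an illustration only, not the transcoding methods proposed in the paper: the function names, the `min_len=4` threshold, and the definition of "homopolymer rate" used here (fraction of nucleotides inside qualifying runs) are all assumptions, since the abstract does not specify them.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (nucleotide, start, length) for every run of identical
    nucleotides whose length is at least min_len.

    min_len=4 is an assumed threshold, not a value taken from the paper.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)  # size of this run of identical bases
        if length >= min_len:
            runs.append((base, pos, length))
        pos += length
    return runs


def homopolymer_rate(seq, min_len=4):
    """Fraction of nucleotides lying inside runs of length >= min_len.

    This definition is hypothetical; the paper may measure the rate differently.
    """
    if not seq:
        return 0.0
    covered = sum(length for _, _, length in homopolymer_runs(seq, min_len))
    return covered / len(seq)
```

For example, `homopolymer_runs("ACGTTTTTACGAAAAC")` finds a run of five `T` starting at index 3 and a run of four `A` starting at index 11, so 9 of the 16 nucleotides fall inside homopolymers under this assumed definition.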
| Subjects: | Other Quantitative Biology (q-bio.OT); Multimedia (cs.MM) |
| Cite as: | arXiv:2511.14771 [q-bio.OT] |
| (or arXiv:2511.14771v1 [q-bio.OT] for this version) | |
| https://doi.org/10.48550/arXiv.2511.14771 arXiv-issued DOI via DataCite | |
| Journal reference: | 2025 IEEE International Conference on Image Processing (ICIP), Sep 2025, Rennes, France |
Submission history
From: Sara AL SAYYED [via CCSD proxy] [v1] Tue, 7 Oct 2025 08:22:32 UTC (1,897 KB)