Dataset Information

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints.


ABSTRACT: Synthetic DNA is rapidly emerging as a durable, high-density information storage platform. A major challenge for DNA-based information encoding strategies is the high rate of errors that arise during DNA synthesis and sequencing. Here, we describe the HEDGES (Hash Encoded, Decoded by Greedy Exhaustive Search) error-correcting code that repairs all three basic types of DNA errors: insertions, deletions, and substitutions. HEDGES also converts unresolved or compound errors into substitutions, restoring synchronization for correction via a standard Reed-Solomon outer code that is interleaved across strands. Moreover, HEDGES can incorporate a broad class of user-defined sequence constraints, such as avoiding excess repeats, or too high or too low windowed guanine-cytosine (GC) content. We test our code both via in silico simulations and with synthesized DNA. From its measured performance, we develop a statistical model applicable to much larger datasets. Predicted performance indicates the possibility of error-free recovery of petabyte- and exabyte-scale data from DNA degraded with as much as 10% errors. As the cost of DNA synthesis and sequencing continues to drop, we anticipate that HEDGES will find applications in large-scale error-free information encoding.
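The sequence constraints named in the abstract (excess repeats, windowed GC content that is too high or too low) can be illustrated with a small checker. This is a minimal sketch and not the HEDGES encoder or decoder: the function names and the thresholds (`max_run=3`, `window=12`, `gc_min=0.35`, `gc_max=0.65`) are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import List


def max_run_length(seq: str) -> int:
    """Return the length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    previous = None
    for base in seq:
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def windowed_gc_fractions(seq: str, window: int) -> List[float]:
    """Return the GC fraction of every full window, sliding one base at a time."""
    if window <= 0 or window > len(seq):
        return []
    return [
        sum(base in "GC" for base in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def satisfies_constraints(
    seq: str,
    max_run: int = 3,      # assumed limit on homopolymer run length
    window: int = 12,      # assumed GC window size, in bases
    gc_min: float = 0.35,  # assumed lower bound on windowed GC fraction
    gc_max: float = 0.65,  # assumed upper bound on windowed GC fraction
) -> bool:
    """Check a strand against the illustrative run-length and GC limits."""
    if max_run_length(seq) > max_run:
        return False
    return all(
        gc_min <= fraction <= gc_max
        for fraction in windowed_gc_fractions(seq, window)
    )
```

For example, `satisfies_constraints("ACGTACGTACGTACGT")` passes both checks, while a strand beginning with `AAAAA` fails the run-length limit. A constraint-aware encoder would apply a test of this kind while choosing each base rather than after the fact; the checker only shows what the constraints mean.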

SUBMITTER: Press WH 

PROVIDER: S-EPMC7414044 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature

Publications

HEDGES error-correcting code for DNA storage corrects indels and allows sequence constraints.

William H. Press, John A. Hawkins, Stephen K. Jones, Jeffrey M. Schaub, Ilya J. Finkelstein

Proceedings of the National Academy of Sciences of the United States of America, 2020-07-16, issue 31


Similar Datasets

S-EPMC4498232 | biostudies-literature
S-EPMC10075190 | biostudies-literature
S-EPMC8776549 | biostudies-literature
S-EPMC3853030 | biostudies-literature
S-EPMC5409172 | biostudies-literature
S-EPMC6007263 | biostudies-literature
S-EPMC5503945 | biostudies-literature
S-EPMC9116035 | biostudies-literature
PRJEB32885 | ENA
S-EPMC6662698 | biostudies-literature