Project description:Data storage in DNA has recently emerged as a promising archival solution, offering space-efficient and long-lasting digital storage. Recent studies suggest leveraging the inherent redundancy of synthesis and sequencing technologies by using composite DNA alphabets. A major challenge of this approach is the noisy inference process, which hinders the use of large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering, in some implementations, a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter consists of a subset of shortmers. We formally define various combinatorial encoding schemes and investigate their theoretical properties. These include information density and reconstruction probabilities, as well as required synthesis and sequencing multiplicities. We then propose an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional (2D) error correction codes, and reconstruction algorithms, under different error regimes. Our simulations show, for example, that the use of 2D Reed-Solomon error correction significantly improves reconstruction rates. We validated our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed the successful reconstruction and established the robustness of our approach for different error types. Subsampling experiments supported the important role of sampling rate and its effect on the overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage and describes some theoretical research questions and technical challenges.
Combining combinatorial principles with error-correcting strategies, and investing in the development of DNA synthesis technologies that efficiently support combinatorial synthesis, can pave the way to efficient, error-resilient DNA-based storage solutions.
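The idea that each letter is a subset of shortmers can be illustrated with a short sketch. It assumes the simplest scheme, in which every letter is a subset of exactly k shortmers drawn from a pool of N, so the alphabet has C(N, k) letters and each letter carries log2 C(N, k) bits. The function names and the values N = 16, k = 5 are illustrative assumptions, not parameters taken from the abstract.

```python
from itertools import combinations, islice
from math import comb, log2


def alphabet_size(n_shortmers: int, subset_size: int) -> int:
    """Number of letters when each letter is a subset of exactly
    `subset_size` shortmers out of `n_shortmers` (assumed scheme)."""
    return comb(n_shortmers, subset_size)


def bits_per_letter(n_shortmers: int, subset_size: int) -> float:
    """Upper bound on the information carried by one combinatorial letter."""
    return log2(alphabet_size(n_shortmers, subset_size))


def letter_for_index(index: int, n_shortmers: int, subset_size: int) -> tuple:
    """Map an integer to the index-th subset in lexicographic order.

    Shortmers are identified by their position 0..n_shortmers-1 in the pool.
    """
    if not 0 <= index < alphabet_size(n_shortmers, subset_size):
        raise ValueError("index outside the combinatorial alphabet")
    subsets = combinations(range(n_shortmers), subset_size)
    return next(islice(subsets, index, None))


if __name__ == "__main__":
    n, k = 16, 5  # hypothetical pool size and subset size
    print(alphabet_size(n, k))               # 4368 letters
    print(round(bits_per_letter(n, k), 2))   # about 12.09 bits per letter
    print(letter_for_index(0, n, k))         # (0, 1, 2, 3, 4)
```

Enumerating subsets with `islice` keeps the sketch short but is linear in the index; a practical encoder would more likely use a direct combinatorial unranking.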
Project description:Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: The storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs, and provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
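The core step, storing only the differences between an aligned read and the reference, can be sketched as follows. This is a minimal illustration under strong assumptions: the alignment position is already known, only substitutions occur (no insertions or deletions), and quality scores and unaligned reads are ignored. The record layout `(position, length, differences)` is hypothetical and is not the format used in the paper.

```python
def encode_read(reference: str, read: str, pos: int):
    """Encode a read aligned at `pos` as (pos, length, substitutions).

    Each substitution is (offset within the read, base observed in the read).
    """
    diffs = [
        (offset, base)
        for offset, base in enumerate(read)
        if reference[pos + offset] != base
    ]
    return pos, len(read), diffs


def decode_read(reference: str, record) -> str:
    """Rebuild the read from the reference and the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTACGT"
    read = "GTAAGT"  # matches reference[2:8] except at offset 3
    record = encode_read(reference, read, pos=2)
    print(record)                          # (2, 6, [(3, 'A')])
    print(decode_read(reference, record))  # GTAAGT
```

A read that matches the reference exactly is stored as a position and a length only, which is why a representation of this kind becomes cheaper per base as reads get longer and more accurate.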
Project description:DNA-based data storage has emerged as a promising method to satisfy the exponentially increasing demand for information storage. However, practical implementation of DNA-based data storage remains a challenge because of the high cost of data writing through DNA synthesis. Here, we propose the use of degenerate bases as encoding characters in addition to A, C, G, and T, which increases the amount of data that can be stored per length of designed DNA sequence (information capacity) and lowers the amount of DNA synthesis required per unit of stored data. Using the proposed method, we experimentally achieved an information capacity of 3.37 bits/character. The demonstrated information capacity is more than twice the highest information capacity previously achieved. The proposed method can be integrated with synthetic technologies in the future to reduce the cost of DNA-based data storage by 50%.
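To see why extra encoding characters raise information capacity, note that an alphabet of q equally likely characters carries at most log2 q bits per character. The sketch below uses the standard IUPAC degenerate-base codes. Whether the paper uses all eleven degenerate symbols is an assumption here, and the function names are illustrative.

```python
from itertools import product
from math import log2

# Standard IUPAC nucleotide codes: each character stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def max_bits_per_character(alphabet) -> float:
    """Upper bound on information capacity for a given set of characters."""
    return log2(len(alphabet))


def expand(sequence: str) -> list:
    """List the concrete sequences present in the mixture that a
    degenerate sequence represents."""
    return ["".join(p) for p in product(*(IUPAC[c] for c in sequence))]


if __name__ == "__main__":
    print(max_bits_per_character("ACGT"))           # 2.0
    print(round(max_bits_per_character(IUPAC), 2))  # 3.91
    print(expand("AR"))                             # ['AA', 'AG']
```

Under this assumption the ceiling is log2 15, roughly 3.91 bits/character, compared with 2 bits/character for A, C, G, and T alone; the 3.37 bits/character reported in the abstract lies between the two.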
Project description:Polypeptides consisting of amino acid (AA) sequences are suitable for high-density information storage. However, the lack of suitable encoding systems that accommodate the characteristics of polypeptide synthesis, storage and sequencing impedes the application of polypeptides for large-scale digital data storage. To address this, two reliable and highly efficient encoding systems, namely the RaptorQ-Arithmetic-Base64-Shuffle-RS (RABSR) and RaptorQ-Arithmetic-Huffman-Rotary-Shuffle-RS (RAHRSR) systems, are developed for polypeptide data storage. The two encoding systems compress data, correct errors of AA chain loss, correct errors within AA chains, eliminate homopolymers, and apply pseudo-randomized encryption. Without arithmetic compression and error correction, the coding efficiency of the RABSR system for audio, pictures and text was 3.20, 3.12 and 3.53 Bits/AA, respectively, whereas the RAHRSR system reached 4.89, 4.80 and 6.84 Bits/AA, respectively. When implemented with redundancy for error correction and arithmetic compression to reduce redundancy, the coding efficiency of the RABSR system for audio, pictures and text was 4.43, 4.36 and 5.22 Bits/AA, respectively. This efficiency further increased to 7.24, 7.11 and 9.82 Bits/AA, respectively, with the RAHRSR system. Therefore, the developed hexadecimal polypeptide-based systems may provide a new scenario for highly reliable and highly efficient data storage.
Project description:Combinatorial chemistry, invented nearly 40 years ago, was welcomed with enthusiasm in the drug research community. The method offered access to a practically unlimited number of new compounds. The new compounds, however, are mixtures, and methods had to be developed for the identification of the bioactive components. This was one of the reasons why the method could not provide the expected cornucopia of new drugs. Among the different screening methods, two approaches seem to offer the best results. One of them is based on the intrinsic property of the combinatorial split and pool solid-phase synthesis: One compound forms on each bead of the solid support. Different methods have been developed to encode the beads and identify the structure of compounds formed on them. The most important method applies DNA oligomers for encoding. As a second approach in screening, DNA-encoded combinatorial libraries are synthesized omitting the solid support and the mixtures are screened in solution using affinity binding methods. Libraries containing billions and even trillions of components have been synthesized and successfully tested, leading to the identification of a significant number of new leads.
Project description:DNA is a natural storage medium with the advantages of high storage density and long service life compared with traditional media. DNA storage can meet the current storage requirements for massive data. Owing to the limitations of DNA storage technology, the data need to be converted into short DNA sequences for storage. However, in the process, a large amount of physical redundancy is generated to index the short DNA sequences. To reduce redundancy, this study proposes a DNA storage encoding scheme with hidden addressing. Using an improved fountain encoding scheme, the index replaces part of the data to realize hidden addresses, and a 10.1 MB file is then encoded with the hidden addressing. First, the Dottup dot plot generator and the Jaccard similarity coefficient are used to analyze the overall self-similarity of the encoding sequence index, and then the GC content of sequence fragments is used to verify the performance of this scheme. The final results show that the encoding scheme produces indexes with lower overall self-similarity and sequences with better local thermodynamic properties. The proposed hidden addressing encoding scheme can not only improve the utilization of bases but also ensure the accuracy of DNA storage during the sequencing and decoding processes.
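The two sequence metrics named above can be sketched in a few lines. The abstract does not say how the Jaccard coefficient is applied, so comparing the sets of k-mers of two sequences is an assumption made here for illustration; the function names and the choice k = 2 are likewise hypothetical.

```python
def kmers(sequence: str, k: int) -> set:
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(seq_a: str, seq_b: str, k: int) -> float:
    """Jaccard coefficient |A & B| / |A | B| of the two k-mer sets."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


if __name__ == "__main__":
    print(jaccard("ACGT", "CGTA", k=2))  # 0.5 (2 shared k-mers out of 4)
    print(gc_content("GGCCAATT"))        # 0.5
```

Lower Jaccard values between index sequences indicate less self-similarity, and a GC content near 0.5 within each fragment is commonly associated with more stable synthesis and sequencing behavior.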
Project description:DNA-based data storage is an emerging nonvolatile memory technology of potentially unprecedented density, durability, and replication efficiency. The basic system implementation steps include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders. Here we show for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. The novelty of our approach is to design an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. Our work represents the only known random access DNA-based data storage system that uses error-prone nanopore sequencers, while still producing error-free readouts with the highest reported information rate/density. As such, it represents a crucial step towards practical employment of DNA molecules as storage media.