High-capacity DNA data storage with variable-length oligonucleotides using repeat-accumulate code and hybrid mapping

Background: With its inherent high density and durable preservation, DNA has recently been recognized as a distinguished medium for storing enormous amounts of data over millennia. To overcome the limitations of a recently reported high-capacity DNA data storage approach while achieving a competitive information capacity, we explored a new coding system that facilitates the practical implementation of high-capacity DNA data storage. Results: In this work, we devised and implemented a DNA data storage scheme with variable-length oligonucleotides (oligos), in which a hybrid mapping scheme that converts digital data to DNA records is introduced. The encoded DNA oligos store 1.98 bits per nucleotide (bits/nt) on average, approaching the upper bound of 2 bits/nt, while conforming to the biochemical constraints. Beyond that, an oligo-level repeat-accumulate coding scheme is employed to address data loss and corruption in the biochemical processes. In a wet-lab experiment, error-free retrieval of 379.1 KB of data was achieved with a minimum coverage of 10x, validating the error resilience of the proposed coding scheme. In addition, theoretical analysis shows that the proposed scheme exhibits a net information density (user bits per nucleotide) of 1.67 bits/nt while achieving 91% of the information capacity. Conclusion: To advance towards practical implementation of DNA storage, we proposed and tested a DNA data storage system that combines a high-potential mapping (bits-to-nucleotide conversion) scheme with a low-redundancy yet highly efficient error correction code design. The reported advancement moves us closer to a practical high-capacity DNA data storage system.


S1. Detailed encoded source file and data structure
We encoded 379,050 bytes of source data, with file sizes ranging from 1.8 KB to 224.5 KB, into 12,000 oligos, among which 11,400 oligos carry the user data and 600 oligos carry the redundancy that ensures error resilience. There are six source files: five images and one text file. The five images are the Merlion (the official mascot of Singapore) (https://en.wikipedia.org/wiki/Merlion), the national flower of Singapore (https://www.nhb.gov.sg/what-we-do/our-work/community-engagement/education/resources/national-symbols), the national proclamation of Singapore, the logo of Nanyang Technological University, Singapore (NTU), and the logo of National University of Singapore (NUS); the text file, containing the Singapore national pledge in four languages, is also included (Supplementary S1 Fig. 1).
Supplementary S1 Fig. 1: Source files of the stored data. Five images and one text file are encoded from binary streams into 11,400 binary sequences and then transferred into 11,400 DNA oligos. Each file is encoded into an oligonucleotide (oligo) set of a different size.
To ensure that each encoded DNA oligo stays under the length limit of current synthesis techniques (200 nt), the length of the encoded DNA sequences was set to around 150 nt after excluding 40 nt for two 20-nt primer sites. Given the theoretical mapping potential (the number of bits encoded in one nucleotide) of 2 bits/nt, the length of the binary sequence was thus set to 300 bits. As we planned to store data in a number of oligos on the order of 10^4, we set aside 14 of the 300 bits for addressing distinct oligos (14 > log2 10^4). Moreover, we set aside another 20 bits for a cyclic redundancy check (CRC) for each 300-bit sequence.
This leaves 266 bits of each sequence for encoding user data. We segmented the user data into 11,400 binary sequences, each a packet of 266 bits. The repeat-accumulate (RA) code was then applied to these user sequences to generate redundant (parity) packets. After that, each binary packet was appended with a 14-bit address for ordering the position of each sequence within the whole user data set, and a 20-bit CRC was added for detecting errors occurring within the sequence. Next, all the binary packets, each 300 bits long in total, were mapped to DNA sequences according to the hybrid mapping strategy, resulting in DNA sequences with lengths ranging from 150 nt to 159 nt. Before sending the sequences for oligo synthesis, each DNA sequence was attached with the common binding sites 5'-ATACCCAAGGGTAAACAGCG-3' and 5'-GCGGTTTCCAACCGGTAATA-3' at the start and end, respectively. Finally, each synthesized oligo had a length ranging from 190 nt to 199 nt. The structure of the data sequence and oligo sequence is shown in Supplementary S1 Fig. 2.
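As a concrete illustration, the per-packet layout described above (266 user bits, a 14-bit address, and a 20-bit CRC, for 300 bits in total) can be sketched as follows. The CRC polynomial used here is an arbitrary placeholder, not the one used in the actual experiment:

```python
# Sketch of the 300-bit packet layout: 266 user bits + 14-bit address
# + 20-bit CRC. The crc20 polynomial below is an illustrative
# placeholder, not the polynomial used in the wet-lab experiment.

def crc20(bits: str, poly: int = 0x18359F) -> str:
    """Toy LFSR-style 20-bit CRC over a bit string (placeholder polynomial)."""
    reg = 0
    for b in bits:
        reg = (reg << 1) | int(b)
        if reg >> 20:          # top bit set: reduce by the polynomial
            reg ^= poly
    return format(reg, "020b")

def build_packet(user_bits: str, index: int) -> str:
    assert len(user_bits) == 266
    addr = format(index, "014b")   # 14-bit address (2^14 > 10^4 oligos)
    body = user_bits + addr
    return body + crc20(body)      # 300-bit packet in total

packet = build_packet("01" * 133, index=5)
assert len(packet) == 300
```

On the decoding side, the receiver recomputes the CRC over the first 280 bits and discards (treats as erased) any packet whose check fails.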

S2. RA code design
In this work, we consider a systematic repeat-accumulate (RA) code as the error protection coding scheme. The rationale behind choosing RA codes is their low encoding complexity: RA codes can be encoded by the serial concatenation of a repetition code, an interleaver, a combiner, and an accumulator, where the encoding complexity grows linearly with the code length (1,2). In general, an RA code can be represented by a bipartite graph, which consists of two types of variable nodes (namely, information bit nodes and parity bit nodes) and one type of check nodes. The Tanner graph representation of an RA code is shown in Supplementary S2 Fig. 3, and the ensemble is characterized by the following degree distributions:
F(x) = Σ_{i=2..I} f_i x^i, where f_i is the fraction of information bit nodes of degree i;
H(x) = Σ_{j=1..J} h_j x^j, where h_j is the fraction of check nodes of degree j + 2;
λ(x) = Σ_{i=2..I} λ_i x^(i−1), where λ_i is the fraction of edges that are connected to degree-i information bit nodes;
ρ(x) = Σ_{j=1..J} ρ_j x^(j−1), where ρ_j is the fraction of edges that are connected to degree-(j + 2) check nodes;
where I and J correspond to the maximum degrees of the variable nodes and check nodes, respectively. For the above-mentioned degree distributions, the code rate becomes (2)
R = (Σ_i λ_i/i) / (Σ_i λ_i/i + Σ_j ρ_j/j).
Instead of conventional bit-level coding, we consider packet-level coding. As described earlier, we segment a large digital file into smaller packets of the same size. We consider these packets as source packets, which will be used to generate redundant or parity packets.
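Assuming the standard edge-perspective relation R = (Σ_i λ_i/i) / (Σ_i λ_i/i + Σ_j ρ_j/j), the code rate follows directly from the degree distributions. A minimal sketch, with illustrative toy ensembles rather than the designed code:

```python
# Code rate of an RA ensemble from edge-perspective degree distributions,
# using R = (sum_i lam[i]/i) / (sum_i lam[i]/i + sum_j rho[j]/j),
# where rho[j] refers to check nodes of total degree j+2 (j info edges).
# The ensembles below are toy examples, not the designed code.

def ra_rate(lam: dict, rho: dict) -> float:
    vi = sum(l / i for i, l in lam.items())   # proportional to # info nodes
    ci = sum(r / j for j, r in rho.items())   # proportional to # check nodes
    return vi / (vi + ci)

# all info nodes of degree 3, all checks taking 3 info edges -> rate 1/2
print(round(ra_rate({3: 1.0}, {3: 1.0}), 3))   # -> 0.5

# all info nodes of degree 3, all checks taking 57 info edges -> rate 0.95
print(round(ra_rate({3: 1.0}, {57: 1.0}), 3))  # -> 0.95
```

The second toy ensemble happens to reproduce the rate 0.95 used later in this work, though the actual designed distributions are irregular.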
Note that every packet incorporates a CRC to detect errors within the packet. We consider the packets that pass the CRC test as correctly recovered and the rest as erased.
Thus, the overall code design problem for DNA storage becomes a code design problem for the erasure channel. Let δ be the probability that an oligo is erroneous or lost after the synthesis and sequencing processes. To ensure high reliability and robustness, in practice the code is designed for a slightly higher drop-out probability than the actual one. Hence, we design the error protection code based on the drop-out rate of oligos in the synthesis and sequencing processes (3). We now present the density evolution analysis for the RA code and then the optimization procedure to obtain a capacity-achieving RA code for a given δ.

A. Density Evolution
We present the density evolution of the RA code, which can be used to obtain the asymptotic threshold of a given RA ensemble. Let x^(ℓ) and y^(ℓ) be the erasure probabilities of the messages outgoing from the information nodes and the parity nodes, respectively, to the check nodes at iteration ℓ.
According to the density evolution analysis, we obtain the following update equations:
x^(ℓ+1) = δ Σ_i λ_i [1 − (1 − y^(ℓ))^2 Σ_j ρ_j (1 − x^(ℓ))^(j−1)]^(i−1),
y^(ℓ+1) = δ [1 − (1 − y^(ℓ)) Σ_j ρ_j (1 − x^(ℓ))^j],
where x^(0) = y^(0) = δ. From the above recursions, one can find the asymptotic threshold δ* by the following expression:
δ* = sup{δ ∈ [0, 1] : x^(ℓ) → 0 as ℓ → ∞}.
We then optimize the RA code while utilizing the above density evolution equations [2,3].
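The density evolution recursion can be run numerically to estimate the threshold of a candidate ensemble. The sketch below assumes the standard packet-erasure update equations (x = info-to-check erasure probability, y = parity-to-check erasure probability, both initialized to δ) and bisects on δ; the ensemble used is a toy example:

```python
# Numerical density evolution for an RA ensemble over the erasure channel.
# lam / rho are edge-perspective degree distributions; rho[j] refers to
# checks with j info edges (total degree j+2, counting the two
# accumulator edges). Toy ensemble for illustration only.

def de_threshold(lam, rho, iters=2000, tol=1e-9):
    """Estimate the asymptotic threshold delta* by bisection on delta."""
    def converges(delta):
        x = y = delta
        for _ in range(iters):
            # check->info message: j-1 other info edges + 2 parity edges
            q_info = 1 - (1 - y) ** 2 * sum(
                r * (1 - x) ** (j - 1) for j, r in rho.items())
            # check->parity message: all j info edges + the other parity edge
            q_par = 1 - (1 - y) * sum(
                r * (1 - x) ** j for j, r in rho.items())
            x = delta * sum(l * q_info ** (i - 1) for i, l in lam.items())
            y = delta * q_par
        return x < tol
    lo, hi = 0.0, 1.0
    for _ in range(30):                     # bisection on delta
        mid = (lo + hi) / 2
        if converges(mid):
            lo = mid
        else:
            hi = mid
    return lo

print(round(de_threshold({3: 1.0}, {3: 1.0}), 3))
```

For the rate-1/2 toy ensemble above, the estimated threshold lies strictly below the erasure-channel capacity limit 1 − R = 0.5, as expected.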

B. Optimization
For simplicity, we first fix the check node degree distribution ρ(x). Then we search for a variable node degree distribution λ(x) such that the resultant code becomes capacity-achieving for the target erasure probability δ. With ρ(x) fixed, the rate of the RA code can be maximized by maximizing Σ_i λ_i/i. Thus, we obtain the following optimization problem:
maximize Σ_i λ_i/i,
subject to λ_i ≥ 0, Σ_i λ_i = 1, and the density evolution constraint that x^(ℓ) → 0 as ℓ → ∞ when x^(0) = y^(0) = δ.
Finally, based on the drop-out rate of oligos in the synthesis and sequencing processes reported in (3), i.e., a drop-out rate of 1.3%, we designed the RA code such that the resultant code exhibits an asymptotic threshold higher than the drop-out probability 0.013. Following the optimization procedure described above, we designed an RA code of rate 0.95, which gives an asymptotic threshold of 0.0475. Thus, the resultant code shows a gap of only 0.0025 from the erasure-channel capacity limit of 1 − R = 0.05.

S3. Mapping potential and Information density
We compare the mapping potential and the information density of our scheme with those of existing schemes reported in (3)(4)(5)(6)(7)(8)(9). The mapping potential (also known as coding potential) is defined as the number of bits encoded in one nucleotide, and the net information density is defined as the number of user bits encoded in one nucleotide (3). The net information density can be calculated as the ratio of the total user data bits to the total nucleotides of the encoded DNA fragments. Alternatively, it can be estimated as
NetInfo (bits/nt) = R_c · (L_i / L_P) · b,
where R_c is the code rate of the outer code, L_i is the length of the information bits, L_P is the overall packet length, and b is the mapping potential.
To verify that the proposed coding scheme maintains high capacity with large data sizes, we assume encoding of data of large sizes, as in some of the existing works, in silico (3,8,9). The number of bits used for indexing should satisfy L_id ≥ log2(N_data/R_c) − log2(L_P − L_CRC − L_id), where L_id is the index length needed to address all encoded packets, including user packets and redundancy packets, with N_data/R_c bits in total. Note that for each data size magnitude, the packet length L_P is fixed to 300 bits to comply with the limitation of current synthetic biology (a maximum of ~160 nt excluding primer binding sites), and the CRC length L_CRC is fixed to 20 bits to detect errors in the 300-bit packets. Thus, to encode 2 MB of data as in (3), we need 16 bits of indexing to address all encoded oligos. Therefore, we obtain an estimated net information density of approximately 1.66 bits/nt (with L_i = 300 − 16 − 20 = 264).
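The quoted densities can be re-derived from the formula above. The arrangement of terms, NetInfo = R_c · (L_i/L_P) · b, is our reading of the (garbled) expression, cross-checked against the quoted 1.67 bits/nt:

```python
# Recomputing the net information density figures from
# NetInfo = R_c * (L_i / L_P) * b, with R_c = 0.95 (outer code rate),
# b = 1.98 bits/nt (mapping potential), and L_P = 300 bits.
import math

def net_info(R_c: float, L_i: int, L_P: int, b: float) -> float:
    """Net information density in user bits per nucleotide."""
    return R_c * (L_i / L_P) * b

# this work: L_i = 300 - 14 (index) - 20 (CRC) = 266
print(round(net_info(0.95, 266, 300, 1.98), 2))   # -> 1.67

# in-silico 2 MB case: check that a 16-bit index satisfies the inequality
N_data = 2_000_000 * 8                 # 2 MB of user data, in bits
L_id, L_CRC, L_P = 16, 20, 300
assert L_id >= math.log2(N_data / 0.95) - math.log2(L_P - L_CRC - L_id)
print(round(net_info(0.95, L_P - L_id - L_CRC, 300, 1.98), 2))  # -> 1.66
```

Note that the inequality check treats 2 MB as 2 x 10^6 bytes; with 2^21 bytes the required index length would marginally exceed 16 bits.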

A. A pseudocode-based algorithm
A pseudocode-based algorithm for the proposed interleaved mapping is shown in

B. An example of the interleaved mapping
We now explain this algorithm using the example binary sequence '01111111110001010111'. The direct binary-to-quaternary mapping of this sequence violates the homopolymer run constraint (the repeated '11' pairs map to four identical nucleotides in a row); thus, the sequence is sent to the interleaver. A variable with an initial value of 0 is set to indicate the trial number of interleaving, which is also used to select the flag nucleotide for each valid output DNA sequence. One symbol from the set {A, T, G, C}, corresponding to the trial number in {1, 2, 3, 4}, is appended at the end of the interleaved sequence and acts as a flag representing the interleaving pattern. We then check whether the resultant sequence meets the homopolymer run constraint. In this example, we assume the first iteration of interleaving is able to break the homopolymer run, so we append 'A' to indicate to the decoder that the original DNA sequence was interleaved with the first pattern. However, in the case of failing to eliminate homopolymers within four trials, the original binary sequence is sent to the second mapping method, namely the variable-length constrained (VLC) mapping.
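The interleaving-with-flag procedure above can be sketched as follows. The 2-bit-to-nucleotide table, the permutation patterns, and the handling of a sequence that needs no interleaving are illustrative assumptions; the actual scheme fixes four 150-symbol patterns (see S8):

```python
# Sketch of the interleaved-mapping step. NT and the patterns are
# placeholder assumptions; the real scheme uses fixed 150-symbol
# interleaving permutations and a specific 2-bit -> nucleotide table.

NT = {"00": "A", "01": "T", "10": "G", "11": "C"}   # assumed table
FLAG = ["A", "T", "G", "C"]                         # flag for trials 1..4

def has_long_homopolymer(seq: str, limit: int = 3) -> bool:
    """True if any nucleotide repeats more than `limit` times in a row."""
    run = 1
    for a, b in zip(seq, seq[1:]):
        run = run + 1 if a == b else 1
        if run > limit:
            return True
    return False

def interleaved_map(bits: str, patterns):
    seq = "".join(NT[bits[i:i + 2]] for i in range(0, len(bits), 2))
    if not has_long_homopolymer(seq):
        return seq                          # assumed: no flag needed
    for t, perm in enumerate(patterns):     # trials 1..4
        cand = "".join(seq[p] for p in perm)
        if not has_long_homopolymer(cand):
            return cand + FLAG[t]           # flag marks the pattern used
    return None                             # fall back to VLC mapping
```

Returning `None` models the hand-off to the VLC mapping when all four trials fail to break the homopolymer run.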

A. Variable-length constrained sequence code
Constrained codes have been widely used in constrained systems, such as optical memory systems and digital communication systems, to satisfy specific channel constraints including the run-length limit, the dc-free constraint, etc. (12). In (13), the variable-length constrained sequence code (VCSC) was devised with high code rate and low complexity, outperforming most block codes. Specifically, it was reported that the code efficiency of VCSC approaches 99%, where the efficiency is the ratio of the code rate to the capacity of the constraint, and the capacity of the constraint is defined as (14)
C = lim_{n→∞} (1/n) log2 N(n),   [5]
where N(n) represents the number of constraint-satisfying sequences of length n. In this work, we harness this code as a capacity-approaching mapping strategy for constructing sequences that satisfy the homopolymer constraint of DNA-based storage. The rationale behind VCSC is that it maps a set of variable-length source words to a set of variable-length codewords, where both the source words and the codewords are prefix-free to ensure instantaneous encoding and decoding (13).
A general procedure to construct a capacity-approaching VCSC can be summarized as follows (15): i) The complete or incomplete minimal sets are constructed based on the finite-state transition diagram (FSTD) of a specific constrained system, where the elements, and the concatenations of elements, in the minimal set all satisfy the constraints of the system. ii) Candidate codeword sets are constructed based on the full or partial extension of the minimal set with a predefined depth or width, where extension means the concatenation of elements. iii) The optimal mapping between the variable-length source words and the variable-length codewords of each candidate codeword set is determined via the normalized geometric Huffman (NGH) technique, and the mapping with the highest average code rate is selected.
Note that with binary i.i.d. sources, the average code rate depends only on the lengths of the source words and codewords, and is calculated as
R = (Σ_k 2^(−s_k) · s_k) / (Σ_k 2^(−s_k) · c_k),
where s_k and c_k represent the lengths of the k-th pair of source word and codeword, respectively.
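With binary i.i.d. inputs, a prefix-free source word of length s occurs with probability 2^(−s), so the average code rate can be computed from the length pairs alone. A sketch with a hypothetical toy mapping:

```python
# Average code rate of a variable-length mapping under a binary i.i.d.
# source: a source word of length s has probability 2**-s, so
#   R = sum(2**-s * s) / sum(2**-s * c)
# over the (s, c) length pairs. The pairs below are a toy example,
# not the codebook used in this work.

def avg_code_rate(pairs) -> float:
    num = sum(2.0 ** -s * s for s, c in pairs)   # mean source bits
    den = sum(2.0 ** -s * c for s, c in pairs)   # mean code symbols
    return num / den

# toy prefix-free sources {0, 10, 11} (lengths 1, 2, 2) mapped to
# codewords of lengths 1, 3, 3
print(round(avg_code_rate([(1, 1), (2, 3), (2, 3)]), 3))   # -> 0.75
```

The rate is in source bits per code symbol; for a quaternary (nucleotide) codeword alphabet this is directly the bits/nt figure of the mapping.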

B. The (4,0,2) run-length limit (RLL) DNA storage system
DNA-based storage can be considered a constrained system with a run-length limit, because the maximum homopolymer run of this system is limited to 3 repeated nucleotides.

S6. An example of variable-length constrained mapping
We use an example to illustrate the VLC mapping strategy. Suppose a binary user sequence is '1100-00-00-1101-01-111100-0'; we use the look-up table to encode the sequence into '01-1-1-02-2-001-1'. Notice that the last bit of the user data has no corresponding codeword, so we assign the codeword whose corresponding source word has the minimal suffix length with this bit as a prefix; here we map it to '1' (the alternative is mapping to '2'). It is worth mentioning that the choice of mapping for the last segment does not affect decoding, as the last bit de-mapped from either codeword is the same. Then we precode this constrained code into an RLL code with the change-of-state function y = y_{−1} + x (mod M), where y is the current output precoded symbol, y_{−1} is the last output precoded symbol, x is the current input symbol, and M is the alphabet size of the system (12). Note that the initial state is set to 0. Following this function, we precode along the code and finally obtain '01-2-3-31-3-330-1'. At last, we convert the quaternary symbols to nucleotide symbols and thus get the DNA oligo sequence 'AT-C-G-GT-G-GGA-T'.
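The precoding step of this example can be verified in a few lines; the quaternary-to-nucleotide assignment 0→A, 1→T, 2→C, 3→G is inferred from the example itself:

```python
# Reproducing the precoding step of the worked example:
# y_n = y_{n-1} + x_n (mod 4), initial state 0, then quaternary symbols
# mapped to nucleotides as 0->A, 1->T, 2->C, 3->G (inferred mapping).

def precode(symbols, M: int = 4):
    """Change-of-state precoder: running sum of inputs modulo M."""
    y, out = 0, []
    for x in symbols:
        y = (y + x) % M
        out.append(y)
    return out

constrained = [0, 1, 1, 1, 0, 2, 2, 0, 0, 1, 1]   # '01-1-1-02-2-001-1'
rll = precode(constrained)
print(rll)                                         # -> [0,1,2,3,3,1,3,3,3,0,1]
oligo = "".join("ATCG"[y] for y in rll)
print(oligo)                                       # -> ATCGGTGGGAT
```

The output matches the '01-2-3-31-3-330-1' sequence and the oligo 'AT-C-G-GT-G-GGA-T' in the text, and the longest homopolymer run (GGG) stays within the 3-nt limit.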

S7. Additional information on the comparison table
We compare our scheme with other existing schemes that used the oligo-pool storage format in terms of the following aspects in Supplementary Table 2: • Mapping potential: the capacity of a nucleotide to encode binary bits.
• Redundancy: the ratio of the total synthesized oligos to the oligos encoding user data.
• Robustness to dropouts: the strategies used to resist dropouts.
• Error correction/ detection: ability to detect or correct errors that occurred in the processes of synthesis and sequencing.
• Full recovery: whether the decoder recovered all source data correctly.

S8. Interleaved patterns
We use four interleaving patterns for the interleaved mapping, the first of which is the default pattern. Each interleaver has a length of 150, as each DNA fragment mapped via the binary-to-quaternary mapping is 150 nt long (300/2).