DNA can hold data securely, but we need ways to quickly find the data to read it, says S. Ananthanarayanan.
DNA, the molecular code found in the nuclei of living cells, is the most compact and hardy data storage system we have encountered. The nucleus of a microscopic cell contains a giant molecule, billions of units long and compactly folded, carrying the code for all the proteins that the cells of an organism produce, and hence for the characteristics of the organism itself. The code survives essentially unaltered through the many times that cells divide, and DNA has been recovered reasonably intact from frozen animal remains that are thousands of years old.
It has been estimated that a gram of DNA could store 215 petabytes, or 215 million gigabytes, of data. In comparison, the best computer data recording systems store data in terabytes, or thousands of gigabytes, and may weigh half a kilogram. Computer storage devices also degrade with age and, what is worse, the technology used to create or read data from them keeps getting outdated. An efficient method to write digital data into a DNA-like molecular record, and then retrieve it, would be a vast improvement in computing and data management.
While methods to write digital data into the DNA molecule have been developed, reading a reasonable quantity of data, once it is stored inside a DNA molecule, is still a complicated process. This is the feature that stands in the way of DNA storage growing into a practical strategy for preserving and handling large data. Lee Organick, Siena Dumas Ang, Yuan-Jyue Chen, Randolph Lopez, Sergey Yekhanin, Konstantin Makarychev, Miklos Z Racz, Govinda Kamath, Parikshit Gopalan, Bichlien Nguyen, Christopher N Takahashi, Sharon Newman, Hsing-Yeh Parker, Cyrus Rashtchian, Kendall Stewart, Gagan Gupta, Robert Carlson, John Mulligan, Douglas Carmean, Georg Seelig, Luis Ceze and Karin Strauss, from the University of Washington and Microsoft Research, Washington, describe in their paper in the journal Nature Biotechnology an orders-of-magnitude improvement in the recording, and in the speed of retrieval, of sizeable data from a DNA record.
The DNA molecule consists of a pair of chains, or backbones, of units called nucleotides or bases, with ‘side chains’ of four kinds of molecular groups along its length. Information is coded by the sequence of the four kinds of side chain groups, which are called C, G, A and T. In DNA, groups of three consecutive bases, each with one of the four forms of side chain, code for the twenty amino acids, the building blocks of proteins. A series of triads thus codes for a series of amino acids, and hence for a particular protein. There is a rule that a C can pair only with a G in the complementary chain, and an A only with a T. This rule fixes the order of bases in one chain once the order in the other is set, and it is what enables either chain to create a fresh complementary chain when the chains separate in cell division.
Digital data, which consists of series of ‘1’s and ‘0’s, can be coded in a similar fashion in a chain that fits into the DNA structure. Methods have been developed to synthesize such chains, and there are techniques to snip a DNA molecule at a particular spot and insert the portion that codes the digital data. DNA in a living cell can then hold the data record, which gets replicated every time the cell divides. Reading the sequence of units in the DNA, or DNA sequencing, is now well developed, and the digital record can hence be recovered.
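As a rough illustration only, two bits of digital data can be mapped to each base; the actual encoding scheme in the paper is more elaborate, with randomization and rules that avoid troublesome sequences such as long runs of one base, but a minimal sketch conveys the idea:

```python
# Illustrative two-bits-per-base mapping; real encoders add randomization
# and avoid problem sequences (e.g. long runs of the same base).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {b: s for s, b in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i+2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    bits = "".join(BASE_TO_BITS[b] for b in strand)
    return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

assert decode(encode(b"hello")) == b"hello"
print(encode(b"hi"))  # prints 'CGGACGGC'
```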
The trouble, however, is with this last part of the process: the whole DNA needs to be sequenced before the digital record can be extracted, or, within the record, the whole record has to be read to locate the portion that is of interest. Early computer records, which were created on spools of magnetic tape, had this same feature of being placed one after the other along the length of the tape. Particular records were identified by a ‘header’ and an end marker, or by segments of the tape, but the tape had to be run from the beginning till the required record was found.
Running through the length of a spool of tape consumed power and time, and a large part of computing time was spent in the ‘sequential search’ for the many items of data that could be required even for a small computation. A great development was the floppy disk and the hard disk, which could be divided into tracks and sectors, so that a record could be directly accessed by its address, or known location on the disk. The disk is kept spinning, to rapidly scan the sectors, and heads that move radially pick up data from the different tracks hundreds of times a second. This method of access to data is known as random access, and it was probably a larger step in increasing computer speeds than improvements in the processors.
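The contrast is easy to see in code. Here is a hypothetical sketch, with made-up records, of a tape-style scan against a disk-style lookup by address:

```python
# Tape-style: scan records one by one until the header matches.
def sequential_find(records, wanted_header):
    for header, payload in records:       # O(n) scan from the start
        if header == wanted_header:
            return payload
    return None

# Disk-style: an index maps a key straight to its record.
def build_index(records):
    return {header: payload for header, payload in records}

records = [("rec1", "alpha"), ("rec2", "beta"), ("rec3", "gamma")]
index = build_index(records)
assert sequential_find(records, "rec3") == index["rec3"] == "gamma"
```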
What the researchers at Washington report in Nature Biotechnology is the equivalent, in a DNA record, of locating a specific portion without having to sequence the whole DNA-like molecule. There are other issues too that limit the value of DNA for large-scale digital recording, the authors say. One such is the frequency of errors when the record is first written in. The usual way this is handled is by creating multiple copies of the data to write, so that there is ‘redundancy’. This method, however, consumes resources in creating, and ideally verifying, the copies, and in the overheads of the process. In computer storage there are devices, like the ‘check digit’, or coding that enables recovery so long as no more than a certain number of errors have occurred. The Washington researchers report an improved method of coding that substantially reduces the extent of redundancy, and hence the complexity of data preparation and the writing effort required.
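By way of flavour only (the coding in the paper is far more capable), here is a toy check digit that detects a flipped bit, set against brute-force redundancy, where three copies simply outvote one error:

```python
# A toy 'check digit': detection via parity; correction via majority vote
# over three copies -- crude stand-ins for real error-correcting codes.
def parity(bits):                 # check digit: XOR of all the bits
    return sum(bits) % 2

def has_error(bits_with_check):
    *bits, check = bits_with_check
    return parity(bits) != check  # any single flipped bit is detected

def majority_vote(copies):        # brute-force redundancy: three copies
    return [max(set(col), key=col.count) for col in zip(*copies)]

word = [1, 0, 1, 1]
sent = word + [parity(word)]
assert not has_error(sent)
sent[2] ^= 1                      # flip one bit in transit
assert has_error(sent)
assert majority_vote([[1,0,1,1], [1,0,0,1], [1,0,1,1]]) == [1, 0, 1, 1]
```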
The main improvement, however, is the random, or non-sequential, access to records that has been made possible. This was done by creating a library of DNA stretches called ‘primers’, made with sufficient mutual differences that they can be readily told apart. The digital data to be recorded is first prepared with a degree of redundancy built in and formed into distinct segments according to a scheme. The segments are then converted into DNA sequences, and a primer, taken from the library, is attached to both ends of each sequence. This unique primer is the feature that allows random access to a particular record from a soup of the segments of all the data recorded. The DNA sequences are then put together as DNA strands, which can be dried and preserved.
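A sketch of the idea, with invented primer sequences; selecting strands by their end tags stands in for the PCR step, which in the laboratory amplifies only the strands flanked by the chosen primer pair:

```python
# Primer-tagged segments in a common pool; retrieval by end tags is a
# stand-in for PCR amplification. Primer sequences here are made up.
PRIMERS = {"file_A": ("ACGTAC", "TGCATG"), "file_B": ("GGATCC", "CCTAGG")}

def tag(segments, key):
    fwd, rev = PRIMERS[key]
    return [fwd + seg + rev for seg in segments]

def random_access(pool, key):
    fwd, rev = PRIMERS[key]
    return [s[len(fwd):-len(rev)] for s in pool
            if s.startswith(fwd) and s.endswith(rev)]

pool = tag(["AATT", "CCGG"], "file_A") + tag(["TTAA"], "file_B")
assert random_access(pool, "file_A") == ["AATT", "CCGG"]
assert random_access(pool, "file_B") == ["TTAA"]
```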
Retrieving data involves ‘rehydrating’ the DNA material and sequencing the bits of DNA. A four-stage process is employed to filter out dissimilar strands of DNA, and the portion of interest is separated by a process of iteration.
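One step of such decoding can be sketched as taking the most common base at each position across many noisy reads of the same strand; the actual pipeline is more involved, and handles insertions and deletions as well:

```python
# Hedged sketch of consensus decoding: combine noisy reads of one strand
# by taking the most common base at each position.
from collections import Counter

def consensus(reads):
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAG"]   # two reads carry errors
assert consensus(reads) == "ACGTAC"
```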
The method was tried out with 200 MB of data of different kinds, and the entire data could be recovered. The paper says most of the DNA sequences had to be read just five times. “This is half as much as the minimum coverage ever reported in decoding digital data from DNA,” the paper says.
The paper notes that as data storage in DNA needs synthetic DNA, this industry would have to scale up to meet data demands. This should be possible, as the quality demands of data applications are not as high as those of the life sciences, the paper says. And while DNA storage, because of its long-term durability, could be interesting even at the current stage, increasing throughput and falling costs are expected, the paper says.
------------------------------------------------------------------------------------------
Do respond to: response@simplescience.in