Why are hard drive companies investing in DNA data storage?

The research community is excited about the potential of DNA to function as long-term archival storage. That’s largely because it’s extremely dense, chemically stable for tens of thousands of years, and comes in a format we’re unlikely to forget how to read. While there has been some interesting progress, efforts have mostly stayed in the research community because of the high costs and extremely slow read and write speeds. These are problems that need to be solved before DNA-based storage can be practical.

So we were surprised to hear that storage giant Seagate had entered into a collaboration with a DNA-based storage company called Catalog. To find out how close the company’s technology is to being useful, we talked to Catalog’s CEO, Hyunjun Park. Park indicated that Catalog’s approach is counterintuitive on two levels: It doesn’t store data the way you’d expect, and it isn’t focusing on archival storage at all.

A different sort of storage

DNA is a molecule that can be thought of as a linear array of bases, with each base being one of four distinct chemicals: A, T, C, or G. Typically, each base of the DNA molecule is used to hold two bits of information, with the bit values conveyed by the specific base that is present. So A can encode 00, T can encode 01, C can encode 10, and G can encode 11; with this encoding, the molecule AA would store 0000, while AC would store 0010, and so on. We can synthesize DNA molecules hundreds of bases long with high efficiency, and we can add flanking sequences that provide the equivalent of file system information, telling us which part of a chunk of binary data an individual piece of DNA represents.

The problem with this approach is that the longer the string of bits is that you want to store, the more time and money it takes. Robotic hardware performs the synthesis reactions, and each hardware unit can only synthesize a single DNA molecule at a time. The raw materials the hardware uses to perform that synthesis also add a cost for each stored molecule. While this isn’t a concern for small-scale demonstration projects, the costs quickly become prohibitive if you start storing large amounts of data. Citing a DNA synthesis cost of about .03 cents per base, Park said, “.03 cents times two bits per base pair times, say, gigabytes—that’s a lot of money. That’s millions of dollars.”

Park told Ars that Catalog started by rethinking the encoding process to get around this bottleneck. The company’s encoding starts with a library of dozens to hundreds of short pieces of DNA called oligos (short for oligonucleotide). Each bit in the data is then assigned a unique combination of oligos—you can think of this as a bit like a silicon processor assigning a bit in memory a unique, 64-bit address. If that bit is a 1, a robot can gather small samples of solutions containing each of the oligos needed to represent it and combine them with an enzyme that can link all of the oligos together.

The enzyme merges the oligos into a single, longer DNA molecule that contains the unique signature of the bit. If, in contrast, the bit is a zero, the corresponding DNA for its address isn’t synthesized.

All of the molecules that are produced can then be combined in a single solution (which can be dried out for long-term storage). To read the data, the population of DNA molecules is sequenced, and an algorithm recognizes the unique combination of oligos present in each molecule. The recognized addresses are assigned a 1; the rest, a 0. This restores the data that was encoded to digital form.

This system is far less efficient in data/DNA than storing two bits in every base. But the individual molecules remain small enough that it’s still an impressively compact and stable storage medium. And it saves significant time and money due to a fundamental asymmetry: It’s far cheaper to synthesize a lot of one specific DNA sequence than it is to synthesize small amounts of lots of different DNA sequences. So by assembling DNA using a small bit of a large volume of pre-made DNA, the cost of synthesis goes down dramatically. Each assembly reaction can also be run in parallel; in contrast, synthesizing individual sequences ties up the machine they’re running on until the synthesis is complete.