I'm an author of the paper. The title of this article is misleading; first, we encoded 650kB and made 70 billion copies... second, those 70 billion copies weigh 1 milligram... third, it's really only meant for archival purposes as it's immutable and not random access... fourth, it's expensive right now (at least this might be a solvable problem).
Nice work, and thanks for replying to all the questions here. I didn't read the paper, maybe it is addressed in there, but how much did it cost to synthesize that much DNA? Also, it could be random access if you PCR amplified the fragments you need based on the barcodes - you could even make a FAT (file allocation TUBE) which has all the file names and their barcodes.
thank you so much for the clarifications. I read the ScienceExpress writeup and was quite puzzled by where the 700TB were hiding. "You could store a hell of a lot of FASTQ reads in 700TB of DNA... hey, wait a second..."
I really liked their paper. It's a bit less over the top than the ExtremeTech guys, but hey, that's the difference between pop journalism and science.
Clearly with some form of fountain code or LDPC code you'd be able to get the data back. What struck me is that I had always thought of DNA as relatively unstable, in the sense that cells decay and die, but DNA that isn't busy expressing proteins under the influence of other cellular mechanisms really does just sit there. That was new for me.
When I showed it to my wife she pointed out that the sourdough starter she has been using since we were married came from her grandmother. I joked that the next Megaupload-type raid would have to sequence all the DNA they found in a place to figure out whether Shrek 3 was encoded in it somewhere. That would be painfully funny, I think.
well, the dna we used never touched the inside of a cell; plain dried dna is very stable, as evidenced by the ability to sequence samples tens of thousands of years old stored in decaying flesh (albeit with some errors).
A bit of support to that: it used to be common to send plasmids (circular strand of DNA) to people on paper. You'd take a drop of the plasmid, put it onto the paper, and let it dry. When the recipient got the card in the mail, they'd cut out around your dried plasmid and put it back into water. It would be perfectly functional, if needed.
If you needed more, you could then transform E. coli with the plasmid and let them do the work for you.
I don't know if this is done very much anymore, but this helps to show just how robust DNA can be. I want to say that a few years ago, someone tried to sell our lab an archival tool for DNA that was essentially this. You dried your DNA samples onto blotting paper in a grid, then when you needed a sample, you could go back and reconstitute by punching it out of the paper.
If you store data onto 50 DNA strands, can you always read back all the data from all 50 strands, or does one need to store multiple copies of each in case the sequencer can't "find" a particular strand?
If one does need multiple copies, it would seem that this method suffers from the coupon collector's problem [1]: collecting all 50 distinct strands requires drawing about 225 random strands on average. The retrieval rate could be improved by using a fountain code [2], which lets each strand simultaneously encode data at multiple addresses, so the number of strands you have to sample is only slightly more than the number of strands' worth of data you actually requested.
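For intuition, here's a quick simulation of that coupon-collector figure (my own sketch, not anything from the paper); it empirically reproduces the ~225 draws needed to see all 50 strands at least once:

    import random

    def draws_until_complete(n_strands=50):
        # Draw uniformly at random until every distinct strand has been seen once.
        seen, draws = set(), 0
        while len(seen) < n_strands:
            seen.add(random.randrange(n_strands))
            draws += 1
        return draws

    trials = [draws_until_complete() for _ in range(10_000)]
    print(sum(trials) / len(trials))  # ~225, matching n * H_n for n = 50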
To sequence, we used ~100x synthetic coverage on average and ~1000x sequencing coverage, so that's a whole lot of coverage. Even then we did have 10 bit errors, but all the data blocks were recovered.
we printed the dna microarray using agilent's ink-jet process (agilent is a spin-off of hp; imagine an inkjet with ACTG instead of CMYK). each spot on the array has many hundreds of thousands of molecules. after we cleave the dna off the array, we take a portion of that, amplify, and then sequence it. for the portion we took off, we estimate ~100 molecules for every oligo we made (55,000). we then sequence to get >55 million reads (so ~1000x coverage); but these are just averages, and the distribution varies. you can check out the supplement of the paper if you are interested.
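To make those averages concrete, here's the arithmetic (just restating the numbers above; the real per-oligo distribution varies):

    # Averages only, restating the figures quoted above.
    n_oligos = 55_000
    synthetic_coverage = 100          # ~100 molecules sampled per oligo
    total_reads = 55_000_000          # >55 million reads
    print(total_reads / n_oligos)     # ~1000x average sequencing coverage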
Does anybody know how to escape their horrible "mobile" version that they force onto ipad users? It can't even be zoomed :-(
More and more often I find myself not reading articles because someone thought it would be a great idea to create a non-scrolling, non-obvious, paginated "iPad format" with additional misleading, unintuitive buttons that look like native ones but do something different.
I'm considering buying an iPad, and I have a question: is there no browser on the iPad that allows you to choose whether you want the mobile version or the regular version? If not, that's almost a deal breaker.
Yes those browsers are available. Chrome has a "request desktop site" option and my personal favourite, iCab Mobile, has an extensive list of agent strings (custom ones can be added as well) that can be set.
I honestly haven't run into a site (except this one) that forces mobile sites on you without allowing you to opt out - however, I also use Chrome, so my being on a mobile device may not be properly detected yet.
And, of course, this brings us to the question: do we already have messages in our DNA? Here's a post (from 2007) on this: http://blog.sciencefictionbiology.com/2007/07/messages-in-ou.... Actually, if it's from the aliens who seeded life on Earth, wouldn't it more likely be in prokaryotic DNA?
Probably not... If we were seeded by aliens, then the message would have had to be in a very primitive form, so it would necessarily be seen in all living things, including prokaryotes. Unfortunately, bacterial genomes are much, much smaller than human ones, so there isn't much room to waste on hidden alien messages. Additionally, these messages presumably wouldn't be functional, so they would not be under any sort of selective pressure to be preserved. This means they would likely have been mutated away or lost over time.
That's not to say that we weren't seeded, just that if we were, any message would likely have been lost. Unless... maybe they seeded mitochondria intact, which are in all (ok, most) eukaryotes. Maybe that might work... :)
Unfortunately, chrM is pretty small too... so no hidden messages there either :(
This article is incredibly misleading. First of all, there's an inconsistency: the headline says they stored 700 terabytes, but the article later says that they actually stored about 700 kilobytes (their book) and then made 70 billion copies of it (~44 petabytes in aggregate). The main thing is that storing 700 kilobytes and making 70 billion copies is considerably less useful than storing 700 terabytes of distinct data outright. Aside from that, though, this is awesome, and a huge step forward into promising and uncharted territory.
in the paper, we say 1.5mg per petabyte at large scales. we only encode 650kB or so. this seems a little sensationalistic. we are far away from being able to do 1 petabyte of arbitrary information.
Wait, 1.5mg per petabyte at large scales? Wouldn't that mean a gram could hold (1000/1.5) 667 petabytes and presumably scalable to many grams (eventually)? I understand it's only 650kB right now, but the density is obviously still incredible.
i think you have the numbers wrong. at least in the supplement of our paper we say a petabyte would weigh ~1.5 milligram. It would be far too expensive to do that though; about 6-8 orders of magnitude increase in scale is necessary from current technologies. that said, we've seen that kind of drop over the last decade or so; here's to keeping it going.
we didn't because we wanted to avoid particular sequence features that are difficult to synthesize and sequence. we probably could have gotten away with something like 1.8 bits per base, but we were already doing fine on density, so we thought a 2x hit wouldn't be that bad.
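For anyone curious what one bit per base looks like in practice, here's a minimal sketch: 0 maps to A or C, 1 maps to G or T, and you pick whichever option doesn't repeat the previous base, so homopolymer runs never appear. (The paper's actual rules for choosing between the two options are more involved; this is just an illustration.)

    def encode_bits(bits):
        # 0 -> A or C, 1 -> G or T; avoid repeating the previous base.
        bases = []
        for b in bits:
            options = "AC" if b == 0 else "GT"
            pick = options[0] if not bases or bases[-1] != options[0] else options[1]
            bases.append(pick)
        return "".join(bases)

    print(encode_bits([0, 0, 0, 1, 1, 1, 0, 1]))  # "ACAGTGAG" -- no homopolymers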
You just encode a big marker (making sure it's not a palindrome-paired version of itself!) as a header. If you see that, the strand is in the correct orientation. If not, it isn't.
This header idea is great because then you only need to keep one strand and can toss the other, potentially quadrupling the amount of data storage (I'm assuming you can keep single strands of DNA stable).
[Left strand]
A = 00
T = 01
C = 10
G = 11
[Right strand]
T = 00
A = 01
G = 10
C = 11
Anyone know these guys at Harvard, b/c this might be a way to put, at most, 2800 terabytes in a gram? (I don't know how long the header sequences would have to be).
It's possible to have single stranded DNA, but you'd have problems with error correction.
Let's say DNA breaks, or some errors appear in the code. Thanks to the double stranded structure it's "quite easy" to repair the code.
Besides that, it's not the density which is a problem right now, but the access speed. The amount of data in DNA is so immense that doubling the density won't give any practical improvements for decades to come - if ever.
Having said that, if I'm not mistaken, some viruses have genomes of single-stranded DNA or single-stranded RNA. I'm not sure, but density might be the reason for that.
I don't think you can simply toss one of the strands. DNA is so compact because of the way it coils, and you likely lose that if you only have one strand.
that's irrelevant since you have 2 strands that are mirror copies of each other. just prefix your data with a single A; if it's read as a T, invert the bits of the rest of the strand.
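A toy sketch of that prefix idea (the marker and function names here are made up, and a real marker would be longer): prefix each strand with a short non-palindromic marker; if a read instead ends with the marker's reverse complement, you sequenced the other strand and can flip it back before decoding.

    COMP = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        # Reverse complement: complement each base, then reverse the string.
        return seq.translate(COMP)[::-1]

    MARKER = "ACCA"  # not equal to its own reverse complement ("TGGT")

    def orient(read):
        if read.startswith(MARKER):
            return read                      # read came from the intended strand
        if read.endswith(revcomp(MARKER)):
            return revcomp(read)             # read came from the other strand; flip it
        raise ValueError("marker not found; read may be damaged")

    print(orient("ACCA" + "GATTACA"))            # unchanged
    print(orient(revcomp("ACCA" + "GATTACA")))   # restored to "ACCAGATTACA"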
Ah, interesting. In natural DNA this is achieved by using a more complex encoding scheme (3 base pairs -> one amino acid), combined with the vast majority of DNA not encoding genes directly.
I notice that the article fails to mention how long it would take to extract all 700 terabytes of data...
Assuming 5.5 petabits stored with 1 base pair representing 1 bit, we can extrapolate the time required to extract the data based off the time taken to sequence the human genome (3 billion base pairs).
5.5 petabits / 3 billion bits ~= 2 million, so theoretically it should take roughly 2 million times longer than sequencing a single human genome.
3 years ago, there was an Ars Technica article about how it now only takes 1 month to sequence a human genome[1]; the article now claims that microfluidic chips can perform the same task in hours.
Assuming 2 hours (low end) to sequence the human genome:
2 hours * 2 million = 4 million hours = 456 years, give or take a few years.
So, maybe not so great for storing enormous amounts of data. But if you want to store 1 GB, it would only take ~6 hours. Not too bad.
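The same back-of-envelope, spelled out (all inputs are the rough assumptions above, not measured figures; the 456-year figure comes from rounding 1.8 million up to 2 million):

    human_genome_bp  = 3e9        # ~3 billion base pairs
    total_bits       = 5.5e15     # 5.5 petabits at 1 bit per base
    hours_per_genome = 2          # optimistic microfluidic estimate

    scale = total_bits / human_genome_bp          # ~1.8 million genome-equivalents
    print(scale * hours_per_genome / 24 / 365)    # ~420 years for the whole archive

    print((8e9 / human_genome_bp) * hours_per_genome)  # ~5.3 hours for 1 GB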
tldr: we are 6-8 orders of magnitude away from doing petabytes routinely. that said, the costs of sequencing/synthesis have dropped by that kind of magnitude over the last decade or so, though there are many barriers to that continuing for another decade.
I think that DNA, if ever used in practice, will be primarily used for long-term, archival storage of data that's rarely accessed, if ever. However there are a couple of things that alleviate the problems with reading the DNA.
1) DNA sequencing technology is currently advancing much faster than silicon technology, so give it time and it's likely that it will catch up with current hard disk reading speed at comparable sizes and volumes.
2) DNA's self-hybridizing nature makes it easy to pull out blocks with specific addresses (if you wait for the hybridization). So if you include address labels in the DNA as you write it out, you can probably pull it out in chunks of kilobase to megabase at a time.
3) As the other commenter pointed out, this is extremely easy to parallelize. So if you want to go twice as fast, divide the sample in half and put half in each machine. Dilute and pipette as necessary.
Yes, the microfluidics used today take advantage of this to read large numbers of small segments of DNA in parallel. The current "gold standard" for DNA sequencing (manufactured by Illumina) uses millions of tiny fragments of DNA which are read optically as each sequence is extended.
They're teaching us this in computer science and I wonder whether it's total crap or not. Can you please shed some light on this?
"In humans, the deoxyribonucleic acid (DNA, Germ. DNS) is the carrier of genetic information, and the main constituent of the chromosomes.
DNA is a chain-like polymer of nucleotides, which differ in their nitrogen bases (Thymin/Cytosin bzw. Adenin/Guanin,)
The alphabet of the code is therefore: {Thymin, Cytosin, Adenin, Guanin,} or also { T, C, A, G }
Three consecutive bases form a word
So there are 43 = 64 combinations per word
so the word length is ld (64) bits = 6 bits
A gene contains about 200 words
A chromosome contains about 104 to 105 genes
The number of chromosomes per cell nucleus is 46 in humans
The stored data per nucleus have, a volume of 6 bit * 200 * 10^5 * 46 = 55200 bit * 10^5 * 5 * 10^9 bit * * 10^9 Byte = 1 GByte"
You aren't far off, but there are some errors. Excuse in advance any errors below, it's been a while since I did this stuff daily.
The human genome is 23 chromosomes, with 2 copies of each (46 total). The two copies are not mirror images like RAID1; rather, they carry alleles (different versions of the same gene), so you might have a different version of a gene on each of the two homologous chromosomes. The two alleles can also be the same; that's homozygous versus heterozygous, and it's basic Mendelian genetics. This shuffling of alleles through sexual reproduction is why sexual organisms can evolve (especially at the population level) so much faster than asexual ones. Most genes work this way, though many are more complex.
ATGC is correct, so each position is base 4 (2 bits per base).
Genes encode proteins; every 3 bases form a codon, which specifies which amino acid to use. While there could be 4^3 = 64 codons, in practice only 20 amino acids are used in nature to make functional proteins. Not sure where you got ~200 codons per gene; it may be close to an average, but genes vary greatly in length, so the range is huge. In any case, for data storage, anything relating to codons and proteins is irrelevant.
Also, in practice not all of a chromosome encodes proteins. There is often a lot of buffer region between genes, not to mention plenty of regulatory sequence that helps control when genes are expressed. Beyond that, the ends of chromosomes (telomeres) don't carry many genes; they're mostly repetitive sequence, and much of the non-coding genome ("junk" DNA and the like) is still being explored.
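As a sanity check on the lecture's 1 GByte figure, here's the usual back-of-envelope from raw genome size (this ignores repeats, compressibility, and the near-identity of the two copies):

    haploid_bp  = 3.2e9    # ~3.2 billion base pairs in one copy (23 chromosomes)
    bits_per_bp = 2        # 4 possible bases = 2 bits each

    haploid_gb = haploid_bp * bits_per_bp / 8 / 1e9
    print(haploid_gb)        # ~0.8 GB for one copy
    print(2 * haploid_gb)    # ~1.6 GB for the diploid nucleus (46 chromosomes)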
>> And how much data did they fit into one DNA pair?
Info: according to Quora, the total weight of DNA in a human body is about 60 g, which would mean a human could carry 700 TB/g * 60 g = 42,000 TB.
Note: 700TB = 7.69658139 × 10^14 bytes
>To store the same kind of data on hard drives — the densest storage medium in use today — you’d need 233 3TB drives, weighing a total of 151 kilos.
But hard drives aren't the densest storage medium in use today. A microSD card can hold up to 64 gigabytes and is 0.5 grams. 700 terabytes would be only 5.6 kilograms.
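Spelling out that comparison (nominal figures from the comment above; the 5.6 kg figure comes from using binary terabytes):

    card_capacity_gb = 64
    card_mass_g      = 0.5
    target_tb        = 700

    cards = target_tb * 1000 / card_capacity_gb   # ~10,900 cards
    print(cards * card_mass_g / 1000)             # ~5.5 kg of microSD cards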
Can we encode all of human knowledge into the DNA of some organism? How can organisms access data stored in their DNA? Imagine being born with knowledge of every Wikipedia article, or even every website. What would that be like?
I think you're misunderstanding things a bit. In the context of DNA as a data storage medium, it no longer functions as a "building block of life". It'd be like taking some binary image data and overwriting it with binary audio data. You can still technically interpret it as image data, but it's just going to be a random jumbled mess, and will probably error out immediately.
But you can also embed interesting information into an image without distorting the image itself (much). Similarly, the DNA of most organisms has a ton of garbage space that could be used to encode information without hurting the organism.
Not saying it's practical or desirable, just that it is possible.
Leads me to wonder whether the current "garbage" sequences in our own DNA could be encoded information that we were meant to decode - and by that I mean the information may not be entirely genetic code.
They are not garbage sequences, iirc; they control gene expression. DNA is a string that is coiled tightly several times over itself (can't explain it better than that, see the video below :)). So the "garbage" sequence is like padding that pushes genes into and out of regions that can be read. Changing the amount or composition(?) of that padding can turn a gene on or off simply because it becomes inaccessible to the transcription machinery.
Then who would be in charge of creating new knowledge, and who would put it into our brains or DNA? Knowing everything humans have discovered so far would only change the types of jobs we would have; we would still need to have jobs, though.
Your parent asks about simply storing data in DNA-form, and splicing it into an organism. Splicing computer data into your DNA does not immediately fill your head with knowledge.
If it did, why, you'd never have to study protein sequences; you'd already know how to make hemoglobin!
At the point where we'd engaged in massive-scale genetic engineering to insert knowledge into our DNA chain, jobs may no longer be the significant driving force of our lives. Or, at least, I'd like to hope that we'd have been able to move beyond them.
State-of-the-art flash is manufactured at a 19 nm feature size, and an actual flash memory cell is larger than that. In contrast, the DNA double helix is only about 2 nm in diameter, and each base pair adds roughly 0.34 nm of length along the helix.
Flash drives also carry all of the equipment needed to read them. I wonder how much data per gram the bare storage die of a flash drive holds. I doubt it's better than DNA, but it seems disingenuous for the article to offer the hard-drive analogy without taking into account the extra read/write hardware.
agreed, it should be compared to other archival media like tape drives; but it's still pretty similar. in our paper we compare against a hard drive platter rather than the drive itself; even then we are approximately a million-fold more dense.
The paper is exciting, in the calm measured way that scientists are. I look forward to seeing huge data storage on DNA in the future.
I'm gently concerned about what'll happen to information if it's not available to future people. Is anyone taking the most important documents of our civilisation and encoding them onto clay tablets, or some such?
Probably because they don't have a reliable way of differentiating one strand of the double helix from the other, so the matching bases are treated as a single unit and it doesn't matter which way round you read the pair.
I imagine this could be solved by ensuring that every helix starts out with something like a byte order mark that would distinguish the two strands reliably.
Storage density could be further increased by a constant factor if more kinds of bases were used, or if they could use single-strand DNA/RNA instead (which would probably require some chemical means of ensuring that a free strand doesn't accidentally bind to something else).
Each base can only pair with one other base: adenine with thymine, and guanine with cytosine. Knowing one base implies the other, so there are only two possible pairs.
The next big hurdle is how to develop a household DNA sequencer under $50 that can read your storage. I mean, if I want to store my data on a DNA strand, then one day I'll need to read that data back at home with a sequencer, right?
To read the data out are they basically doing de novo assembly on the sequenced reads? How are they handling all of the errors in gene sequencing? How about assembly errors? Long repeats?
If your strand length is less than or equal to the read length of your sequencer, and you have the address blocks at the start of every sequence, you don't really need to worry about assembly. Read depth and/or a checksum of some kind will take care of errors in sequencing, and with short strands or compression of some kind, long repeats aren't much of a problem either.
we don't ever assemble; we are reading 115bp (96bp data, 19bp address) and merging paired reads using SeqPrep to reduce errors. a few other things are done, but the basic idea is that we just take all reads at a particular barcode and call a consensus by majority vote. 10 bit errors in 5.27e6.
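A toy sketch of that consensus step (my own illustration, not the actual pipeline, which also merges paired reads, trims primers, and so on): group reads by address barcode, then call each position by majority vote.

    from collections import Counter, defaultdict

    def consensus(reads):
        # Assumes the reads for a barcode are equal length and already aligned.
        return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

    by_barcode = defaultdict(list)
    for barcode, data in [("0001", "ACGT"), ("0001", "ACGT"), ("0001", "ACGA")]:
        by_barcode[barcode].append(data)

    print({bc: consensus(rs) for bc, rs in by_barcode.items()})  # {'0001': 'ACGT'}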
So you don't ever make a large single molecule like you'd find inside cells? You're saying you create the fragments and then read back those same fragments?
AFAIK, evolution doesn't occur because of some tendency of DNA to just suddenly change out from under you. It happens because of mutation and recombination, both of which are things that are done to the DNA.
Snarky reply: you've sequenced every single one of those cells and confirmed that the DNA all matches?
Non-snarky reply: There is a huge difference between phenotype stability and genotype stability. Take 100 cells from your body and you'll find hundreds if not thousands of genetic differences between them (single mutations).
There are 3 billion base pairs in each copy of human DNA. Looks like only ~60 of them change from one generation to the next. http://www.nytimes.com/2010/03/11/health/research/11gene.htm... Just making the 4 trillion cells in my body means that each would have to be copied an average of 41-42 times. Probably some cells get replaced more often and others have a shorter path, so I'm not sure how many times DNA is duplicated before it gets to human reproduction. But going with averages: 41 * 3,000,000,000 / 60 means 2,050,000,000 copies per error. That's a few orders of magnitude worse than a modern hard drive but not exactly shabby.
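Spelling that arithmetic out (all inputs are the comment's rough estimates):

    import math

    bp_per_genome = 3_000_000_000
    new_mutations = 60                       # ~60 de novo changes per generation
    cells_in_body = 4e12                     # the comment's estimate

    doublings = math.log2(cells_in_body)     # ~41.9 cell divisions deep
    print(doublings * bp_per_genome / new_mutations)  # ~2.1e9 bases copied per error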
The comment that spurred my comment concerned the stability of DNA.
You have to remember that offspring are only going to have the best DNA passed down to them, since many mutations would result in non-functioning gametes or non-viable offspring.
The best example I can think of is spermatozoa production. DNA is copied in that process, and a large percentage of spermatozoa are non-functional.
I know. If a germ cell has DNA that has been copied 41 times (the average that I calculated), and that DNA has accumulated 60 errors, that means 1 error per 2 billion base pairs copied.
Error correction redundancy to any level of reliability you want takes log(N) extra storage. My question would be what the access speeds are. If you have to read it by running it through a trillion PCR test tubes, this isn't exactly practical.