I'm an author of the paper. The title of this article is misleading; first, we encoded 650kB and made 70 billion copies... second, those 70 billion copies weigh 1 milligram... third, it's really only meant for archival purposes as it's immutable and not random access... fourth, it's expensive right now (at least this might be a solvable problem).
Nice work, and thanks for replying to all the questions here. I didn't read the paper, maybe it is addressed in there, but how much did it cost to synthesize that much DNA? Also, it could be random access if you PCR amplified the fragments you need based on the barcodes - you could even make a FAT (file allocation TUBE) which has all the file names and their barcodes.
thank you so much for the clarifications. I read the ScienceExpress writeup and was quite puzzled by where the 700TB were hiding. "You could store a hell of a lot of FASTQ reads in 700TB of DNA... hey, wait a second..."
I really liked their paper. It's a bit less over the top than the ExtremeTech guys, but hey, that's the difference between pop journalism and science.
Clearly with some form of fountain code or LDPC code you'd be able to get the data back. What struck me is that I had always thought of DNA as relatively unstable, in the sense that cells decay and die, but DNA that isn't busy expressing proteins under the influence of other cellular mechanisms really does just sit there. That was new for me.
When I showed it to my wife she pointed out that the sourdough starter she has been using since we were married came from her grandmother. I joked that the next Megaupload-type raid would have to sequence all the DNA they found in a place to figure out whether Shrek 3 was encoded in it somewhere. That would be painfully funny, I think.
well, the dna we used never touched the inside of a cell; plain dried dna is very stable, as evidenced by the ability to sequence samples tens of thousands of years old stored in decaying flesh (albeit with some errors).
A bit of support to that: it used to be common to send plasmids (circular strand of DNA) to people on paper. You'd take a drop of the plasmid, put it onto the paper, and let it dry. When the recipient got the card in the mail, they'd cut out around your dried plasmid and put it back into water. It would be perfectly functional, if needed.
If you needed more, you could then transform E. coli with the plasmid and let them do the work for you.
I don't know if this is done very much anymore, but this helps to show just how robust DNA can be. I want to say that a few years ago, someone tried to sell our lab an archival tool for DNA that was essentially this. You dried your DNA samples onto blotting paper in a grid, then when you needed a sample, you could go back and reconstitute by punching it out of the paper.
If you store data onto 50 DNA strands, can you always read back all the data from all 50 strands, or does one need to store multiple copies of each in case the sequencer can't "find" a particular strand?
If one does need multiple copies, it would seem that this method suffers from the coupon collector's problem [1]: collecting all 50 distinct strands requires drawing about 225 random strands on average. The retrieval rate could be improved by using a fountain code [2], which lets each strand simultaneously encode data at multiple addresses, so the number of strands you have to sample is only slightly more than the number of strands' worth of data you actually requested.
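For intuition, here's a quick simulation of that coupon-collector figure (my own sketch, not anything from the paper); it empirically reproduces the ~225 draws needed to see all 50 strands at least once:

    import random

    def draws_until_complete(n_strands=50):
        # Draw uniformly at random until every distinct strand has been seen once.
        seen, draws = set(), 0
        while len(seen) < n_strands:
            seen.add(random.randrange(n_strands))
            draws += 1
        return draws

    trials = [draws_until_complete() for _ in range(10_000)]
    print(sum(trials) / len(trials))  # ~225, matching n * H_n for n = 50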
To sequence, we used ~100x synthetic coverage on average and ~1000x sequencing coverage, so that's a whole lot of coverage. Even then we did have 10 bit errors, but all the data blocks were recovered.
we printed the dna microarray using agilent's ink-jet process (agilent is a spin-off of hp; imagine an inkjet with ACTG instead of CMYK). each spot on the array has many hundreds of thousands of molecules. after we cleave the dna off the array, we take a portion of that, amplify, and then sequence it. for the portion we took off, we estimate ~100 molecules for every oligo we made (55,000). we then sequence to get >55 million reads (so ~1000x coverage); but these are just averages, and the distribution varies. you can check out the supplement of the paper if you are interested.
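To make those averages concrete, here's the arithmetic (just restating the numbers above; the real per-oligo distribution varies):

    # Averages only, restating the figures quoted above.
    n_oligos = 55_000
    synthetic_coverage = 100          # ~100 molecules sampled per oligo
    total_reads = 55_000_000          # >55 million reads
    print(total_reads / n_oligos)     # ~1000x average sequencing coverage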
Does anybody know how to escape their horrible "mobile" version that they force onto ipad users? It can't even be zoomed :-(
More and more often I find myself not reading articles because someone thought it would be a great idea to create a non-scrolling, non-obvious, paginated "iPad format" with additional misleading, unintuitive buttons that look like native ones but do something different.
I'm considering buying an iPad, and I have a question: is there no browser on the iPad that allows you to choose whether you want the mobile version or the regular version? If not, that's almost a deal breaker.
Yes those browsers are available. Chrome has a "request desktop site" option and my personal favourite, iCab Mobile, has an extensive list of agent strings (custom ones can be added as well) that can be set.
I honestly haven't run into a site (except this one) that forces mobile sites on you without allowing you to opt out - however, I also use Chrome, so my being on a mobile device may not be properly detected yet.
And, of course, this brings us to the question: do we already have messages in our DNA? Here's a post (from 2007) on this: http://blog.sciencefictionbiology.com/2007/07/messages-in-ou.... Actually, if it's from the aliens who seeded life on Earth, wouldn't it more likely be in prokaryotic DNA?
Probably not... If we were seeded by aliens, then the message would have had to be in a very primitive form, so it would necessarily be seen in all living things, including prokaryotes. Unfortunately, bacterial genomes are much, much smaller than human ones, so there isn't much room to waste on hidden alien messages. Additionally, these messages presumably wouldn't be functional, so they would not be under any sort of selective pressure to be preserved. This means they would likely have been mutated away or lost over time.
That's not to say that we weren't seeded, just that if we were, any message would likely have been lost. Unless... maybe they seeded mitochondria intact, which are in all (ok, most) eukaryotes. Maybe that might work... :)
Unfortunately, chrM is pretty small too... so no hidden messages there either :(
This article is incredibly misleading. First of all, there's an inconsistency: the headline says they stored 700 terabytes, but the article later says that they actually stored about 700 kilobytes (their book) and then made 70 billion copies of it (~44 petabytes in aggregate). The main thing is that storing 700 kilobytes and making 70 billion copies is considerably less useful than storing 700 terabytes of distinct data outright. Aside from that, though, this is awesome, and a huge step forward into promising and uncharted territory.
in the paper, we say 1.5mg per petabyte at large scales. we only encode 650kB or so. this seems a little sensationalistic. we are far away from being able to do 1 petabyte of arbitrary information.
Wait, 1.5mg per petabyte at large scales? Wouldn't that mean a gram could hold (1000/1.5) 667 petabytes and presumably scalable to many grams (eventually)? I understand it's only 650kB right now, but the density is obviously still incredible.
i think you have the numbers wrong. at least in the supplement of our paper we say a petabyte would weigh ~1.5 milligram. It would be far too expensive to do that though; about 6-8 orders of magnitude increase in scale is necessary from current technologies. that said, we've seen that kind of drop over the last decade or so; here's to keeping it going.
we didn't because we wanted to avoid particular sequence features that are difficult to synthesize and sequence. we probably could have gotten away with something like 1.8 bits per base, but we were already doing fine on density, so we thought a 2x hit wouldn't be that bad.
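For anyone curious what one bit per base looks like in practice, here's a minimal sketch: 0 maps to A or C, 1 maps to G or T, and you pick whichever option doesn't repeat the previous base, so homopolymer runs never appear. (The paper's actual rules for choosing between the two options are more involved; this is just an illustration.)

    def encode_bits(bits):
        # 0 -> A or C, 1 -> G or T; avoid repeating the previous base.
        bases = []
        for b in bits:
            options = "AC" if b == 0 else "GT"
            pick = options[0] if not bases or bases[-1] != options[0] else options[1]
            bases.append(pick)
        return "".join(bases)

    print(encode_bits([0, 0, 0, 1, 1, 1, 0, 1]))  # "ACAGTGAG" -- no homopolymers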
You just encode a big marker (making sure it's not a palindrome-paired version of itself!) as a header. If you see that, the strand is in the correct orientation. If not, it isn't.
This header idea is great because then you only need to keep one strand and can toss the other, potentially quadrupling the amount of data storage (I'm assuming you can keep single strands of DNA stable).
[Left strand]
A = 00
T = 01
C = 10
G = 11
[Right strand]
T = 00
A = 01
G = 10
C = 11
Anyone know these guys at Harvard, b/c this might be a way to put, at most, 2800 terabytes in a gram? (I don't know how long the header sequences would have to be).
It's possible to have single stranded DNA, but you'd have problems with error correction.
Let's say DNA breaks, or some errors appear in the code. Thanks to the double stranded structure it's "quite easy" to repair the code.
Besides that, it's not the density which is a problem right now, but the access speed. The amount of data in DNA is so immense that doubling the density won't give any practical improvements for decades to come - if ever.
Having said that, if I'm not mistaken, some viruses have genomes of single-stranded DNA or single-stranded RNA. I'm not sure, but density might be the reason for that.
I don't think you can simply toss one of the strands. DNA is so compact because of the way it coils, and you likely lose that if you only have one strand.
that's irrelevant since you have 2 strands that are mirror copies of each other. just prefix your data with a single A; if it's read as a T, invert the bits of the rest of the strand.
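A toy sketch of that prefix idea (the marker and function names here are made up, and a real marker would be longer): prefix each strand with a short non-palindromic marker; if a read instead ends with the marker's reverse complement, you sequenced the other strand and can flip it back before decoding.

    COMP = str.maketrans("ACGT", "TGCA")

    def revcomp(seq):
        # Reverse complement: complement each base, then reverse the string.
        return seq.translate(COMP)[::-1]

    MARKER = "ACCA"  # not equal to its own reverse complement ("TGGT")

    def orient(read):
        if read.startswith(MARKER):
            return read                      # read came from the intended strand
        if read.endswith(revcomp(MARKER)):
            return revcomp(read)             # read came from the other strand; flip it
        raise ValueError("marker not found; read may be damaged")

    print(orient("ACCA" + "GATTACA"))            # unchanged
    print(orient(revcomp("ACCA" + "GATTACA")))   # restored to "ACCAGATTACA"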
Ah, interesting. In natural DNA this is achieved by using a more complex encoding scheme (3 base pairs -> one amino acid), combined with the vast majority of DNA not encoding genes directly.
I notice that the article fails to mention how long it would take to extract all 700 terabytes of data...
Assuming 5.5 petabits stored with 1 base pair representing 1 bit, we can extrapolate the time required to extract the data based off the time taken to sequence the human genome (3 billion base pairs).
5.5 petabits / 3 billion bits ~= 2 million, so theoretically it should take roughly 2 million times longer than sequencing a single human genome.
3 years ago, there was an Ars Technica article about how it now only takes 1 month to sequence a human genome[1]; the article now claims that microfluidic chips can perform the same task in hours.
Assuming 2 hours (low end) to sequence the human genome:
2 hours * 2 million = 4 million hours = 456 years, give or take a few years.
So, maybe not so great for storing enormous amounts of data. But if you want to store 1 GB, it would only take ~6 hours. Not too bad.
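The same back-of-envelope, spelled out (all inputs are the rough assumptions above, not measured figures; the 456-year figure comes from rounding 1.8 million up to 2 million):

    human_genome_bp  = 3e9        # ~3 billion base pairs
    total_bits       = 5.5e15     # 5.5 petabits at 1 bit per base
    hours_per_genome = 2          # optimistic microfluidic estimate

    scale = total_bits / human_genome_bp          # ~1.8 million genome-equivalents
    print(scale * hours_per_genome / 24 / 365)    # ~420 years for the whole archive

    print((8e9 / human_genome_bp) * hours_per_genome)  # ~5.3 hours for 1 GB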
tldr: we are 6-8 orders of magnitude away from doing petabytes routinely. that said, the costs of sequencing/synthesis have dropped by that kind of magnitude over the last decade or so, though there are many barriers to that continuing for another decade.
I think that DNA, if ever used in practice, will be primarily used for long-term, archival storage of data that's rarely accessed, if ever. However there are a couple of things that alleviate the problems with reading the DNA.
1) DNA sequencing technology is currently advancing much faster than silicon technology, so give it time and it's likely that it will catch up with current hard disk reading speed at comparable sizes and volumes.
2) DNA's self-hybridizing nature makes it easy to pull out blocks with specific addresses (if you wait for the hybridization). So if you include address labels in the DNA as you write it out, you can probably pull it out in chunks of kilobase to megabase at a time.
3) As the other commenter pointed out, this is extremely easy to parallelize. So if you want to go twice as fast, divide the sample in half and put half in each machine. Dilute and pipette as necessary.
Yes, the microfluidics used today take advantage of this to read large numbers of small segments of DNA in parallel. The current "gold standard" for DNA sequencing (manufactured by Illumina) uses millions of tiny fragments of DNA which are read optically as each sequence is extended.
They're teaching us this in computer science and I wonder whether it's total crap or not. Can you please shed some light on this?
"In humans, the deoxyribonucleic acid (DNA, Germ. DNS) is the carrier of genetic information, and the main constituent of the chromosomes.
DNA is a chain-like polymer of nucleotides, which differ in their nitrogen bases (Thymin/Cytosin bzw. Adenin/Guanin,)
The alphabet of the code is therefore: {Thymin, Cytosin, Adenin, Guanin,} or also { T, C, A, G }
Three consecutive bases form a word
So there are 43 = 64 combinations per word
so the word length is ld (64) bits = 6 bits
A gene contains about 200 words
A chromosome contains about 104 to 105 genes
The number of chromosomes per cell nucleus is 46 in humans
The stored data per nucleus have, a volume of 6 bit * 200 * 10^5 * 46 = 55200 bit * 10^5 * 5 * 10^9 bit * * 10^9 Byte = 1 GByte"
You aren't far off, but there are some errors. Excuse in advance any errors below, it's been a while since I did this stuff daily.
The human genome is 23 chromosomes, with 2 copies of each (46 total). The two copies are not mirror images like RAID1; rather, they carry alleles (different versions of the same gene), so you might have a different version of a gene on each of the two homologous chromosomes. The two alleles can also be the same; that's homozygous versus heterozygous, and it's basic Mendelian genetics. This shuffling of alleles through sexual reproduction is why sexual organisms can evolve (especially at the population level) so much faster than asexual ones. Most genes work this way, though many are more complex.
ATGC is correct, so each position is base 4 (2 bits per base).
Genes encode proteins; every 3 bases form a codon, which specifies which amino acid to use. While there could be 4^3 = 64 codons, in practice only 20 amino acids are used in nature to make functional proteins. Not sure where you got ~200 codons per gene; it may be close to an average, but genes vary greatly in length, so the range is huge. In any case, for data storage, anything relating to codons and proteins is irrelevant.
Also, in practice not all of a chromosome encodes proteins. There is often a lot of buffer region between genes, not to mention plenty of regulatory sequence that helps control when genes are expressed. Beyond that, the ends of chromosomes (telomeres) don't carry many genes; they're mostly repetitive sequence, and much of the non-coding genome ("junk" DNA and the like) is still being explored.
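As a sanity check on the lecture's 1 GByte figure, here's the usual back-of-envelope from raw genome size (this ignores repeats, compressibility, and the near-identity of the two copies):

    haploid_bp  = 3.2e9    # ~3.2 billion base pairs in one copy (23 chromosomes)
    bits_per_bp = 2        # 4 possible bases = 2 bits each

    haploid_gb = haploid_bp * bits_per_bp / 8 / 1e9
    print(haploid_gb)        # ~0.8 GB for one copy
    print(2 * haploid_gb)    # ~1.6 GB for the diploid nucleus (46 chromosomes)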
>> And how much data did they fit into one DNA pair?
Info: according to Quora, the total weight of DNA in a human body is about 60 g, which would mean a human could carry 700 TB/g * 60 g = 42,000 TB.
Note: 700TB = 7.69658139 × 10^14 bytes
>To store the same kind of data on hard drives — the densest storage medium in use today — you’d need 233 3TB drives, weighing a total of 151 kilos.
But hard drives aren't the densest storage medium in use today. A microSD card can hold up to 64 gigabytes and is 0.5 grams. 700 terabytes would be only 5.6 kilograms.
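Spelling out that comparison (nominal figures from the comment above; the 5.6 kg figure comes from using binary terabytes):

    card_capacity_gb = 64
    card_mass_g      = 0.5
    target_tb        = 700

    cards = target_tb * 1000 / card_capacity_gb   # ~10,900 cards
    print(cards * card_mass_g / 1000)             # ~5.5 kg of microSD cards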
Can we encode all of human knowledge into the DNA of some organism? How can organisms access data stored in their DNA? Imagine being born with knowledge of every Wikipedia article, or even every website. What would that be like?
I think you're misunderstanding things a bit. In the context of DNA as a data storage medium, it no longer functions as a "building block of life". It'd be like taking some binary image data and overwriting it with binary audio data. You can still technically interpret it as image data, but it's just going to be a random jumbled mess, and will probably error out immediately.
But you can also embed interesting information into an image without distorting the image itself (much). Similarly, the DNA of most organisms has a ton of garbage space that could be used to encode information without hurting the organism.
Not saying it's practical or desirable, just that it is possible.
Leads me to wonder whether the current "garbage" sequences in our own DNA could be encoded information that we were meant to decode - and by that I mean the information may not be entirely genetic code.
They are not garbage sequences, iirc; they control gene expression. DNA is a string that is coiled tightly several times over itself (can't explain it better than that, see the video below :)). So the "garbage" sequence is like padding that pushes genes into and out of regions that can be read. Changing the amount or composition(?) of that padding can turn a gene on or off simply because it becomes inaccessible to the transcription machinery.
Then who would be in charge of creating new knowledge, and who would put it into our brains or DNA? Knowing everything humans have discovered so far would only change the types of jobs we would have; we would still need to have jobs, though.
Your parent asks about simply storing data in DNA-form, and splicing it into an organism. Splicing computer data into your DNA does not immediately fill your head with knowledge.
If it did, why, you'd never have to study protein sequences; you'd already know how to make hemoglobin!
At the point where we'd engaged in massive-scale genetic engineering to insert knowledge into our DNA chain, jobs may no longer be the significant driving force of our lives. Or, at least, I'd like to hope that we'd have been able to move beyond them.
State-of-the-art flash is manufactured at a 19 nm feature size, and an actual flash memory cell is larger than that. In contrast, the DNA double helix is only about 2 nm in diameter, and each base pair adds roughly 0.34 nm of length along the helix.
Flash drives also carry all of the equipment needed to read them. I wonder how much data per gram the bare storage die of a flash drive holds. I doubt it's better than DNA, but it seems disingenuous for the article to offer the hard-drive analogy without taking into account the extra read/write hardware.
agreed, it should be compared to other archival media like tape drives; but it's still pretty similar. in our paper we compare against a hard drive platter rather than the drive itself; even then we are approximately a million-fold more dense.
The paper is exciting, in the calm measured way that scientists are. I look forward to seeing huge data storage on DNA in the future.
I'm gently concerned about what'll happen to information if it's not available to future people. Is anyone taking the most important documents of our civilisation and encoding them onto clay tablets, or some such?
Probably because they don't have a reliable way of differentiating one strand of the double helix from the other, so the matching bases are treated as a single unit and it doesn't matter which way round you read the pair.
I imagine this could be solved by ensuring that every helix starts out with something like a byte order mark that would distinguish the two strands reliably.
Storage density could be further increased by a constant factor if more kinds of bases were used, or if they could use single-strand DNA/RNA instead (which would probably require some chemical means of ensuring that a free strand doesn't accidentally bind to something else).
Each base can only pair with one other base: adenine with thymine, and guanine with cytosine. Knowing one base implies the other, so there are only two possible pairs.
The next big hurdle is how to develop a household DNA sequencer under $50 that can read your storage. I mean, if I want to store my data on a DNA strand, then one day I'll need to read that data back at home with a sequencer, right?
To read the data out are they basically doing de novo assembly on the sequenced reads? How are they handling all of the errors in gene sequencing? How about assembly errors? Long repeats?
If your strand length is less than or equal to the read length of your sequencer, and you have the address blocks at the start of every sequence, you don't really need to worry about assembly. Read depth and/or a checksum of some kind will take care of errors in sequencing, and with short strands or compression of some kind, long repeats aren't much of a problem either.
we don't ever assemble; we are reading 115bp (96bp data, 19bp address) and merging paired reads using SeqPrep to reduce errors. a few other things are done, but the basic idea is that we just take all reads at a particular barcode and call a consensus by majority vote. 10 bit errors in 5.27e6.
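A toy sketch of that consensus step (my own illustration, not the actual pipeline, which also merges paired reads, trims primers, and so on): group reads by address barcode, then call each position by majority vote.

    from collections import Counter, defaultdict

    def consensus(reads):
        # Assumes the reads for a barcode are equal length and already aligned.
        return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

    by_barcode = defaultdict(list)
    for barcode, data in [("0001", "ACGT"), ("0001", "ACGT"), ("0001", "ACGA")]:
        by_barcode[barcode].append(data)

    print({bc: consensus(rs) for bc, rs in by_barcode.items()})  # {'0001': 'ACGT'}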
So you don't ever make a large single molecule like you'd find inside cells? You're saying you create the fragments and then read back those same fragments?
AFAIK, evolution doesn't occur because of some tendency of DNA to just suddenly change out from under you. It happens because of mutation and recombination, both of which are things that are done to the DNA.
Snarky reply: you've sequenced every single one of those cells and confirmed that the DNA all matches?
Non-snarky reply: There is a huge difference between phenotype stability and genotype stability. Take 100 cells from your body and you'll find hundreds if not thousands of genetic differences between them (single mutations).
There are 3 billion base pairs in each copy of human DNA. Looks like only ~60 of them change from one generation to the next. http://www.nytimes.com/2010/03/11/health/research/11gene.htm... Just making the 4 trillion cells in my body means that each would have to be copied an average of 41-42 times. Probably some cells get replaced more often and others have a shorter path, so I'm not sure how many times DNA is duplicated before it gets to human reproduction. But going with averages: 41 * 3,000,000,000 / 60 means 2,050,000,000 copies per error. That's a few orders of magnitude worse than a modern hard drive but not exactly shabby.
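Spelling that arithmetic out (all inputs are the comment's rough estimates):

    import math

    bp_per_genome = 3_000_000_000
    new_mutations = 60                       # ~60 de novo changes per generation
    cells_in_body = 4e12                     # the comment's estimate

    doublings = math.log2(cells_in_body)     # ~41.9 cell divisions deep
    print(doublings * bp_per_genome / new_mutations)  # ~2.1e9 bases copied per error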
The comment that spurred my comment concerned the stability of DNA.
You have to remember that offspring are only going to have the best DNA passed down to them, since many mutations would result in non-functioning gametes or non-viable offspring.
The best example I can think of is spermatozoa production. DNA is copied in that process, and a large percentage of spermatozoa are non-functional.
I know. If a germ cell has DNA that has been copied 41 times (the average that I calculated), and that DNA has accumulated 60 errors, that means 1 error per 2 billion base pairs copied.
Error correction redundancy to any level of reliability you want takes log(N) extra storage. My question would be what the access speeds are. If you have to read it by running it through a trillion PCR test tubes, this isn't exactly practical.