In The Singularity Is Near (2005) Ray Kurzweil discussed an idea for the “Document Image and Storage Invention”, or DAISI for short, but concluded it wouldn't work out. I interviewed him a few years later about this and here's what he said:
The big challenge, which I think is actually an important, almost philosophical challenge — it might sound like a dull issue, like how do you format a database so you can retrieve information, that sounds pretty technical. The real key issue is that software formats are constantly changing.
People say, “well, gee, if we could back up our brains,” and I talk about how that will be feasible some decades from now. Then the digital version of you could be immortal, but software doesn’t live forever, in fact it doesn’t live very long at all if you don’t care about it, if you don’t continually update it to new formats.
Try going back 20 years to some old formats, some old programming language. Try resuscitating some information on some PDP-1 magnetic tapes. I mean even if you could get the hardware to work, the software formats are completely alien and [using] a different operating system and nobody is there to support these formats anymore. And that continues. There is this continual change in how that information is formatted.
I think this is actually fundamentally a philosophical issue. I don’t think there’s any technical solution to it. Information actually will die if you don’t continually update it. Which means, it will die if you don’t care about it. ...
We do use standard formats, and the standard formats are continually changed, and the formats are not always backwards compatible. It’s a nice goal, but it actually doesn’t work.
I have in fact electronic information that in fact goes back through many different computer systems. Some of it now I cannot access. In theory I could, or with enough effort, find people to decipher it, but it’s not readily accessible. The more backwards you go, the more of a challenge it becomes.
And despite the goal of maintaining standards, or maintaining forward compatibility, or backwards compatibility, it doesn’t really work out that way. Maybe we will improve that. Hard documents are actually the easiest to access. Fairly crude technologies like microfilm or microfiche which basically has documents are very easy to access.
So ironically, the most primitive formats are the ones that are easiest.
In 2005 the computing world was much more in flux than it is now.
PNG is 26 years old and basically unchanged since then. Same with 30-year-old JPEG, or for those with more advanced needs the 36-year-old TIFF (though there is a newer 21-year-old revision). All three have stood the test of time against countless technologically superior formats by virtue of their ubiquity and the value of interoperability. The same could be said about 34-year-old zip or 30-year-old gzip. For executable code, the Wine-supported subset of PE/WIN32 seems to be with us for the foreseeable future, even as Windows slowly drops compatibility.
The latest Office 365 version of Word still supports opening Word 97 files as well as the slightly older WordPerfect 5 files, not to mention 36-year-old RTF files. HTML 1.0 is 30 years old and is still supported by modern browsers. PDF has also received constant updates, but I suspect 29-year-old PDF files would still display fine.
In 2005 you could look back 15 years and see a completely different computing landscape with different file formats. Look back 15 years today and not that much has changed. There are lots of exciting new competitors as always (webp, avif, zstd), but only time will tell whether they will earn a place among the others or go the way of JPEG2000 and RAR. But if you store something today in a format that's survived the last 25 years, you have a good chance of still being able to open it in common software 50 years down the line.
This is too shortsighted by archival standards. Even Word itself doesn't offer full compatibility. VB? 3rd-party active components? Other Office software integration? It's a mess. HTML and other web formats are only readable by virtue of being constantly evolved while keeping backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities). The standards will be pruned sooner or later, due to tech debt or to being sidestepped by something else. And I'm pretty sure there are plenty of obscure PDF features that will prevent many documents from being readable in a mere half century. I'm not even starting on the code and binaries. And cloud storage is simply extremely volatile by nature.
Even 50 years (laughable for a clay tablet) is still pretty darn long in the tech world. We'll still probably see the entire computing landscape, including the underlying hardware, changing fundamentally in 50 years.
Future-proofing anything is a completely different dimension. You have to provide an independent way to bootstrap, without relying on an unbroken chain of software standards, business/legal entities, and public demand for certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future-proof representations, or at least basic descriptions of what the notable thing did and what impact it had.
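A minimal sketch of what one such transformation could look like, assuming Pillow is available for the JPEG decoding step (the file names are made up); the ASCII PPM (P3) output is simple enough to re-describe from scratch in a couple of sentences:

```python
# Decode a JPEG with Pillow and re-save it as ASCII PPM (P3): a text
# header with width, height and max channel value, followed by one
# "R G B" line per pixel, a format you can describe in two sentences.
from PIL import Image

def jpeg_to_ascii_ppm(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")
    width, height = img.size
    with open(dst_path, "w") as out:
        out.write(f"P3\n{width} {height}\n255\n")
        for r, g, b in img.getdata():   # row-major (r, g, b) tuples
            out.write(f"{r} {g} {b}\n")

# jpeg_to_ascii_ppm("photo.jpg", "photo.ppm")  # hypothetical file names
```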
>Future-proofing anything is a completely different dimension. You have to provide an independent way to bootstrap, without relying on an unbroken chain of software standards, business/legal entities, and public demand for certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future-proof representations, or at least basic descriptions of what the notable thing did and what impact it had.
I'd argue that the best way would be not to do that, but to make sure the format is ubiquitous enough that the knowledge will never be lost in the first place.
That, and use formats which can be accessed and explained concisely, like "read the first X bytes into metadata field A, then read the image payload by interpreting every three bytes as an RGB triplet until EOF", so that the information can be transmitted orally, on the off chance that becomes necessary.
Hey I think I just described Windows 3.0-era PCX format :P
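Something like the following, as a sketch of the hypothetical format described above (the header layout and field sizes are made up for illustration; this isn't actual PCX):

```python
import struct

HEADER_SIZE = 8  # hypothetical: 4-byte width + 4-byte height, little-endian

def decode_raw_rgb(path: str):
    """Read the first HEADER_SIZE bytes as metadata, then interpret every
    three bytes of the remaining payload as an (R, G, B) triplet until EOF."""
    with open(path, "rb") as f:
        width, height = struct.unpack("<II", f.read(HEADER_SIZE))
        payload = f.read()
    pixels = [tuple(payload[i:i + 3]) for i in range(0, len(payload) - 2, 3)]
    return width, height, pixels
```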
> HTML and other web formats are only readable by virtue of being constantly evolved while keeping backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities).
HTML itself is relatively safe, by virtue of being based on SGML. Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.
Let me quote the late Yuri Rubinsky's foreword to The SGML Handbook (from 1990), outlining the purpose of markup languages:
> The next five years will see a revolution in computing. Users will no longer have to work at every computer task as if they had no need to share data with all their other computer tasks, they will not need to act as if the computer is simply a replacement for paper, nor will they have to appease computers or software programs that seem to be at war with one another.
However, exactly because evolving markup vocabularies requires organizing consensus, a task which W3C et al seemingly weren't up to (busy with XML, XHTML, WS-Star, and RDF-Star instead for over a decade), CSS and JS were invented and extended for the absurd purpose of basically redefining what's in the markup, which itself didn't need to change, with absolutely disastrous results for long-term readability, or even readability today on browsers other than those from the browser cartel.
> Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.
> a task which W3C et al seemingly weren't up to (busy with XML, XHTML
You realise XML/XHTML is actually delightfully simple to parse compared to WHATWG HTML?
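For what it's worth, here is a sketch of the difference using only the Python standard library (html.parser is just a rough stand-in for a full WHATWG parser, but it makes the point):

```python
# Well-formed XHTML parses with a small, strict XML parser; real-world
# "tag soup" needs a forgiving parser that implements error recovery.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
print(ET.fromstring(xhtml).tag)    # strict parse succeeds on well-formed input

soup = "<p>hi<p>unclosed <b>tags"  # fine in browsers, not well-formed XML

class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start:", tag)

TagLogger().feed(soup)             # tolerant parse succeeds
# ET.fromstring(soup) would raise a ParseError here
```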
While it's true that these standards are X years old, the software that encoded those formats yesteryear is very different from the software that decodes it today. It's a Ship of Theseus problem. They can claim an unbroken lineage since the distant future, the year 2000, but encoders and decoders had defects and opinions that were relied on--both intentionally and unintentionally--that are different from the defects and opinions of today.
I have JPEGs and MP3s from 20 years ago that don't open today.
I've found https://github.com/ImpulseAdventure/JPEGsnoop useful to fix corruption but I haven't come across a non-standard JFIF JPEG unless it was intentionally designed to accommodate non-standard features (alpha channel etc).
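As a sketch of the most basic sanity check involved (just the SOI/EOI markers, nothing close to what JPEGsnoop actually does):

```python
# Quick sanity check for a JFIF/JPEG file: the stream should begin with
# the Start Of Image marker (FF D8) and end with End Of Image (FF D9).
# A missing EOI usually means the file was truncated at some point.
def jpeg_looks_intact(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    has_soi = data.startswith(b"\xff\xd8")
    has_eoi = data.rstrip(b"\x00").endswith(b"\xff\xd9")  # tolerate zero padding
    return has_soi and has_eoi
```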
I personally never encountered JPEGs or MP3s which were totally unreadable due to being encoded by ancient software versions, but the metadata in common media formats is a total mess. Cameras and encoders write all sorts of obscure proprietary tags, or you even get things like X-Ray (the STALKER Shadow of Chernobyl game engine) keeping gameplay-relevant binary metadata in OGG Vorbis comments. Which is even technically compliant with the standard, I think, but that won't help you much.
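A sketch of how you might dump those comment tags, assuming the third-party mutagen library (the file name is made up):

```python
# Dump every Vorbis comment from an Ogg Vorbis file, including whatever
# proprietary tags an encoder or application decided to stuff in there.
from mutagen.oggvorbis import OggVorbis

audio = OggVorbis("some_track.ogg")
for key, values in (audio.tags or {}).items():
    for value in values:
        print(f"{key}={value[:60]}")  # truncate long binary-ish payloads
```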
In the case of individual files with non-conformant or corrupted elements, it seems like a fairly straightforward project to build an AI model that can fix up broken files with a single click. I suspect such a thing will be widely accessible in 10 years.
"The roots of Creo Parametric. Probably one of the last running installations of PTC's revolutionary Pro/ENGINEER Release 7 datecode 9135 installed from tape. Release 7 was released in 1991 and is - as all versions of Pro/ENGINEER - fully parametric. Files created with this version can still - directly - be opened and edited in Creo Parametric 5.0 (currently the latest version for production).
This is a raw video, no edits, and shows a bit of the original interface (menu manager, blue background, yellow and red datum planes, no modeltree)."
That is great! NX has similar compatibility, though not quite as good.
In some cases, the version-specific code for a feature is invoked because that version is encoded as part of the model data schema.
Literal encapsulation in action. That way bugs and output variances are preserved so that the regenerated model is accurate to what the software did decades ago.
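A toy sketch of the idea (not PTC's actual scheme): each stored feature carries the schema/algorithm version it was created with, and regeneration dispatches on that version so old quirks are reproduced rather than silently "fixed":

```python
# Each feature record keeps the version it was created with; regeneration
# dispatches to the matching era-specific implementation.
def regen_v7(value: float) -> float:
    return float(int(value))      # pretend Release 7 truncated toward zero

def regen_v20(value: float) -> float:
    return float(round(value))    # pretend a later release rounds instead

REGENERATORS = {7: regen_v7, 20: regen_v20}

def regenerate(feature: dict) -> float:
    return REGENERATORS[feature["schema_version"]](feature["value"])

print(regenerate({"schema_version": 7, "value": 2.9}))   # 2.0, the 1991 behaviour
print(regenerate({"schema_version": 20, "value": 2.9}))  # 3.0, the modern behaviour
```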
I can't help but think bad thoughts whenever I see another "static site maker" posted on here, or a brand new way of using JavaScript to render a web page.
Talk about taking the simplest and most durable of (web) formats and creating a hellscape of tangled complexity which becomes less and less likely to be maintainable or easy to archive the more layers of hipster js faddishness you add...
One of the claimed benefits of the JVM (and obviously later VMs) was that it would solve this issue: Java programs written in 2000 should still be able to run in 2100. And as far as I know the JVM has continued to fulfill this promise.
An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?
People routinely boot DOS in e.g. qemu. The x86 ISA is 45 years old, older if you consider the 8008/8080 part of the lineage. It's not pretty, but it's probably the most widespread backwards compatible system out there.
S/360 assembly programs probably would still run on a modern IBM mainframe.
Punched cards kept in an inert atmosphere probably would last for centuries, and along with printed documentation in archival-quality paper would allow future generations to come up with card readers and an emulator to actually run the program.
While I love the JVM, and I also think it's one of the better runtimes in terms of backwards compatibility, there have been breakages. Most of the ones I've dealt with were easy to fix. But the ease of fixing is related to the access to source code. When something in a data stream is broken, be it an MP3 or a JPEG, I guess you almost inherently need special tooling to fix it (realistically). I imagine that with an SVG it'd be easier to hand-fix it.
> An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?
I'd be tempted to target a well-known physical machine - build a bootable image of some sort as a unikernel - although in the age of VMWare etc. there's not a huge difference.
IMO the "right" way to do this would be to endow an institution to keep the program running, including keeping it updated to the "live" version of the language it's writen in, or even porting it between languages as and when that becomes necessary.
But he seems to have written this before virtual machines became widespread.
I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.
Most of this is probably technically illegal and will sometimes even have to rely on cracked versions, but nobody cares, and all the OSes and programs are still around and easy to find on the internet.
Not to mention that while file formats changed all the time early on, these days they're remarkably long-lived -- used for decades, not years.
Outdated hardware used to be more of a concern (as the original post illustrates), but so much of the important stuff we create today is in the cloud. It's ultimately being saved in redundant copies on something like S3 or Dropbox or Drive or similar, which are kept up to date. As older hardware dies, the bits are moved to newer hardware without the user even knowing.
So the problem Kurzweil talked about has basically become less of an issue as time has marched on, not more. Which is kind of nice!
>I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.
> I think the concern is becoming increasingly irrelevant
I fear we may already be past the peak on that point. With the "cloudification" where more and more software is run on servers one doesn't control, there is no way to run that software in a VM, as you don't have access to the software anymore. And even getting at the pure data for a custom backup becomes harder and harder.
I'm certain that 100 years from now, when the collapse really gets rolling, we'll still have cuneiform clay tablets complaining about Ea-nasir's shitty copper, but most of the digital information and culture we've created and tried to archive will be lost forever. Eventually, we're going to lose the infrastructure and knowledge base we need to keep updating everything; people will be too busy just trying to find food and fighting off mutants from the badlands to care.
Well, almost all early tablets are destroyed or otherwise lost now. Do you think we will lose virtually all digital age information within a century? Maybe from a massive CME, I suppose.
Clay tablets were usually used for temporary records, as you could erase them simply by smearing the clay a little bit (a lot easier than writing on papyrus). The tablets we have exist because something caused the clay to be baked into ceramic, generally some sort of catastrophic fire that accidentally preserved the records for much longer.
I know. My first iPad just stopped powering up. WTF!
I should etch something into its glass and bury it in my back yard. Perhaps a shopping list, or a complaint about how my neighbor inexplicably gets into his truck six or eight times a day and just sits there with it running.
I can see it happening. Not as a single catastrophic event but, like Rome falling bit by bit, our technological civilization fails and degenerates as climate change (in the worst possible scenario) wreaks havoc on everything.
I was able to back up and restore an old COBOL system via cpio, between modern GNU cpio (man page last updated June 2018) and SCO's cpio (c. 1989). This is neither to affirm nor contradict Kurzweil, but rather to praise the GNU userland for its solid legacy support.
This is very, very true. I have archived a number of books and magazines that were scanned and converted into "simplified" PDF, and archived on DVD disks along with C source code.
There are external dependencies but one hopes that the descriptions are sufficient to figure out how to make those work.
Actually I'd argue it's wrong precisely because we do manage to retrieve even such old artifacts. The only problem is that nobody cared for 30 years, so the process was harder than it should have been, but in the end it was possible.
Sure, there is a risk that at some point, for example, every PNG or H.264 decoder gets lost, and re-creating a decoder would then be significantly more complicated, but the chances of that are pretty slim; looking at `ffmpeg -codecs`, I'm not really worried about that ever happening.